CN113255570A - Sequential action detection method for sensing video clip relation - Google Patents

Sequential action detection method for sensing video clip relation Download PDF

Info

Publication number
CN113255570A
CN113255570A CN202110659154.1A CN202110659154A CN113255570A CN 113255570 A CN113255570 A CN 113255570A CN 202110659154 A CN202110659154 A CN 202110659154A CN 113255570 A CN113255570 A CN 113255570A
Authority
CN
China
Prior art keywords
video
global
video segments
candidate
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110659154.1A
Other languages
Chinese (zh)
Other versions
CN113255570B (en
Inventor
徐行
任燚梵
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN202110659154.1A priority Critical patent/CN113255570B/en
Publication of CN113255570A publication Critical patent/CN113255570A/en
Application granted granted Critical
Publication of CN113255570B publication Critical patent/CN113255570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of video understanding, in particular to a time sequence action detection method for sensing video clip relation, which comprises the following steps: step S1: sampling a video; step S2: performing primary feature extraction on the video; step S3: performing feature enhancement on the extracted features to generate boundary prediction of time sequence nodes, and extracting the features of all candidate video segments; step S4: capturing the relation between the candidate video segment characteristics; step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score; step S6: removing repeated candidate video segments; step S7: classifying the candidate video segments to obtain class information of the candidate video segments; more effective video segment characteristics are generated by capturing global and local relationships between them, thereby generating more effective prediction results.

Description

Sequential action detection method for sensing video clip relation
Technical Field
The invention relates to the field of video understanding, in particular to a time sequence action detection method for sensing video clip relation.
Background
In recent years, with the continuous development of streaming media, videos on various website platforms are increased explosively, compared with traditional image information, information contained in the videos is richer and gets more attention of researchers, video understanding gradually becomes a popular research field in the industry and academia, a timing action detection task is an important branch of the video, and the task is mainly to detect a timing boundary of each action instance in a long video and determine the type of the action, so that a user can be helped to locate the action information in the videos more conveniently and quickly.
The time sequence action detection task can help people to quickly determine key contents in the long-distance video, can be used as a preprocessing step for problems such as video understanding and man-machine interaction, and is widely applied to actual life: (1) the video monitoring field: with the rapid development of the internet of things, cameras are already spread in various public places such as roads, schools and the like, play an important security role, but bring extremely large video data, if the cameras are simply analyzed manually, the method is obviously unrealistic, and the time sequence action detection task can quickly extract effective information in the monitoring video, liberate a large amount of human resources, and (2) the video retrieval field: nowadays, with the popularity of videos on various social software, common users also upload and share various video data, in order to recommend videos of interest to the users, classification and labeling are required for each uploaded video, if manual processing is used, the labor cost is very high, and the time sequence action detection can detect actions in the videos, perform preliminary classification, sorting and labeling according to a preset definition, and further integrate the actions into a subsequent video retrieval and video recommendation related algorithm.
Due to the wide application of the time sequence action detection task in the industry and academia, many effective algorithms are proposed, and at present, the algorithms can be roughly divided into two types: one-stage and two-stage:
1) one-stage: firstly, detecting video segments with different lengths by utilizing a hierarchical structure, a cascade structure and the like, and simultaneously giving the category information of the video segment in the prediction process;
2) two-stage: firstly, extracting all possible candidate video segments, and then classifying the candidate video segments by using some classical classifiers (such as Unet) to obtain the class information of the candidate video segments;
although the two-stage based methods have achieved good results, they generate the candidate video segments individually in the process of generating the candidate video segments, and ignore the relationship between the video segments.
Disclosure of Invention
Based on the above problems, the present invention provides a time sequence action detection method for sensing video segment relation, which generates more effective video segment characteristics by capturing global relation and local relation between them, thereby generating more effective prediction results.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a time sequence action detection method for sensing video clip relation comprises the following steps:
step S1: sampling a video;
step S2: performing primary feature extraction on the video by using a TSN (time delay network) model to obtain features
Figure 421825DEST_PATH_IMAGE001
Step S3: using BaseNet model to pair extracted features
Figure 670404DEST_PATH_IMAGE001
Feature enhancement to generate boundary prediction for time series nodes
Figure 206558DEST_PATH_IMAGE002
And
Figure 958614DEST_PATH_IMAGE003
in addition, all the characteristics of the candidate video segments are extracted and recorded
Figure 577814DEST_PATH_IMAGE004
Wherein
Figure 579268DEST_PATH_IMAGE004
Each element representing a feature of a candidate video segment;
step S4: capturing candidate video segment features using global perception module and feature enhancement module
Figure 184693DEST_PATH_IMAGE004
The relationship between;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: and classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
Further, the step S2 specifically includes the following steps:
step S21: firstly, acquiring a long video to a certain number of video segments according to a certain time interval;
step S22: and inputting the video segment into the TSN model, respectively acquiring the visual characteristic and the action characteristic, and connecting the visual characteristic and the action characteristic.
Further, the step S3 specifically includes the following steps:
step S31: establishing the relation among all video frames by utilizing graph convolution, and dynamically fusing multi-scale context semantic information to video characteristics;
step S32: predicting each time sequence position by graph convolution, outputting the possibility of the time sequence position being a starting node or an ending node, namely generating the boundary prediction of the time sequence node
Figure 119151DEST_PATH_IMAGE002
And
Figure 784618DEST_PATH_IMAGE003
and, in addition, extracting all the characteristics of the candidate video segment and outputting
Figure 742210DEST_PATH_IMAGE005
Further, the global sensing module specifically includes:
designing a global perception unit to establish a relationship between candidate video segments in the same row and column for input
Figure 154256DEST_PATH_IMAGE006
First, the
Figure 615324DEST_PATH_IMAGE006
Inputting into two parallel paths, each path respectively comprising a vertical pooling layer and a horizontal pooling layer
Figure 576327DEST_PATH_IMAGE007
The calculation formula is as follows:
Figure 958898DEST_PATH_IMAGE008
wherein,
Figure 702863DEST_PATH_IMAGE007
to represent
Figure 877492DEST_PATH_IMAGE010
The results after being input to the horizontal pooling,
Figure 619183DEST_PATH_IMAGE011
representing all possible start times in the video,
Figure 879263DEST_PATH_IMAGE012
is shown as
Figure 364602DEST_PATH_IMAGE012
A plurality of channels, each of which is provided with a plurality of channels,
Figure 393738DEST_PATH_IMAGE013
representing the start time of a video segment as
Figure 306331DEST_PATH_IMAGE013
Figure 256969DEST_PATH_IMAGE014
Representing a video segment of duration
Figure 342737DEST_PATH_IMAGE014
,
Figure 429641DEST_PATH_IMAGE015
Then it indicates to
Figure 903348DEST_PATH_IMAGE013
Is a start time
Figure 744878DEST_PATH_IMAGE014
For a duration of a video segment
Figure 696653DEST_PATH_IMAGE012
The value of each channel;
for vertically pooled
Figure 575747DEST_PATH_IMAGE007
The formula is as follows:
Figure 954776DEST_PATH_IMAGE016
wherein,
Figure 286532DEST_PATH_IMAGE017
to represent
Figure 41998DEST_PATH_IMAGE010
The results after being input to the vertical pooling,
Figure 775599DEST_PATH_IMAGE018
representing all possible durations of the video segments,
Figure 591108DEST_PATH_IMAGE012
is shown as
Figure 675739DEST_PATH_IMAGE012
A plurality of channels, each of which is provided with a plurality of channels,
Figure 969317DEST_PATH_IMAGE013
representing the start time of a video segment as
Figure 557424DEST_PATH_IMAGE013
Figure 543835DEST_PATH_IMAGE014
Representing a video segment of duration
Figure 850182DEST_PATH_IMAGE014
,
Figure 150713DEST_PATH_IMAGE019
Then it indicates to
Figure 717961DEST_PATH_IMAGE013
Is a start time
Figure 750639DEST_PATH_IMAGE014
For a duration of a video segment
Figure 668917DEST_PATH_IMAGE012
The value of each channel;
then, the information of the current position and the neighbor thereof is aggregated by using the one-dimensional convolution operation with the convolution kernel of 3, and then the outputs of the two paths are fused to obtain a fusion result
Figure 170874DEST_PATH_IMAGE020
The following are:
Figure 858208DEST_PATH_IMAGE021
the output of the global sensing unit is then:
Figure 327366DEST_PATH_IMAGE022
wherein,
Figure 467361DEST_PATH_IMAGE023
represents the result of the operation of the global sensing unit,
Figure 47378DEST_PATH_IMAGE024
which represents the input to the unit or units,
Figure 589217DEST_PATH_IMAGE025
it is shown that the activation function is,
Figure 963698DEST_PATH_IMAGE026
represents a convolution operation;
repeating the global sensing unit twice to obtain a global sensing module, wherein the calculation formula is as follows:
Figure 590988DEST_PATH_IMAGE027
wherein,
Figure 443538DEST_PATH_IMAGE028
the output of the global perception module is represented,
Figure 105463DEST_PATH_IMAGE029
representing a global sense unit operation that is,
Figure 978742DEST_PATH_IMAGE026
representing a convolution operation.
Further, the feature enhancement module is specifically as follows:
a hierarchy is used to capture local information between candidate video segments for input
Figure 968694DEST_PATH_IMAGE006
First, downsampling is used to aggregate features between adjacent candidate boxes, and the calculation formula is as follows:
Figure 421672DEST_PATH_IMAGE030
when in use
Figure 141367DEST_PATH_IMAGE031
When the temperature of the water is higher than the set temperature,
Figure 388808DEST_PATH_IMAGE032
is to input
Figure 990691DEST_PATH_IMAGE033
If the global perception module is embedded in the feature enhancement module and all layers in the feature enhancement module share the same global perception module, the output of each layer
Figure 244430DEST_PATH_IMAGE034
Can be calculated according to the following formula:
Figure 818631DEST_PATH_IMAGE035
Figure 705816DEST_PATH_IMAGE036
represents the output result of each layer of the global perception module,
Figure 529415DEST_PATH_IMAGE038
representing the aforementioned global sensing module operations;
then, the upsampling operation is used to aggregate features between different layers, which is calculated as follows:
Figure 793037DEST_PATH_IMAGE039
wherein,
Figure 18482DEST_PATH_IMAGE040
the representation is fused with different layers
Figure 76568DEST_PATH_IMAGE013
The output result of the layer is then,
Figure 653043DEST_PATH_IMAGE041
it is indicated that the up-sampling operation,
Figure 454777DEST_PATH_IMAGE042
the second of the aforementioned global perception modules
Figure 534728DEST_PATH_IMAGE013
And +1 layer outputs the result.
Further, the steps S4 to S5 specifically include the following steps:
step S41: using average pooling polymerization
Figure 294874DEST_PATH_IMAGE004
The relationship between adjacent video segments in the video stream;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
Further, the step S6 specifically includes the following steps:
step S61: generating all video segment scores and sorting according to the size;
step S62: selecting the video segments with larger scores, selecting the video segments with larger overlapping degree, attenuating the scores of the video segments, and repeating the process in sequence until a specific number of candidate video segments are reserved.
Compared with the prior art, the invention has the beneficial effects that: capturing the video frequency range relationship from different angles (local information and global information), establishing a relationship for a remote candidate video from a global view angle by using the average pooling characteristic for the global information, aggregating information between adjacent candidate video segments by using a hierarchical structure for the local information due to the distribution characteristic of the candidate video segments, and achieving a better detection result by the whole model due to the complementation of the local information and the global information;
for the prediction of candidate video segments, if they are processed separately, a constraint relation between them is ignored, and by exploring the relation between candidate video segments, more complete and accurate results can be generated, for different candidate video segments in the same video, they are often highly correlated, the relation of all candidate video segments is established, the characteristics of action instances can be enhanced by using background information, and for adjacent candidate video segments, there is a great amount of overlap between them, and effective information between them is aggregated by using average pooling, so as to generate more accurate results.
Drawings
Fig. 1 is a flowchart of this embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
In this embodiment, the system mainly includes a global perception module and a feature enhancement module, where:
in the global perception module, GA units are designed to establish the relation between candidate video segments in the same row and column, and for input
Figure 93066DEST_PATH_IMAGE006
First, the
Figure 167332DEST_PATH_IMAGE006
Input into two parallel paths, eachOne path comprising a vertical pooling layer and a horizontal pooling layer, respectively, for a horizontally pooled one
Figure 367369DEST_PATH_IMAGE007
The calculation formula is as follows:
Figure 32837DEST_PATH_IMAGE008
wherein,
Figure 52746DEST_PATH_IMAGE007
to represent
Figure 464791DEST_PATH_IMAGE010
The results after being input to the horizontal pooling,
Figure 519335DEST_PATH_IMAGE011
representing all possible start times in the video,
Figure 90124DEST_PATH_IMAGE012
is shown as
Figure 862908DEST_PATH_IMAGE012
A plurality of channels, each of which is provided with a plurality of channels,
Figure 810136DEST_PATH_IMAGE013
representing the start time of a video segment as
Figure 719186DEST_PATH_IMAGE013
Figure 726456DEST_PATH_IMAGE014
Representing a video segment of duration
Figure 720957DEST_PATH_IMAGE014
,
Figure 471875DEST_PATH_IMAGE015
Then it indicates to
Figure 501011DEST_PATH_IMAGE013
Is a start time
Figure 679183DEST_PATH_IMAGE014
For a duration of a video segment
Figure 426559DEST_PATH_IMAGE012
The value of each channel;
for vertically pooled
Figure 715589DEST_PATH_IMAGE007
The formula is as follows:
Figure 599231DEST_PATH_IMAGE016
wherein,
Figure 682725DEST_PATH_IMAGE017
to represent
Figure 651818DEST_PATH_IMAGE010
The results after being input to the vertical pooling,
Figure 741609DEST_PATH_IMAGE018
representing all possible durations of the video segments,
Figure 479758DEST_PATH_IMAGE012
is shown as
Figure 999732DEST_PATH_IMAGE012
A plurality of channels, each of which is provided with a plurality of channels,
Figure 659384DEST_PATH_IMAGE013
representing the start time of a video segment as
Figure 414850DEST_PATH_IMAGE013
Figure 148451DEST_PATH_IMAGE014
Display viewThe frequency band has a duration of
Figure 636064DEST_PATH_IMAGE014
,
Figure 48591DEST_PATH_IMAGE019
Then it indicates to
Figure 483114DEST_PATH_IMAGE013
Is a start time
Figure 399118DEST_PATH_IMAGE014
For a duration of a video segment
Figure 995315DEST_PATH_IMAGE012
The value of each channel;
then, the information of the current position and the neighbor thereof is aggregated by using the one-dimensional convolution operation with the convolution kernel of 3, and then the outputs of the two paths are fused to obtain a fusion result
Figure 691876DEST_PATH_IMAGE020
The following are:
Figure 664511DEST_PATH_IMAGE021
the output of the global sensing unit is then:
Figure 231759DEST_PATH_IMAGE022
wherein,
Figure 279085DEST_PATH_IMAGE023
represents the result of the operation of the global sensing unit,
Figure 931783DEST_PATH_IMAGE024
which represents the input to the unit or units,
Figure 708110DEST_PATH_IMAGE025
presentation activation letterThe number of the first and second groups is,
Figure 598705DEST_PATH_IMAGE026
represents a convolution operation;
repeating the global sensing unit twice to obtain a global sensing module, wherein the calculation formula is as follows:
Figure 926918DEST_PATH_IMAGE027
wherein,
Figure 942279DEST_PATH_IMAGE028
the output of the global perception module is represented,
Figure 381351DEST_PATH_IMAGE029
representing a global sense unit operation that is,
Figure 798557DEST_PATH_IMAGE026
represents a convolution operation;
in addition, in the feature enhancement module, a hierarchy is used to capture local information between candidate video segments. For input
Figure 297671DEST_PATH_IMAGE043
First, downsampling is used to aggregate features between adjacent candidate boxes, and the calculation formula is as follows:
Figure 800328DEST_PATH_IMAGE044
when in use
Figure 43090DEST_PATH_IMAGE045
When the temperature of the water is higher than the set temperature,
Figure 845961DEST_PATH_IMAGE046
is to input
Figure 250398DEST_PATH_IMAGE047
Can only be polymerized to due to the limitation of average poolingThe relationship between adjacent candidate frames cannot capture the long-distance dependency relationship, so that the global perception module is embedded into the feature enhancement module, all layers in the feature enhancement module share the same global perception module, and the output of each layer
Figure 240350DEST_PATH_IMAGE049
Can be calculated according to the following formula:
Figure 286804DEST_PATH_IMAGE050
by doing so, the relationship between all candidate video segments can be established from two complementary levels (local and global), and then the upsampling operation is used to aggregate features between different levels, since each level is supervised by label information, the fusion of information of different levels can reduce the generation of noise to the maximum extent, and the calculation formula is as follows:
Figure 678602DEST_PATH_IMAGE051
thus, the relationship between different candidate video segments can be captured from different levels and different scales.
Based on the above, the method for detecting the time sequence action of sensing the relation of the video segments as shown in fig. 1 includes the following steps:
step S1: sampling a video;
selecting a proper training and testing data set, wherein the training and testing are mainly performed on the public data sets ActivityNet-1.3 and THUMOS-14;
the ActivityNet-1.3 dataset is an open dataset for video segment generation and detection, which contains mainly 19994 videos and contains 200 action classes, these videos are mainly crawled from youtube website, all of which are different in resolution and time, it was an ActivityNet change 2016 and 2017 game dataset, which divides all videos into training, evaluation and test sets in a 2:1:1 ratio;
the THIMOS-14 data set contains 413 videos and contains 20 categories of information, wherein the test set contains 212 videos, the validation set contains 200 videos used for the time series motion detection task, the entire model is trained on the validation set and the performance of the entire model is evaluated on the tester.
Step S2: performing primary feature extraction on the video by using a TSN (time delay network) model to obtain features
Figure 519519DEST_PATH_IMAGE052
Firstly, for an unprocessed long video, extracting a corresponding video frame to represent the video frame as
Figure 728259DEST_PATH_IMAGE053
Wherein
Figure 312824DEST_PATH_IMAGE054
Representing the total number of video frames,
Figure 559129DEST_PATH_IMAGE055
the nth frame represented in the video, for which the set of tags may be represented as
Figure 570947DEST_PATH_IMAGE056
Wherein
Figure 535492DEST_PATH_IMAGE057
Indicating the number of motion video segments contained in a video,
Figure 658169DEST_PATH_IMAGE058
and
Figure 24559DEST_PATH_IMAGE059
respectively represent
Figure 207279DEST_PATH_IMAGE061
The start time, end time and class information of each label are used only during training, and then the label is usedThe TSN model is used to extract the characteristics of each video, and the characteristics are expressed as
Figure 393541DEST_PATH_IMAGE062
Step S3: feature enhancement of extracted features using BaseNet model, resulting in boundary prediction of time series nodes
Figure 319908DEST_PATH_IMAGE063
And
Figure 540805DEST_PATH_IMAGE064
in addition, the characteristics of all candidate video segments are extracted and output
Figure 894426DEST_PATH_IMAGE065
Wherein features are aligned using graph convolution (including GCN)
Figure 567984DEST_PATH_IMAGE052
Enhancing and obtaining the characteristics of richer semantic information, wherein the calculation formula is as follows:
Figure 32464DEST_PATH_IMAGE066
wherein,
Figure 373446DEST_PATH_IMAGE067
representing features enhanced by graph convolution
Figure 897968DEST_PATH_IMAGE068
Then, the feature
Figure 61752DEST_PATH_IMAGE067
Shared by two branch networks, one of which is used to determine whether each timing position is a start node or an end node, which may be denoted as
Figure 595502DEST_PATH_IMAGE069
And
Figure 259832DEST_PATH_IMAGE070
wherein
Figure 955256DEST_PATH_IMAGE072
Representing the length of a sequence of video features that will output the features of all candidate video segments for another branching network
Figure 603406DEST_PATH_IMAGE073
Step S4: capturing candidate video segment features using global perception module and feature enhancement module
Figure 675267DEST_PATH_IMAGE065
The relationship between;
for the output in the above step S3
Figure 459684DEST_PATH_IMAGE065
Each position of which represents the characteristics of a candidate video segment, and inputting the characteristics into the characteristics enhancing module, the relationship between the candidate video segments can be captured from the local view and the global view, the characteristics of the action instance can be enhanced, and the background information can be suppressed, so that more accurate and more complete results can be generated, and the output is
Figure 857167DEST_PATH_IMAGE074
Then use
Figure 727034DEST_PATH_IMAGE075
The final result of each candidate video segment is predicted by the convolution of (a):
Figure 602586DEST_PATH_IMAGE076
wherein,
Figure 241509DEST_PATH_IMAGE077
respectively representing two types of prediction results obtained by regression and classification supervision,
Figure 544314DEST_PATH_IMAGE078
it is shown that the activation function is,
Figure 901477DEST_PATH_IMAGE079
representing a convolution operation.
Step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
in order to fully utilize the output result of the whole model, each layer of output in the feature enhancement module is fused into the final score of the candidate video segment, and in addition, the boundary information of each video segment is considered, so that the output of each layer of output in the feature enhancement module is combined into the final score of the candidate video segment
Figure 518404DEST_PATH_IMAGE081
To
Figure 402046DEST_PATH_IMAGE083
Video segment of (1), score thereof
Figure 751119DEST_PATH_IMAGE085
The calculation formula of (a) is as follows:
Figure 720212DEST_PATH_IMAGE086
Figure 810003DEST_PATH_IMAGE087
wherein,
Figure 548152DEST_PATH_IMAGE088
Figure 68126DEST_PATH_IMAGE089
and
Figure 790094DEST_PATH_IMAGE090
all the ingredients are super-ginseng,
Figure 420927DEST_PATH_IMAGE091
is shown as
Figure 279162DEST_PATH_IMAGE091
The output of the layer(s) is,
Figure 704458DEST_PATH_IMAGE092
the total number of layers of the hierarchical result is indicated.
Step S6: removing repeated candidate video segments by using a Soft-NMS model;
after all possible candidate frames are acquired, as most of the possible candidate frames have large overlap, the Soft-NMS model is used for removing the candidate frames again, and the scores of the candidate video segments are counted
Figure 382564DEST_PATH_IMAGE093
Sorting according to the size of the video segments, selecting the candidate video segment with the largest score, and then calculating the iou of the candidate video segment with the other video segments, wherein the video segments with high overlapping degree are attenuated according to the following formula.
Figure 817087DEST_PATH_IMAGE094
Wherein
Figure 529828DEST_PATH_IMAGE095
The parameters representing the gaussian function are then calculated,
Figure 657184DEST_PATH_IMAGE096
which represents a pre-defined threshold value that is,
Figure 822587DEST_PATH_IMAGE097
respectively representing the currently selected candidate video segment and other video segments.
Step S7: and classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
After all possible candidate video segments are obtained, use is made ofThe Unet classifier classifies the video segments to obtain the final class information, which can be expressed as
Figure 795222DEST_PATH_IMAGE098
Wherein
Figure 628049DEST_PATH_IMAGE099
As the information of the category, it is,
Figure 660727DEST_PATH_IMAGE100
is the score to which it corresponds,
Figure 516687DEST_PATH_IMAGE101
to predict the number of instances of an action.
Example 2
In this embodiment, the overall model in embodiment 1 needs to be trained, and the overall loss function is represented as:
Figure 80962DEST_PATH_IMAGE102
wherein
Figure 971558DEST_PATH_IMAGE103
Is super ginseng and used
Figure 299771DEST_PATH_IMAGE104
To determine whether each time node is a start node or an end node, which can be expressed as:
Figure 315131DEST_PATH_IMAGE105
Figure 754203DEST_PATH_IMAGE107
and
Figure 171409DEST_PATH_IMAGE108
respectively representing weighted cross entropy loss functions,
Figure 670523DEST_PATH_IMAGE109
respectively expressed as the prediction result of the start node and its corresponding label,
Figure 173180DEST_PATH_IMAGE110
respectively representing the prediction result of the end node and the corresponding label, and training the feature enhancement module, wherein the label is generated for the 0 th layer in the feature enhancement module and is represented as
Figure 415942DEST_PATH_IMAGE111
Since the feature enhancement module is a hierarchical structure and each layer is supervised, a label is generated for each layer using the following formula.
Figure 687655DEST_PATH_IMAGE112
Thus, the loss function for each layer can be defined as:
Figure 357671DEST_PATH_IMAGE113
wherein,
Figure 347623DEST_PATH_IMAGE114
a loss function for each layer is represented,
Figure 394077DEST_PATH_IMAGE115
respectively representing the prediction result obtained by each layer by using regression supervision and the corresponding label,
Figure 785875DEST_PATH_IMAGE116
respectively representing the prediction result obtained by each layer by using classification supervision and the corresponding label.
Wherein
Figure 626792DEST_PATH_IMAGE118
And
Figure 104041DEST_PATH_IMAGE119
respectively represent the square difference loss and the weighted cross entropy loss, then
Figure 688606DEST_PATH_IMAGE120
Wherein
Figure DEST_PATH_IMAGE121
a loss function representing the hierarchy is used to,
Figure 463139DEST_PATH_IMAGE091
is shown as
Figure 350324DEST_PATH_IMAGE091
A layer of a material selected from the group consisting of,
Figure 439503DEST_PATH_IMAGE122
the number of total layers is indicated,
Figure DEST_PATH_IMAGE123
representing hyper-parameters, and finally, loss functions
Figure 968704DEST_PATH_IMAGE124
Optimization can be done in an end-to-end fashion.
Example 3
In this embodiment, the validity of the method is verified by using the selected data set, which specifically includes:
the method is verified on the selected data set, in order to well evaluate the effectiveness of the embodiment, the average accuracy mAP is selected as a main evaluation index, the mAP is respectively calculated on an iou set {0.3,0.4,0.5,0.6 and 0.7} on the THUMOS-14 data set, the mAP on the iou set {0.5,0.75 and 0.95} is calculated on an activityNet1.3 data set, and in addition, the average mAP of ten different ious is calculated on the activityNet 1.3.
The present embodiment performs verification on the data set ActivityNet-1.3 of the current main stream, and the final verification result is shown in the following table (comparing the model performance on the ActivityNet-1.3 data set (%):
TABLE 1 comparison of model Performance on ActivinyNet-1.3 dataset
Figure DEST_PATH_IMAGE125
The present example performed verification on the currently mainstream data set thumb-14, and the final verification results are shown in the following table (model performance vs. (%) on thumb-14 data set).
TABLE 2 modeling Performance on THUMOS-14 dataset vs
Figure 600674DEST_PATH_IMAGE126
The above is an embodiment of the present invention. The specific parameters in the above embodiments and examples are only for the purpose of clearly illustrating the invention verification process of the inventor and are not intended to limit the scope of the invention, which is defined by the claims, and all equivalent structural changes made by using the contents of the specification and the drawings of the present invention should be covered by the scope of the present invention.

Claims (7)

1. A method for detecting time sequence action of sensing video clip relation is characterized by comprising the following steps:
step S1: sampling a video;
step S2: performing primary feature extraction on the video by using a TSN (time delay network) model to obtain features
Figure 342145DEST_PATH_IMAGE001
Step S3: using BaseNet model to pair extracted features
Figure 983342DEST_PATH_IMAGE001
Feature enhancement to generate boundary prediction for time series nodes
Figure 872801DEST_PATH_IMAGE002
And
Figure 814212DEST_PATH_IMAGE003
in addition, all the characteristics of the candidate video segments are extracted and recorded
Figure 865344DEST_PATH_IMAGE004
Wherein
Figure 760881DEST_PATH_IMAGE004
Each element representing a feature of a candidate video segment;
step S4: capturing candidate video segment features using global perception module and feature enhancement module
Figure 454031DEST_PATH_IMAGE004
The relationship between;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: and classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
2. The method according to claim 1, wherein said method comprises: the step S2 specifically includes the following steps:
step S21: firstly, acquiring a long video to a certain number of video segments according to a certain time interval;
step S22: and inputting the video segment into the TSN model, respectively acquiring the visual characteristic and the action characteristic, and connecting the visual characteristic and the action characteristic.
3. The method according to claim 1, wherein said method comprises: the step S3 specifically includes the following steps:
step S31: establishing the relation among all video frames by utilizing graph convolution, and dynamically fusing multi-scale context semantic information to video characteristics;
step S32: predicting each time sequence position by graph convolution, outputting the possibility of the time sequence position being a starting node or an ending node, namely generating the boundary prediction of the time sequence node
Figure 984369DEST_PATH_IMAGE002
And
Figure 737562DEST_PATH_IMAGE003
and, in addition, extracting all the characteristics of the candidate video segment and outputting
Figure 353351DEST_PATH_IMAGE005
4. The method according to claim 1, wherein said method comprises: the global perception module specifically comprises the following modules:
designing a global perception unit to establish a relationship between candidate video segments in the same row and column for input
Figure 584612DEST_PATH_IMAGE006
First, the
Figure 235036DEST_PATH_IMAGE006
Inputting into two parallel paths, each path respectively comprising a vertical pooling layer and a horizontal pooling layer
Figure 392086DEST_PATH_IMAGE007
The calculation formula is as follows:
Figure 760750DEST_PATH_IMAGE008
wherein,
Figure 530123DEST_PATH_IMAGE007
to represent
Figure 35054DEST_PATH_IMAGE009
The results after being input to the horizontal pooling,
Figure 864470DEST_PATH_IMAGE010
representing all possible start times in the video,
Figure 454851DEST_PATH_IMAGE011
is shown as
Figure 517661DEST_PATH_IMAGE011
A plurality of channels, each of which is provided with a plurality of channels,
Figure 877098DEST_PATH_IMAGE012
representing the start time of a video segment as
Figure 142994DEST_PATH_IMAGE012
Figure 955092DEST_PATH_IMAGE013
Representing a video segment of duration
Figure 331847DEST_PATH_IMAGE013
,
Figure 545791DEST_PATH_IMAGE014
Then it indicates to
Figure 481124DEST_PATH_IMAGE012
Is a start time
Figure 46097DEST_PATH_IMAGE013
For a duration of a video segment
Figure 960963DEST_PATH_IMAGE011
The value of each channel;
for vertically pooled
Figure 294993DEST_PATH_IMAGE007
The formula is as follows:
Figure 902692DEST_PATH_IMAGE015
wherein,
Figure 954961DEST_PATH_IMAGE016
to represent
Figure 407939DEST_PATH_IMAGE009
The results after being input to the vertical pooling,
Figure 363519DEST_PATH_IMAGE017
representing all possible durations of the video segments,
Figure 876540DEST_PATH_IMAGE011
is shown as
Figure 416106DEST_PATH_IMAGE011
A plurality of channels, each of which is provided with a plurality of channels,
Figure 938354DEST_PATH_IMAGE012
representing the start time of a video segment as
Figure 981397DEST_PATH_IMAGE012
Figure 930898DEST_PATH_IMAGE013
Representing a video segment of duration
Figure 426601DEST_PATH_IMAGE013
,
Figure 251076DEST_PATH_IMAGE018
Then it indicates to
Figure 414204DEST_PATH_IMAGE012
Is a start time
Figure 269027DEST_PATH_IMAGE013
For a duration of a video segment
Figure 783185DEST_PATH_IMAGE011
The value of each channel;
then, the information of the current position and the neighbor thereof is aggregated by using the one-dimensional convolution operation with the convolution kernel of 3, and then the outputs of the two paths are fused to obtain a fusion result
Figure 381657DEST_PATH_IMAGE019
The following are:
Figure 399291DEST_PATH_IMAGE020
the output of the global sensing unit is then:
Figure 956175DEST_PATH_IMAGE021
wherein,
Figure 927935DEST_PATH_IMAGE022
represents the result of the operation of the global sensing unit,
Figure 330098DEST_PATH_IMAGE023
which represents the input to the unit or units,
Figure 202239DEST_PATH_IMAGE024
it is shown that the activation function is,
Figure 930023DEST_PATH_IMAGE025
represents a convolution operation;
repeating the global sensing unit twice to obtain a global sensing module, wherein the calculation formula is as follows:
Figure 622036DEST_PATH_IMAGE026
wherein,
Figure 93468DEST_PATH_IMAGE027
the output of the global perception module is represented,
Figure 318651DEST_PATH_IMAGE028
representing a global sense unit operation that is,
Figure 951758DEST_PATH_IMAGE025
representing a convolution operation.
5. The method according to claim 4, wherein said method comprises: the characteristic enhancement module is specifically as follows:
a hierarchy is used to capture local information between candidate video segments for input
Figure 396646DEST_PATH_IMAGE006
First, downsampling is used to aggregate features between adjacent candidate boxes, and the calculation formula is as follows:
Figure 406190DEST_PATH_IMAGE029
when in use
Figure 987344DEST_PATH_IMAGE030
When the temperature of the water is higher than the set temperature,
Figure 56931DEST_PATH_IMAGE031
is to input
Figure 225001DEST_PATH_IMAGE032
If the global perception module is embedded in the feature enhancement module and all layers in the feature enhancement module share the same global perception module, the output of each layer
Figure 38236DEST_PATH_IMAGE033
Can be calculated according to the following formula:
Figure 739476DEST_PATH_IMAGE034
Figure 714385DEST_PATH_IMAGE035
represents the output result of each layer of the global perception module,
Figure 399444DEST_PATH_IMAGE036
representing the aforementioned global sensing module operations;
then, the upsampling operation is used to aggregate features between different layers, which is calculated as follows:
Figure 485212DEST_PATH_IMAGE037
wherein,
Figure 306537DEST_PATH_IMAGE038
the representation is fused with different layers
Figure 685304DEST_PATH_IMAGE012
The output result of the layer is then,
Figure 592080DEST_PATH_IMAGE039
it is indicated that the up-sampling operation,
Figure 481539DEST_PATH_IMAGE040
the second of the aforementioned global perception modules
Figure 157371DEST_PATH_IMAGE012
And +1 layer outputs the result.
6. The method according to claim 5, wherein said method comprises: the steps S4 to S5 specifically include the following steps:
step S41: using average pooling polymerization
Figure 474082DEST_PATH_IMAGE004
The relationship between adjacent video segments in the video stream;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
7. The method according to claim 1, wherein said method comprises: the step S6 specifically includes the following steps:
step S61: generating all video segment scores and sorting according to the size;
step S62: selecting the video segments with larger scores, selecting the video segments with larger overlapping degree, attenuating the scores of the video segments, and repeating the process in sequence until a specific number of candidate video segments are reserved.
CN202110659154.1A 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation Active CN113255570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Publications (2)

Publication Number Publication Date
CN113255570A true CN113255570A (en) 2021-08-13
CN113255570B CN113255570B (en) 2021-09-24

Family

ID=77187848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659154.1A Active CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Country Status (1)

Country Link
CN (1) CN113255570B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920467A (en) * 2021-12-13 2022-01-11 成都考拉悠然科技有限公司 Tourist and commercial detection method and system combining booth detection and scene segmentation

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180350144A1 (en) * 2018-07-27 2018-12-06 Yogesh Rathod Generating, recording, simulating, displaying and sharing user related real world activities, actions, events, participations, transactions, status, experience, expressions, scenes, sharing, interactions with entities and associated plurality types of data in virtual world
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111372123A (en) * 2020-03-03 2020-07-03 南京信息工程大学 Video time sequence segment extraction method based on local to global
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
US20180350144A1 (en) * 2018-07-27 2018-12-06 Yogesh Rathod Generating, recording, simulating, displaying and sharing user related real world activities, actions, events, participations, transactions, status, experience, expressions, scenes, sharing, interactions with entities and associated plurality types of data in virtual world
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111372123A (en) * 2020-03-03 2020-07-03 南京信息工程大学 Video time sequence segment extraction method based on local to global
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAISHENG SU等: "BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation", 《ARXIV》 *
JINGRAN ZHANG等: "Cooperative Cross-Stream Network for Discriminative Action Representation", 《ARXIV》 *
ZHIWU QING等: "Temporal Context Aggregation Network for Temporal Action Proposal Refinement", 《ARXIV》 *
李怡颖: "基于深度学习的视频人体动作识别研究", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920467A (en) * 2021-12-13 2022-01-11 成都考拉悠然科技有限公司 Tourist and commercial detection method and system combining booth detection and scene segmentation
CN113920467B (en) * 2021-12-13 2022-03-15 成都考拉悠然科技有限公司 Tourist and commercial detection method and system combining booth detection and scene segmentation

Also Published As

Publication number Publication date
CN113255570B (en) 2021-09-24

Similar Documents

Publication Publication Date Title
Qi et al. Exploiting multi-domain visual information for fake news detection
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
Liu et al. Generalized zero-shot learning for action recognition with web-scale video data
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
Wang et al. Spatial–temporal pooling for action recognition in videos
WO2021082589A1 (en) Content check model training method and apparatus, video content check method and apparatus, computer device, and storage medium
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
CN107341508B (en) Fast food picture identification method and system
CN111783712A (en) Video processing method, device, equipment and medium
CN113010705A (en) Label prediction method, device, equipment and storage medium
CN114092819B (en) Image classification method and device
CN113255570B (en) Sequential action detection method for sensing video clip relation
CN114741599A (en) News recommendation method and system based on knowledge enhancement and attention mechanism
CN112364852B (en) Action video segment extraction method fusing global information
CN117743611B (en) Automatic classification system for digital media content
CN115761599A (en) Video anomaly detection method and system
Cao et al. Adaptive receptive field U-shaped temporal convolutional network for vulgar action segmentation
Jayanthiladevi et al. Text, images, and video analytics for fog computing
Liu et al. Infmae: A foundation model in infrared modality
CN113033500B (en) Motion segment detection method, model training method and device
Lin et al. Temporal action localization with two-stream segment-based RNN
Das et al. A comparative analysis and study of a fast parallel cnn based deepfake video detection model with feature selection (fpc-dfm)
Batta et al. Predicting popularity of YouTube videos using viewer engagement features
Saddam Bekhet A comparative study for video classification techniques using direct features matching, machine learning, and deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant