CN113255570B - Sequential action detection method for sensing video clip relation - Google Patents

Sequential action detection method for sensing video clip relation

Info

Publication number
CN113255570B
CN113255570B (application CN202110659154.1A)
Authority
CN
China
Prior art keywords
video
global
video segments
candidate
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110659154.1A
Other languages
Chinese (zh)
Other versions
CN113255570A (en)
Inventor
徐行
任燚梵
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN202110659154.1A priority Critical patent/CN113255570B/en
Publication of CN113255570A publication Critical patent/CN113255570A/en
Application granted granted Critical
Publication of CN113255570B publication Critical patent/CN113255570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention relates to the field of video understanding, and in particular to a time sequence action detection method for sensing video clip relations, which comprises the following steps: step S1: sampling a video; step S2: performing preliminary feature extraction on the video; step S3: performing feature enhancement on the extracted features, generating boundary predictions for the time sequence nodes, and extracting the features of all candidate video segments; step S4: capturing the relations between the candidate video segment features; step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score; step S6: removing repeated candidate video segments; step S7: classifying the candidate video segments to obtain their category information. By capturing the global and local relations between candidate video segments, more effective video segment features are generated, thereby producing more accurate prediction results.

Description

Sequential action detection method for sensing video clip relation
Technical Field
The invention relates to the field of video understanding, in particular to a time sequence action detection method for sensing video clip relation.
Background
In recent years, with the continuous development of streaming media, the number of videos on various website platforms has grown explosively. Compared with traditional image data, videos contain much richer information and have attracted increasing attention from researchers, and video understanding has gradually become a popular research field in both industry and academia. The time sequence action detection task is an important branch of video understanding: its goal is to detect the temporal boundary of each action instance in a long video and to determine the category of the action, which helps users locate the action information in videos more conveniently and quickly.
The time sequence action detection task can help people quickly identify the key content in long videos, can serve as a preprocessing step for problems such as video understanding and human-computer interaction, and is widely applied in practice: (1) video surveillance: with the rapid development of the Internet of Things, cameras have been deployed in roads, schools and many other public places; they play an important security role but also produce an extremely large amount of video data, and analysing this data purely by hand is clearly unrealistic, whereas the time sequence action detection task can quickly extract the effective information in surveillance videos and free up a large amount of human resources; (2) video retrieval: with the popularity of video on various social software, ordinary users also upload and share all kinds of video data; in order to recommend videos of interest to users, every uploaded video needs to be classified and labelled, and doing this manually is very costly, whereas time sequence action detection can detect the actions in a video, perform preliminary classification, sorting and labelling according to predefined categories, and can then be integrated into subsequent video retrieval and video recommendation algorithms.
Due to the wide application of the time sequence action detection task in industry and academia, many effective algorithms have been proposed; at present, they can be roughly divided into two types, one-stage and two-stage:
1) one-stage: video segments of different lengths are detected directly, for example with hierarchical or cascade structures, and the category information of each video segment is given at the same time during prediction;
2) two-stage: all possible candidate video segments are first extracted, and the candidate video segments are then classified with classical classifiers (such as Unet) to obtain their category information.
Although two-stage methods have achieved good results, they generate each candidate video segment independently during the candidate generation process and ignore the relationships between the video segments.
Disclosure of Invention
Based on the above problems, the present invention provides a time sequence action detection method for sensing video segment relations, which generates more effective video segment features by capturing the global and local relations between candidate video segments, thereby producing more accurate prediction results.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a time sequence action detection method for sensing video clip relation comprises the following steps:
step S1: sampling a video;
step S2: performing preliminary feature extraction on the video by using a TSN (Temporal Segment Network) model to obtain features F;
step S3: using a BaseNet model to perform feature enhancement on the extracted features F, generating boundary predictions P_S and P_E for the time sequence nodes, and in addition extracting the features of all candidate video segments, denoted X_ACFM, wherein each element of X_ACFM represents the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features X_ACFM by using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
Further, the step S2 specifically includes the following steps:
step S21: first, sampling the long video at a certain time interval to obtain a certain number of video segments;
step S22: inputting the video segments into the TSN model, obtaining the visual features and the motion features respectively, and concatenating the visual features and the motion features.
Further, the step S3 specifically includes the following steps:
step S31: establishing the relations among all video frames by means of graph convolution, and dynamically fusing multi-scale contextual semantic information into the video features;
step S32: predicting, by graph convolution, the possibility that each time sequence position is a start node or an end node, i.e. generating the boundary predictions P_S and P_E of the time sequence nodes; and, in addition, extracting the features of all candidate video segments and outputting X_ACFM.
Further, the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments located in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment whose start time is i, j represents a video segment whose duration is j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling, D represents all possible durations of the video segments, and c, i, j and X(c, i, j) have the same meaning as above;
then, a one-dimensional convolution operation with kernel size 3 is used to aggregate the information of the current position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents a convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents a convolution operation.
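For illustration only, the following PyTorch sketch shows one way the global perception unit and module described above could be implemented. The tensor layout (batch, channels, duration, start time), the use of average pooling for the horizontal and vertical pooling layers, and the reading of the claimed output formula Sig(Conv(Y)) × X × X as a sigmoid-gated residual are all assumptions made for this sketch; the class and variable names are not taken from the patent.

```python
import torch
import torch.nn as nn

class GlobalPerceptionUnit(nn.Module):
    """Sketch of the GA unit: pool over start times / durations, aggregate
    neighbours with kernel-3 1-D convolutions, then gate the input (assumed
    sigmoid-gated residual reading of Sig(Conv(Y)) x X x X)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # along the duration axis
        self.conv_v = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # along the start-time axis
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, D, T) = (batch, channels, durations, start times)
        m_h = x.mean(dim=3)                     # horizontal pooling over start times -> (B, C, D)
        m_v = x.mean(dim=2)                     # vertical pooling over durations     -> (B, C, T)
        y = self.conv_h(m_h).unsqueeze(3) + self.conv_v(m_v).unsqueeze(2)  # broadcast fusion -> (B, C, D, T)
        return torch.sigmoid(self.gate(y)) * x + x                         # gated residual output X_unit

class GlobalPerceptionModule(nn.Module):
    """Two stacked GA units followed by a 1x1 convolution: X_ga = Conv(GA(GA(X)))."""
    def __init__(self, channels):
        super().__init__()
        self.ga1 = GlobalPerceptionUnit(channels)
        self.ga2 = GlobalPerceptionUnit(channels)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.out(self.ga2(self.ga1(x)))

if __name__ == "__main__":
    x = torch.randn(2, 128, 32, 100)            # assumed sizes
    print(GlobalPerceptionModule(128)(x).shape) # torch.Size([2, 128, 32, 100])
```

Swapping in max pooling, a different fusion rule, or the literal double multiplication written in the claim only requires changing the corresponding lines of forward().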
Further, the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture the local information between candidate video segments; for an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
wherein, when i = 0, X_0 is the input X; the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
wherein X_i^ga represents the output result of the global perception module at each layer, and GA() represents the aforementioned global perception module operation;
then, an upsampling operation is used to aggregate the features between different layers, calculated as follows:
Z_i = X_i^ga + Up(X_{i+1}^ga)
wherein Z_i represents the output result of the i-th layer after fusing different layers, Up() represents the upsampling operation, and X_{i+1}^ga represents the output result of the (i+1)-th layer of the aforementioned global perception module.
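Likewise, a minimal sketch of the feature enhancement module is given below. The number of levels, the 2×2 average-pooling downsampling and the nearest-neighbour upsampling are assumptions chosen only to make the hierarchy concrete; the shared global perception module can be passed in (for example the GlobalPerceptionModule from the previous sketch), and an identity mapping is used by default so the sketch runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancementModule(nn.Module):
    """Sketch of the hierarchy: X_i = AvgPool(X_{i-1}), X_i^ga = GA(X_i),
    Z_i = X_i^ga + Up(X_{i+1}^ga), with one GA module shared by every level."""
    def __init__(self, num_levels=3, ga_module=None):
        super().__init__()
        self.num_levels = num_levels
        # Pass GlobalPerceptionModule(channels) from the previous sketch here;
        # nn.Identity keeps this sketch runnable on its own.
        self.shared_ga = ga_module if ga_module is not None else nn.Identity()
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):                                   # x: (B, C, D, T)
        feats = [x]
        for _ in range(1, self.num_levels):
            feats.append(self.pool(feats[-1]))              # downsample adjacent candidates
        ga_outs = [self.shared_ga(f) for f in feats]        # shared global perception per level
        fused = [ga_outs[-1]]                               # top level has no (i+1)-th neighbour
        for i in range(self.num_levels - 2, -1, -1):        # Z_i = X_i^ga + Up(X_{i+1}^ga)
            up = F.interpolate(ga_outs[i + 1], size=ga_outs[i].shape[-2:], mode="nearest")
            fused.insert(0, ga_outs[i] + up)
        return fused                                        # one fused map per level

if __name__ == "__main__":
    fem = FeatureEnhancementModule(num_levels=3)
    outs = fem(torch.randn(2, 128, 32, 100))
    print([tuple(o.shape) for o in outs])  # [(2,128,32,100), (2,128,16,50), (2,128,8,25)]
```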
Further, the steps S4 to S5 specifically include the following steps:
step S41: using average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the resulting multi-level structure into the shared global perception module;
step S43: aggregating the features output by the global perception modules of adjacent levels;
step S44: preliminarily obtaining the prediction score of each candidate video segment through a shared convolution operation.
Further, the step S6 specifically includes the following steps:
step S61: generating the scores of all video segments and sorting them by value;
step S62: selecting the video segment with the highest score, finding the video segments that overlap it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
Compared with the prior art, the invention has the following beneficial effects: the relations between video segments are captured from two different angles (local information and global information); for the global information, average pooling is used to establish relations for distant candidate video segments from a global viewpoint; for the local information, owing to the distribution characteristics of the candidate video segments, a hierarchical structure is used to aggregate the information between adjacent candidate video segments; because the local information and the global information complement each other, the whole model achieves better detection results;
for the prediction of candidate video segments, if they are processed separately, a constraint relation between them is ignored, and by exploring the relation between candidate video segments, more complete and accurate results can be generated, for different candidate video segments in the same video, they are often highly correlated, the relation of all candidate video segments is established, the characteristics of action instances can be enhanced by using background information, and for adjacent candidate video segments, there is a great amount of overlap between them, and effective information between them is aggregated by using average pooling, so as to generate more accurate results.
Drawings
Fig. 1 is a flowchart of embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
In this embodiment, the system mainly includes a global perception module and a feature enhancement module, where:
in the global perception module, GA units are designed to establish the relations between candidate video segments in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment whose start time is i, j represents a video segment whose duration is j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling and D represents all possible durations of the video segments;
then, a one-dimensional convolution operation with kernel size 3 is used to aggregate the information of the current position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents a convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents a convolution operation;
in addition, in the feature enhancement module, a hierarchical structure is used to capture the local information between candidate video segments. For an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
where, when i = 0, X_0 is the input X. Because of the limitation of average pooling, only the relations between neighbouring candidate boxes can be aggregated and long-distance dependencies cannot be captured; therefore the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so that the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
By doing so, the relations between all candidate video segments can be established from two complementary levels (local and global). Then an upsampling operation is used to aggregate the features between different layers; since each layer is supervised by label information, fusing the information of different layers reduces the generation of noise to the greatest extent. The calculation formula is:
Z_i = X_i^ga + Up(X_{i+1}^ga)
Thus, the relations between different candidate video segments can be captured from different levels and at different scales.
Based on the above, the time sequence action detection method for sensing video segment relations, as shown in Fig. 1, includes the following steps:
step S1: sampling a video;
selecting a proper training and testing data set, wherein the training and testing are mainly performed on the public data sets ActivityNet-1.3 and THUMOS-14;
the ActivityNet-1.3 dataset is an open dataset for video segment generation and detection, which contains mainly 19994 videos and contains 200 action classes, these videos are mainly crawled from youtube website, all of which are different in resolution and time, it was an ActivityNet change 2016 and 2017 game dataset, which divides all videos into training, evaluation and test sets in a 2:1:1 ratio;
the THIMOS-14 data set contains 413 videos and contains 20 categories of information, wherein the test set contains 212 videos, the validation set contains 200 videos used for the time series motion detection task, the entire model is trained on the validation set and the performance of the entire model is evaluated on the tester.
Step S2: performing preliminary feature extraction on the video by using the TSN (Temporal Segment Network) model to obtain features F.
First, for an unprocessed long video, the corresponding video frames are extracted and represented as S = {s_n}, n = 1, ..., N, where N represents the total number of video frames and s_n represents the n-th frame of the video. The label set of the video can be represented as Ψ = {(t_s,m, t_e,m, c_m)}, m = 1, ..., M, where M indicates the number of action video segments contained in the video, and t_s,m, t_e,m and c_m respectively represent the start time, end time and category information of the m-th label; the labels are only used during training. The TSN model is then used to extract the features of each video, which are expressed as F.
Step S3: using the BaseNet model to perform feature enhancement on the extracted features, generating the boundary predictions P_S and P_E of the time sequence nodes, and in addition extracting and outputting the features of all candidate video segments X_ACFM.
Graph convolution (a GCN) is used to enhance the features F and obtain features with richer semantic information, with the calculation formula:
F' = GCN(F)
wherein F' represents the features F enhanced by graph convolution. The features F' are then shared by two branch networks. One branch is used to judge whether each time sequence position is a start node or an end node, which can be denoted P_S and P_E, each of length T, where T represents the length of the video feature sequence. The other branch network outputs the features of all candidate video segments X_ACFM.
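For concreteness, the sketch below mirrors the two-branch structure of step S3: a shared backbone enhances the snippet features, one head outputs the start and end probabilities P_S and P_E, and a second path builds a (duration × start time) candidate feature map X_ACFM. Plain temporal convolutions stand in for the graph convolution used in the patent, and building each candidate feature by averaging the backbone features over the segment's span is an assumption made only for illustration.

```python
import torch
import torch.nn as nn

class BaseNetSketch(nn.Module):
    """Two-branch sketch: boundary probabilities P_S / P_E plus a candidate
    feature map X_ACFM of shape (B, C, D, T) (duration x start time)."""
    def __init__(self, in_dim, hidden=256, max_duration=32):
        super().__init__()
        self.max_duration = max_duration
        # Stand-in for the graph-convolution feature enhancement (assumption).
        self.backbone = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.boundary_head = nn.Conv1d(hidden, 2, kernel_size=1)  # start / end logits

    def forward(self, feats):                      # feats: (B, T, in_dim)
        f = self.backbone(feats.transpose(1, 2))   # (B, hidden, T)
        p_start, p_end = torch.sigmoid(self.boundary_head(f)).unbind(dim=1)  # each (B, T)
        B, C, T = f.shape
        cum = torch.cumsum(f, dim=2)               # prefix sums for fast span averaging
        x_acfm = f.new_zeros(B, C, self.max_duration, T)
        for d in range(1, self.max_duration + 1):  # candidate (start = t, duration = d)
            for t in range(T - d + 1):
                left = cum[:, :, t - 1] if t > 0 else 0
                x_acfm[:, :, d - 1, t] = (cum[:, :, t + d - 1] - left) / d
        return p_start, p_end, x_acfm

if __name__ == "__main__":
    net = BaseNetSketch(in_dim=2048)
    p_s, p_e, x = net(torch.randn(2, 100, 2048))
    print(p_s.shape, p_e.shape, x.shape)           # (2,100) (2,100) (2,256,32,100)
```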
Step S4: capturing the relations between the candidate video segment features X_ACFM by using the global perception module and the feature enhancement module.
For the output X_ACFM of step S3, each position of which represents the feature of one candidate video segment, the features are input into the feature enhancement module, so that the relations between candidate video segments can be captured from both the local view and the global view, the features of action instances can be enhanced and the background information can be suppressed, thereby producing more accurate and more complete results; the output is denoted Z. A convolution is then used to predict the final result of each candidate video segment:
(P_reg, P_cls) = Sig(Conv(Z))
wherein P_reg and P_cls respectively represent the two kinds of prediction results obtained under regression and classification supervision, Sig() represents the activation function, and Conv() represents a convolution operation.
Step S5: combining the prediction results of step S3 and step S4 to generate the final judgment score.
In order to make full use of the output of the whole model, the output of every layer of the feature enhancement module is fused into the final score of each candidate video segment, and the boundary information of each video segment is also taken into account. For a video segment from start time t_s to end time t_e, its score is obtained by combining the boundary probabilities P_S(t_s) and P_E(t_e) with a weighted fusion, over all K layers of the hierarchical structure, of the per-layer classification and regression predictions; the fusion weights are hyper-parameters.
Step S6: removing repeated candidate video segments by using the Soft-NMS model.
After all possible candidate boxes are obtained, since most of them overlap heavily, the Soft-NMS model is used to remove redundant candidates. The scores of the candidate video segments are sorted by value, the candidate video segment with the largest score is selected, and its IoU with every other video segment is calculated; the scores of the video segments with a high degree of overlap are attenuated according to the following formula:
s_i = s_i · exp(-iou(p_max, p_i)^2 / σ), when iou(p_max, p_i) > θ
wherein σ represents the parameter of the Gaussian function, θ represents a predefined threshold, and p_max and p_i respectively represent the currently selected candidate video segment and another video segment.
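A compact sketch of the Gaussian Soft-NMS suppression used in this step is given below. The temporal IoU helper and the Gaussian score decay follow the standard Soft-NMS formulation; the default values of σ, the IoU threshold and the number of retained proposals are assumptions.

```python
import numpy as np

def temporal_iou(seg, segs):
    """IoU between one [start, end] segment and an array of segments."""
    inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(segments, scores, sigma=0.5, iou_thresh=0.3, keep_top=100):
    """Gaussian Soft-NMS: repeatedly pick the best proposal and decay the
    scores of proposals that overlap it by more than iou_thresh."""
    segments, scores = segments.copy(), scores.copy()
    keep = []
    while len(keep) < keep_top and scores.size > 0:
        best = int(np.argmax(scores))
        keep.append((segments[best], scores[best]))
        seg = segments[best]
        segments = np.delete(segments, best, axis=0)
        scores = np.delete(scores, best)
        iou = temporal_iou(seg, segments) if scores.size else np.array([])
        decay = np.where(iou > iou_thresh, np.exp(-(iou ** 2) / sigma), 1.0)
        scores = scores * decay
    return keep

if __name__ == "__main__":
    segs = np.array([[1.0, 5.0], [1.2, 5.1], [8.0, 12.0]])
    print([s for s, _ in soft_nms(segs, np.array([0.9, 0.8, 0.7]), keep_top=2)])
```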
Step S7: classifying the candidate video segments by using the Unet classifier to obtain the class information of the candidate video segments.
After all possible candidate video segments are obtained, the video segments are classified with the Unet classifier to obtain their final class information, which can be expressed as {(c_n, s_n)}, n = 1, ..., N_p, where c_n is the class information, s_n is the corresponding score, and N_p is the number of predicted action instances.
Example 2
In this embodiment, the whole model of embodiment 1 needs to be trained, and the overall loss function is expressed as:
L = L_TEM + λ · L_FEM
wherein λ is a hyper-parameter, and L_TEM is used to judge whether each time node is a start node or an end node, which can be expressed as:
L_TEM = L_wce(P_S, G_S) + L_wce(P_E, G_E)
where L_wce(·,·) represents a weighted cross-entropy loss function, P_S and G_S respectively represent the prediction result for the start nodes and its corresponding label, and P_E and G_E respectively represent the prediction result for the end nodes and its corresponding label.
To train the feature enhancement module, a label G_0 is generated for the 0-th layer of the feature enhancement module. Since the feature enhancement module is a hierarchical structure and every layer is supervised, a label G_i is generated for each layer from the layer-0 label G_0. The loss function of each layer can therefore be defined as:
L_i = L_reg(P_reg,i, G_i) + L_cls(P_cls,i, G_i)
wherein L_i represents the loss function of the i-th layer, P_reg,i and G_i respectively represent the prediction obtained by the i-th layer under regression supervision and its corresponding label, P_cls,i and G_i respectively represent the prediction obtained by the i-th layer under classification supervision and its corresponding label, and L_reg and L_cls respectively represent the squared-error loss and the weighted cross-entropy loss. Then
L_FEM = Σ_{i=0}^{K} λ_i · L_i
wherein L_FEM represents the loss function of the hierarchy, i represents the i-th layer, K represents the total number of layers, and λ_i represents a hyper-parameter. Finally, the loss function L can be optimized in an end-to-end fashion.
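The loss structure of this embodiment can be summarised with the short sketch below. The weighted binary cross-entropy, the squared-error regression term and the per-layer weighted sum follow the description above, while the positive/negative re-weighting scheme and the example shapes are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, target):
    """Weighted binary cross-entropy: up-weight the (rare) positive positions."""
    pos_ratio = target.mean().clamp(min=1e-4, max=1 - 1e-4)
    weight = torch.where(target > 0.5, 0.5 / pos_ratio, 0.5 / (1 - pos_ratio))
    return F.binary_cross_entropy(pred, target, weight=weight)

def total_loss(p_start, p_end, g_start, g_end, layer_preds, layer_labels,
               lam=1.0, layer_weights=None):
    """L = L_TEM + lam * L_FEM, with L_FEM a weighted sum of per-layer
    classification (weighted BCE) and regression (MSE) losses."""
    l_tem = weighted_bce(p_start, g_start) + weighted_bce(p_end, g_end)
    layer_weights = layer_weights or [1.0] * len(layer_preds)
    l_fem = sum(w * (weighted_bce(p_cls, g) + F.mse_loss(p_reg, g))
                for w, (p_cls, p_reg), g in zip(layer_weights, layer_preds, layer_labels))
    return l_tem + lam * l_fem

if __name__ == "__main__":
    T, D = 100, 32
    ps, pe = torch.rand(T), torch.rand(T)
    gs, ge = (torch.rand(T) > 0.9).float(), (torch.rand(T) > 0.9).float()
    preds = [(torch.rand(D, T), torch.rand(D, T)), (torch.rand(D // 2, T // 2), torch.rand(D // 2, T // 2))]
    labels = [torch.rand(D, T), torch.rand(D // 2, T // 2)]
    print(total_loss(ps, pe, gs, ge, preds, labels).item())
```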
Example 3
In this embodiment, the validity of the method is verified by using the selected data set, which specifically includes:
the method is verified on the selected data set, in order to well evaluate the effectiveness of the embodiment, the average accuracy mAP is selected as a main evaluation index, the mAP is respectively calculated on an iou set {0.3,0.4,0.5,0.6 and 0.7} on the THUMOS-14 data set, the mAP on the iou set {0.5,0.75 and 0.95} is calculated on an activityNet1.3 data set, and in addition, the average mAP of ten different ious is calculated on the activityNet 1.3.
This embodiment is verified on the currently mainstream ActivityNet-1.3 dataset, and the final verification results are shown in the following table (comparison of model performance on the ActivityNet-1.3 dataset, %):
Table 1. Comparison of model performance on the ActivityNet-1.3 dataset
This embodiment is also verified on the currently mainstream THUMOS-14 dataset, and the final verification results are shown in the following table (comparison of model performance on the THUMOS-14 dataset, %):
Table 2. Comparison of model performance on the THUMOS-14 dataset
The above are embodiments of the present invention. The specific parameters in the above embodiments and examples are only intended to clearly illustrate the inventors' verification process and are not intended to limit the scope of protection of the invention, which is defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present invention shall likewise fall within the scope of protection of the present invention.

Claims (5)

1. A method for detecting time sequence action of sensing video clip relation is characterized by comprising the following steps:
step S1: sampling a video;
step S2: performing preliminary feature extraction on the video by using a TSN (Temporal Segment Network) model to obtain features F;
step S3: using a BaseNet model to perform feature enhancement on the extracted features F, generating boundary predictions P_S and P_E for the time sequence nodes; in addition, extracting the features of all candidate video segments, denoted X_ACFM, wherein each element of X_ACFM represents the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features X_ACFM by using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: classifying the candidate video segments by using a Unet classifier to obtain class information of the candidate video segments;
the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment start time of i, j represents a video segment duration of j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling, D represents all possible durations of the video segments, c represents the c-th channel, i represents a video segment start time of i, j represents a video segment duration of j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
then, the information of the current position and its neighbours is aggregated by a one-dimensional convolution operation with kernel size 3, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents the convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents the convolution operation;
the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture the local information between candidate video segments; for an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
wherein, when i = 0, X_0 is the input X; the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
wherein X_i^ga represents the output result of the global perception module at each layer, and GA() represents the aforementioned global perception module operation;
then, an upsampling operation is used to aggregate the features between different layers, calculated as follows:
Z_i = X_i^ga + Up(X_{i+1}^ga)
wherein Z_i represents the output result of the i-th layer after fusing different layers, Up() represents the upsampling operation, and X_{i+1}^ga represents the output result of the (i+1)-th layer of the global perception module.
2. The method according to claim 1, wherein said method comprises: the step S2 specifically includes the following steps:
step S21: first, sampling the long video at a certain time interval to obtain a certain number of video segments;
step S22: inputting the video segments into the TSN model, obtaining the visual features and the motion features respectively, and concatenating the visual features and the motion features.
3. The method according to claim 1, wherein said method comprises: the step S3 specifically includes the following steps:
step S31: establishing the relation among all video frames by utilizing graph convolution, and dynamically fusing multi-scale context semantic information to video characteristics;
step S32: predicting, by graph convolution, the possibility that each time sequence position is a start node or an end node, i.e. generating the boundary predictions P_S and P_E of the time sequence nodes; and, in addition, extracting the features of all candidate video segments and outputting X_ACFM.
4. The method according to claim 1, wherein said method comprises: the steps S4 to S5 specifically include the following steps:
step S41: using average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
5. The method according to claim 1, wherein said method comprises: the step S6 specifically includes the following steps:
step S61: generating the scores of all video segments and sorting them by value;
step S62: selecting the video segment with the highest score, finding the video segments that overlap it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
CN202110659154.1A 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation Active CN113255570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Publications (2)

Publication Number Publication Date
CN113255570A CN113255570A (en) 2021-08-13
CN113255570B true CN113255570B (en) 2021-09-24

Family

ID=77187848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659154.1A Active CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Country Status (1)

Country Link
CN (1) CN113255570B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920467B (en) * 2021-12-13 2022-03-15 成都考拉悠然科技有限公司 Tourist and commercial detection method and system combining booth detection and scene segmentation

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11103773B2 (en) * 2018-07-27 2021-08-31 Yogesh Rathod Displaying virtual objects based on recognition of real world object and identification of real world object associated location or geofence
CN111372123B (en) * 2020-03-03 2022-08-09 南京信息工程大学 Video time sequence segment extraction method based on local to global

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation; Haisheng Su et al.; arXiv; 2021-03-01; 1-9 *
Cooperative Cross-Stream Network for Discriminative Action Representation; Jingran Zhang et al.; arXiv; 2019-08-27; 1-10 *
Temporal Context Aggregation Network for Temporal Action Proposal Refinement; Zhiwu Qing et al.; arXiv; 2021-03-24; 1-10 *
Research on video-based human action recognition based on deep learning; Li Yiying; China Masters' Theses Full-text Database, Information Science and Technology; 2021-01-15 (No. 01); I138-1923 *

Also Published As

Publication number Publication date
CN113255570A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
CN107330362B (en) Video classification method based on space-time attention
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
Wang et al. Spatial–temporal pooling for action recognition in videos
CN107341508B (en) Fast food picture identification method and system
CN110532911B (en) Covariance measurement driven small sample GIF short video emotion recognition method and system
CN111783712A (en) Video processing method, device, equipment and medium
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN107220902A (en) The cascade scale forecast method of online community network
CN110533053B (en) Event detection method and device and electronic equipment
CN113255570B (en) Sequential action detection method for sensing video clip relation
CN112364852B (en) Action video segment extraction method fusing global information
CN115659966A (en) Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN110263638B (en) Video classification method based on significant information
Zou et al. STA3D: Spatiotemporally attentive 3D network for video saliency prediction
CN114741599A (en) News recommendation method and system based on knowledge enhancement and attention mechanism
Sharma et al. Construction of large-scale misinformation labeled datasets from social media discourse using label refinement
CN113298015A (en) Video character social relationship graph generation method based on graph convolution network
CN113010705A (en) Label prediction method, device, equipment and storage medium
CN114092819B (en) Image classification method and device
CN113033500B (en) Motion segment detection method, model training method and device
Li et al. Human perception evaluation system for urban streetscapes based on computer vision algorithms with attention mechanisms
Lin et al. Temporal action localization with two-stream segment-based RNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant