CN110059584B - Event naming method combining boundary distribution and correction - Google Patents
Classifications
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- Y04S10/50 — Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention provides an event nomination (temporal event proposal) method combining boundary distribution and correction. An event nomination network is formed by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and the trained, updated network is used to predict nominations for video events. The start-point distribution network and the end-point distribution network are used for predicting event nominations; the boundary recurrent correction network is used for generating offset information for the predicted nominations and correcting their boundaries. By incorporating the distribution of event start and end points in real videos, the method generates event nominations that fit the true event distribution, and by correcting nomination boundaries with a recurrent correction network it obtains nominations that better match real events and have more accurate boundaries.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to an event nomination method combining boundary distribution and correction, where "event nomination" denotes generating temporal event proposals in video.
Background
With the rapid development of the internet and portable devices, shooting video has become convenient and easy, and large numbers of videos are uploaded to the internet, varying widely in content, duration and other respects. Most video-based computer vision algorithms, such as action recognition, operate on short clips obtained by trimming long videos to some extent. Trimming long videos, however, incurs heavy labor and time costs, so processing and analyzing untrimmed long videos has become necessary to meet real-world demands.
Dense video description [1] is a new task defined on untrimmed long videos, whose goal is to describe each of the multiple events occurring in a video in natural language. It can be divided into two parts: first, localizing the events in the video, i.e., finding the start and end times of all events, which amounts to extracting event nominations; and second, describing each localized event in natural language. Its potential applications are very broad, including early childhood education, daily assistance for the blind, movie subtitling, and video retrieval and classification.
Robust dense video description relies on high-quality event nominations: the generated nominations must not only cover the time spans of all possible events, but their boundaries must also coincide with those of the real events in the video. Compared with image-based tasks such as object detection, high-quality event nomination requires extracting not only object information from the video but also the temporal motion information of those objects, i.e., the relevant dynamic information. In real life, video capture conditions are unconstrained: multiple events in a video may overlap in time, and shooting angle, shooting distance and so on vary considerably, all of which makes event nomination very challenging.
Event nomination in video bears some similarity to object detection in images, and much current research on event nomination is inspired by object detection. Event nomination, however, must attend not only to the appearance features of the video but also to its temporal dynamics, and the time spans of events often vary greatly. Shou et al. [2] apply sliding windows of different scales to the video feature sequence and use [3] to extract features within each window for predicting event nominations. This method can only produce nominations of a limited set of preset lengths and cannot adapt flexibly to the actual length of an event; moreover, the repeated scanning of the same frames by windows of different scales introduces redundant computation. Gao et al. [4] further regress the boundaries of sliding-window-based nominations to obtain more flexible and accurate boundaries. Chao et al. [5] extend the Faster R-CNN [6] framework into a two-stream network for video and apply dilated convolution along the temporal axis to enlarge the receptive field, thereby obtaining nominations with larger time spans. Exploiting the memory of recurrent neural networks over video feature sequences, [7] feeds the video features within a sliding window into a recurrent neural network to predict the start/end times and confidences of nominations at different scales. Sliding windows, however, typically entail many repetitive operations; [8] avoids this repeated computation across scales by predicting nominations of several different scales at every node of a recurrent neural network. The boundaries predicted by this method are nevertheless fixed, their accuracy is limited by the feature-extraction stride, and they are difficult to fit accurately to real event boundaries.
Disclosure of Invention
The invention provides an event nomination method combining boundary distribution and correction, aiming to overcome the technical defects of existing recurrent-neural-network-based event nomination methods: fixed nomination boundaries whose accuracy is limited by the feature-extraction stride and which are difficult to fit to real event boundaries.
In order to solve the technical problems, the technical scheme of the invention is as follows:
An event nomination method combining boundary distribution and correction forms an event nomination network by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and performs nomination prediction on video events with the trained, updated network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating offset information for the predicted event nominations and correcting their boundaries.
The start-point distribution network and the end-point distribution network are constructed as follows:
normalize the video lengths of an existing dataset and determine the relative position of each event's start and end points within its video;
count the relative positions of all event start and end points in the dataset, and take the probability distributions w_s0 and w_e0 of event start and end points over the normalized video timeline, where w_s0 and w_e0 denote the start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network.
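The start/end-point statistics described above can be sketched in a few lines; this is a minimal illustration in plain Python, where the function name, bin count, and the `(start, end, duration)` input format are our own choices rather than anything specified by the patent:

```python
def start_end_distributions(events, num_bins=100):
    """Estimate the start/end-point probability distributions (w_s0, w_e0)
    over a normalized [0, 1] video timeline.

    `events` is a list of (start, end, video_duration) tuples; the function
    name, binning, and input format are illustrative assumptions.
    """
    w_s0 = [0.0] * num_bins
    w_e0 = [0.0] * num_bins
    for start, end, duration in events:
        # Relative positions after normalizing the video length
        s_bin = min(int(start / duration * num_bins), num_bins - 1)
        e_bin = min(int(end / duration * num_bins), num_bins - 1)
        w_s0[s_bin] += 1.0
        w_e0[e_bin] += 1.0
    n = float(len(events))
    # Convert counts to probability distributions over the timeline
    return [c / n for c in w_s0], [c / n for c in w_e0]
```

Each returned list sums to 1 and gives the empirical probability that an event starts (or ends) in each segment of the normalized timeline.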
The process by which the start-point and end-point distribution networks predict event nominations is as follows:
obtain the video features of a sample video through a three-dimensional convolutional network, and process them with the recurrent neural networks underlying the start-point and end-point distribution networks to obtain the features output at each time point of both networks;
at each time point, the start-point and end-point distribution networks each output K confidences, representing the likelihoods of K fixed-length event nominations whose intervals are:
[t − k, t + 1], k ∈ [0, K];
where t and k satisfy t ≥ k, and t varies with the video length; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of a corresponding nomination in the start-point and end-point distribution networks is taken as the final event confidence, completing the prediction of event nominations.
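The per-time-point candidate generation and confidence summing described above can be sketched as follows; a hedged illustration, with the function name and the list-based confidence inputs being our own assumptions:

```python
def candidate_nominations(t, K, conf_start, conf_end):
    """At time point t, form up to K fixed-length candidate event
    nominations [t - k, t + 1] (with k <= t) and score each by summing
    the confidences output by the start-point and end-point distribution
    networks.

    conf_start[k] / conf_end[k] are the kth confidences at time t; this
    is an illustrative sketch, not the patent's exact implementation.
    """
    nominations = []
    for k in range(min(K, t + 1)):           # values must satisfy t >= k
        interval = (t - k, t + 1)
        score = conf_start[k] + conf_end[k]  # combined event confidence
        nominations.append((interval, score))
    return nominations
```

At time t the candidates all end at t + 1 and reach back up to K steps, so longer videos (larger t) admit longer candidate intervals.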
The boundary recurrent correction network is built from a two-layer recurrent neural network: the first layer computes the features output at each time point from the video features of a sample video; the second layer generates offset information for the predicted event nominations and corrects their boundaries.
The process of generating offset information for predicted event nominations and correcting nomination boundaries is as follows:
compute the center-coordinate offset Δc and the scale change factor Δl of each nomination from the predicted nomination, with the specific formulas:
Δc = (G_c − P_c) / P_c;
Δl = log(G_l / P_l);
where G_c is the center coordinate of the real event nomination and P_c the center coordinate of the predicted nomination; G_l is the scale (length) of the real nomination and P_l the scale of the predicted nomination;
taking Δc and Δl as supervision signals, train the second-layer recurrent neural network with an L1-norm loss to obtain the predicted offset information of each nomination, denoted Δc' and Δl';
correct the predicted nomination boundary with the offsets Δc' and Δl' to obtain the corrected nomination center P_c' and scale P_l', specifically:
P_c' = P_c · (1 + Δc');
P_l' = P_l · exp(Δl');
from the corrected center P_c' and scale P_l', the corrected start and end times are:
P'_start = P_c' − P_l' / 2;
P'_end = P_c' + P_l' / 2;
where P'_start is the corrected nomination start time and P'_end the corrected nomination end time, completing the correction of the predicted nomination boundary.
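The target computation and its inverse can be written out as a short sketch; we assume here that the correction transform exactly mirrors the target parameterization (so that perfect predicted offsets recover the real boundary), which the formulas' definitions imply but the patent text does not spell out:

```python
import math

def correction_targets(G_c, G_l, P_c, P_l):
    """Supervision signals: center offset dc and log-scale change dl."""
    dc = (G_c - P_c) / P_c
    dl = math.log(G_l / P_l)
    return dc, dl

def apply_correction(P_c, P_l, dc, dl):
    """Apply predicted offsets (dc', dl') to a predicted nomination and
    return the corrected (start, end) times. Sketch only: assumed to be
    the inverse of correction_targets."""
    P_c2 = P_c * (1.0 + dc)        # corrected center P_c'
    P_l2 = P_l * math.exp(dl)      # corrected scale P_l'
    start = P_c2 - P_l2 / 2.0      # corrected start time P'_start
    end = P_c2 + P_l2 / 2.0        # corrected end time P'_end
    return start, end
```

With this pairing, feeding the exact targets back through `apply_correction` reproduces the ground-truth interval, which is a useful sanity check on the parameterization.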
The event nomination network loss function is the weighted superposition of a start-point distribution network loss subfunction, an end-point distribution network loss subfunction and a boundary recurrent correction network loss subfunction: the start-point distribution network loss subfunction is loss_s(c_s, t, X, y_s), the end-point distribution network loss subfunction is loss_e(c_e, t, X, y_e), and the boundary recurrent correction network loss subfunction is loss_reg(t_i);
where X denotes the entire dataset; y_t^k is the ground-truth supervision signal indicating whether the kth event nomination at the tth time point is a real event, and c_s, c_e are the nomination confidences under the start-point and end-point distribution networks;
K is the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the correction supervision signals for the kth event nomination at the t_i-th time point of the video; Δc'_k and Δl'_k are the offsets predicted for the same nomination at the same time point;
the overall loss function loss(c, t, X, y) is therefore:
loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);
where α, β and γ are the weight coefficients of the three sub-loss functions.
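The weighted superposition, together with the L1-norm regression loss mentioned above for the boundary targets, can be sketched as follows; the default weights are placeholders, since the patent does not state concrete values for α, β, γ:

```python
def l1_loss(pred, target):
    """Mean absolute error: the L1-norm regression loss used to train the
    boundary correction network (plain-Python sketch)."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(loss_s, loss_e, loss_reg, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted superposition of the three sub-losses:
    loss = alpha*loss_s + beta*loss_e + gamma*loss_reg.
    The weight defaults are illustrative placeholders."""
    return alpha * loss_s + beta * loss_e + gamma * loss_reg
```

In practice each sub-loss would be computed over a batch by the corresponding network head before being combined.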
All recurrent neural networks in the event nomination network are trained and updated with this loss function, completing the training and updating of the event nomination network and yielding the trained, updated network.
The specific process of nomination prediction for a video with the trained, updated event nomination network is:
S1: obtain the video features of the sample video through a three-dimensional convolutional network;
S2: process the video features with the trained recurrent neural networks to obtain the features output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
S3: at each time point, the start-point and end-point distribution networks each output several confidences; the sum of the confidences of a corresponding nomination in the two networks is taken as the final event confidence, completing the prediction of event nominations;
S4: the boundary recurrent correction network generates offset information for the predicted nominations;
S5: sort the event confidences in descending order, take the top 1000 nominations, and correct them with the corresponding offset information to obtain the final predicted event nominations.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
by incorporating the distribution of event start and end points in real videos, the proposed event nomination method combining boundary distribution and correction generates nominations that fit the true event distribution, and by correcting nomination boundaries with a recurrent correction network it obtains nominations that better match real events and have more accurate boundaries.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, an event nomination method combining boundary distribution and correction forms an event nomination network by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and performs nomination prediction on video events with the trained, updated network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating offset information for the predicted event nominations and correcting their boundaries.
More specifically, the start-point distribution network and the end-point distribution network are constructed as follows:
normalize the video lengths of an existing dataset and determine the relative position of each event's start and end points within its video;
count the relative positions of all event start and end points in the dataset, and take the probability distributions w_s0 and w_e0 of event start and end points over the normalized video timeline, where w_s0 and w_e0 denote the start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network.
In a specific implementation, when compiling the dataset statistics, event nominations whose temporal intersection-over-union (tIoU) with a real event exceeds a certain threshold σ are taken as positive samples, and the start/end-point distribution of all positive samples is counted.
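The tIoU computation and positive-sample selection can be sketched as follows; σ = 0.5 is a placeholder default, since the patent only says "a certain threshold":

```python
def tiou(a, b):
    """Temporal intersection-over-union of two intervals a = (start, end)
    and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def positive_samples(nominations, real_events, sigma=0.5):
    """Keep nominations whose tIoU with at least one real event exceeds
    the threshold sigma (sigma=0.5 is an illustrative default)."""
    return [n for n in nominations
            if any(tiou(n, g) > sigma for g in real_events)]
```

Only these positive samples contribute to the empirical start/end-point distributions.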
More specifically, the process by which the start-point and end-point distribution networks predict event nominations is as follows:
obtain the video features of a sample video through a three-dimensional convolutional network, and process them with the recurrent neural networks underlying the start-point and end-point distribution networks to obtain the features output at each time point of both networks;
at each time point, the start-point and end-point distribution networks each output K confidences, representing the likelihoods of K fixed-length event nominations whose intervals are:
[t − k, t + 1], k ∈ [0, K];
where t and k satisfy t ≥ k, and t varies with the video length; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of a corresponding nomination in the start-point and end-point distribution networks is taken as the final event confidence, completing the prediction of event nominations.
More specifically, the boundary recurrent correction network is built from a two-layer recurrent neural network: the first layer computes the features output at each time point from the video features of a sample video; the second layer generates offset information for the predicted event nominations and corrects their boundaries.
More specifically, the process of generating offset information for predicted event nominations and correcting nomination boundaries is as follows:
compute the center-coordinate offset Δc and the scale change factor Δl of each nomination from the predicted nomination, with the specific formulas:
Δc = (G_c − P_c) / P_c;
Δl = log(G_l / P_l);
where G_c is the center coordinate of the real event nomination and P_c the center coordinate of the predicted nomination; G_l is the scale (length) of the real nomination and P_l the scale of the predicted nomination;
taking Δc and Δl as supervision signals, train the second-layer recurrent neural network with an L1-norm loss to obtain the predicted offset information of each nomination, denoted Δc' and Δl';
correct the predicted nomination boundary with the offsets Δc' and Δl' to obtain the corrected nomination center P_c' and scale P_l', specifically:
P_c' = P_c · (1 + Δc');
P_l' = P_l · exp(Δl');
from the corrected center P_c' and scale P_l', the corrected start and end times are:
P'_start = P_c' − P_l' / 2;
P'_end = P_c' + P_l' / 2;
where P'_start is the corrected nomination start time and P'_end the corrected nomination end time, completing the correction of the predicted nomination boundary.
More specifically, the event nomination network loss function is the weighted superposition of a start-point distribution network loss subfunction, an end-point distribution network loss subfunction and a boundary recurrent correction network loss subfunction: the start-point distribution network loss subfunction is loss_s(c_s, t, X, y_s), the end-point distribution network loss subfunction is loss_e(c_e, t, X, y_e), and the boundary recurrent correction network loss subfunction is loss_reg(t_i);
where X denotes the entire dataset; y_t^k is the ground-truth supervision signal indicating whether the kth event nomination at the tth time point is a real event, and c_s, c_e are the nomination confidences under the start-point and end-point distribution networks;
K is the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the correction supervision signals for the kth event nomination at the t_i-th time point of the video; Δc'_k and Δl'_k are the offsets predicted for the same nomination at the same time point;
the overall loss function loss(c, t, X, y) is therefore:
loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);
where α, β and γ are the weight coefficients of the three sub-loss functions.
More specifically, all recurrent neural networks in the event nomination network are trained and updated with this loss function, completing the training and updating of the event nomination network and yielding the trained, updated network.
More specifically, the specific process of nomination prediction for a video with the trained, updated event nomination network is:
S1: obtain the video features of the sample video through a three-dimensional convolutional network;
S2: process the video features with the trained recurrent neural networks to obtain the features output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
S3: at each time point, the start-point and end-point distribution networks each output several confidences; the sum of the confidences of a corresponding nomination in the two networks is taken as the final event confidence, completing the prediction of event nominations;
S4: the boundary recurrent correction network generates offset information for the predicted nominations;
S5: sort the event confidences in descending order, take the top 1000 nominations, and correct them with the corresponding offset information to obtain the final predicted event nominations.
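Steps S5's ranking-and-correction stage can be sketched as follows; the data layout (center/length/confidence triples and an index-to-offsets mapping) is an illustrative assumption, not the patent's representation:

```python
import math

def predict_nominations(nominations, offsets, top_n=1000):
    """Sort candidate nominations by confidence (descending), keep the
    top_n, and correct each boundary with its predicted offsets.

    nominations: list of (center, length, confidence) triples;
    offsets: dict mapping nomination index -> (dc', dl').
    Returns a list of (start, end, confidence) tuples.
    """
    order = sorted(range(len(nominations)),
                   key=lambda i: nominations[i][2], reverse=True)[:top_n]
    results = []
    for i in order:
        c, l, conf = nominations[i]
        dc, dl = offsets[i]
        c2 = c * (1.0 + dc)          # corrected center
        l2 = l * math.exp(dl)        # corrected scale
        results.append((c2 - l2 / 2.0, c2 + l2 / 2.0, conf))
    return results
```

With zero offsets this simply converts each (center, length) pair into its (start, end) interval, ranked by confidence.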
Example 2
Building on Embodiment 1, training and verification are performed on ActivityNet, which contains 20,000 untrimmed videos totaling 849 hours, annotated with roughly 100,000 descriptive sentences. In this dataset each video contains multiple events and their descriptions, and the events within a single video differ in start/end times and durations. ActivityNet has three parts, a training set, a validation set and a test set, with 10024, 4926 and 5044 videos respectively; this embodiment conducts experiments mainly on the training and validation sets.
When features are extracted with the three-dimensional convolutional network [9], one video feature is extracted every 64 frames, and the feature dimension is compressed to 500 by principal component analysis. The recurrent neural network used is a long short-term memory (LSTM) network of dimension 512, and K in the model is set to 256. In the language model that generates sentences, each sentence is capped at 32 words, and words occurring fewer than 3 times are deleted when building the vocabulary. The event nomination network is first trained until stable and then trained jointly with the language model, with the learning rate set to 5e-5.
In the specific implementation process, two indexes are generally used to evaluate the quality of event nominations: recall and precision. Recall measures how many real events are covered by the predicted event nominations, and precision measures how many of the predicted event nominations are correct. In addition, there is a comprehensive index, the f1 score, which balances precision and recall and is computed from them as follows:

f1 = 2 × (precision × recall) / (precision + recall);
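The harmonic-mean formula above can be computed directly; this is a generic sketch, not code from the patent:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# f1 rewards balance: a perfect recall cannot compensate for poor precision
balanced = f1_score(0.6, 0.6)   # 0.6
skewed = f1_score(0.2, 1.0)     # 1/3
```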
the comparison of the event naming method combining boundary distribution and correction on the ActivityNet provided by the invention with the existing method is shown in the table 1:
TABLE 1
Method | Recall@1000 | Precision@1000 | f1 score@1000 |
---|---|---|---|
SST [13] | 0.716 | 0.533 | 0.571 |
Start-stop point modeling | 0.731 | 0.530 | 0.573 |
Boundary regression | 0.704 | 0.561 | 0.590 |
Start-stop distribution + boundary regression | 0.716 | 0.560 | 0.592 |
As shown in Table 1, @1000 denotes the event nominations with the top 1000 confidences. Comparing the method of the invention with the existing SST method [13], on which it is based and which likewise predicts event nominations with a recurrent neural network: SST does not exploit statistics such as the distribution of event start and stop points, and its predicted nomination boundaries are fixed. By counting the event start and stop points, the recall is clearly improved; after further regressing the nominated boundaries, the precision and f1 score are greatly improved while the recall remains essentially unchanged.
In the implementation process, when the event nominations are evaluated on the dense video description task, the main indexes are BLEU-1, BLEU-2, BLEU-3, BLEU-4, Meteor, Rouge-L and CIDEr-D, which measure the similarity between the description sentences generated by the event nomination network and the real description sentences. Among these indexes, Meteor correlates best with human judgment, so the performance of the event nomination network is mainly examined on that index.
As shown in Table 2, compared with existing methods on the dense video description task, the method of the invention performs better on most indexes, especially Meteor, which demonstrates the effectiveness of the event nomination network.
Table 2 Experimental results on the ActivityNet Captions validation set
It should be understood that the above examples of the present invention are provided by way of illustration only and are not intended to limit its embodiments. Other variations or modifications will be apparent to those of ordinary skill in the art in light of the above teachings; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention is intended to fall within the protection scope of the claims.
[1]R.Krishna,K.Hata,F.Ren,L.Fei-Fei,and J.C.Niebles,“Dense-captioning events in videos,”in Proc.IEEE International Conference on Computer Vision,2017,pp.706-715.
[2]Z.Shou,D.Wang,and S.Chang,“Temporal action localization in untrimmed videos via multi-stage CNNs,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2016,pp.1049-1058.
[3]S.Ji,W.Xu,M.Yang,and K.Yu,“3D convolutional neural networks for human action recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.35,no.1,pp.221-231,2013.
[4]J.Gao,Z.Yang,C.Sun,K.Chen,and R.Nevatia,“TURN TAP:Temporal unit regression network for temporal action proposals,”in Proc.IEEE International Conference on Computer Vision,2017,pp.3648-3656.
[5]Y.Chao,S.Vijayanarasimhan,B.Seybold,D.A.Ross,J.Deng,and R.Sukthankar,“Rethinking the faster R-CNN architecture for temporal action localization,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.1130-1139.
[6]S.Ren,K.He,R.Girshick,and J.Sun,“Faster R-CNN:Towards real-time object detection with region proposal networks,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.39,no.6,pp.1137-1149,2017.
[7]V.Escorcia,F.Caba,J.C.Niebles,and B.Ghanem,“Daps:Deep action proposals for action understanding,”in Proc.European Conference on Computer Vision,2016,pp.768–784.
[8]S.Buch,V.Escorcia,C.Shen,B.Ghanem,and J.C.Niebles,“SST:Single-stream temporal action proposals,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2017,pp.6373-6382.
[9]S.Ji,W.Xu,M.Yang,and K.Yu,“3D convolutional neural networks for human action recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.35,no.1,pp.221-231,2013.
[10]R.Krishna,K.Hata,F.Ren,L.Fei-Fei,and J.C.Niebles,“Dense-captioning events in videos,”2017 IEEE International Conference on Computer Vision(ICCV),pp.706–715,2017.
[11]Y.Li,T.Yao,Y.Pan,H.Chao,and T.Mei,“Jointly localizing and describing events for dense video captioning,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.7492-7500.
[12]J.Wang,W.Jiang,L.Ma,W.Liu,and Y.Xu,“Bidirectional attentive fusion with context gating for dense video captioning,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.7190-7198.
[13]S.Buch,V.Escorcia,C.Shen,B.Ghanem,and J.C.Niebles,“SST:Single-stream temporal action proposals,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2017,pp.6373-6382.
Claims (6)
1. An event naming method combining boundary distribution and correction, characterized in that: an event nomination network is formed by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and nomination prediction is performed on video events with the trained and updated event nomination network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating bias information for the predicted event nominations and for correcting event nomination boundaries;
the construction process of the start-point distribution network and the end-point distribution network comprises the following steps:
normalizing the video lengths of an existing data set and determining the relative position of each event start and stop point within its video;
counting the relative positions of all event start and stop points in the videos of the data set to obtain the probability distributions w_s0 and w_e0 of all event start and stop points on the video timeline, where w_s0 and w_e0 denote the event start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network;
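The boundary statistics above can be sketched as normalized histograms over a [0, 1] video timeline. The bin count and the toy annotation list are illustrative assumptions of this sketch:

```python
import numpy as np

def boundary_distribution(events, video_lengths, bins=100):
    """Estimate the probability distributions w_s0, w_e0 of event
    start and end points over a normalized [0, 1] video timeline."""
    starts, ends = [], []
    for (s, e), length in zip(events, video_lengths):
        starts.append(s / length)   # relative start position
        ends.append(e / length)     # relative end position
    w_s0, _ = np.histogram(starts, bins=bins, range=(0.0, 1.0))
    w_e0, _ = np.histogram(ends, bins=bins, range=(0.0, 1.0))
    return w_s0 / w_s0.sum(), w_e0 / w_e0.sum()

# toy annotations: (start, end) in seconds plus each video's total length
events = [(0.0, 10.0), (5.0, 20.0), (30.0, 40.0)]
lengths = [40.0, 40.0, 40.0]
w_s0, w_e0 = boundary_distribution(events, lengths, bins=4)
```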
the process of predicting event nominations with the start-point distribution network and the end-point distribution network is specifically:
acquiring video features of a sample video through a three-dimensional convolutional network, and processing the acquired features with a recurrent neural network in the start-point distribution network and the end-point distribution network to obtain the video feature output at each time point of the two networks;
outputting K confidences at each time point of the start-point distribution network and the end-point distribution network, where the confidences represent the likelihoods of K fixed-length event nominations whose time spans are:
[t-k, t+1], k ∈ [0, K];
where the values of t and k satisfy t ≥ k, and the value of t varies with the video length; a higher confidence means a greater likelihood of being an event nomination; the sum of the confidences of the corresponding event nomination in the start-point distribution network and the end-point distribution network is taken as the final event confidence, completing the prediction of event nominations.
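A minimal sketch of the K fixed-length nominations per time point and the confidence fusion described above; the (T, K) array shapes and names are assumptions, not the patent's implementation:

```python
import numpy as np

def fuse_confidences(start_conf, end_conf):
    """At each time point t the two networks each output K confidences;
    the fused event confidence is their sum, and nomination k at time t
    covers the span [t - k, t + 1] (valid only when t >= k).

    start_conf, end_conf: (T, K) confidence arrays."""
    T, K = start_conf.shape
    fused = start_conf + end_conf
    spans = [(t - k, t + 1) for t in range(T) for k in range(K) if t >= k]
    return fused, spans

rng = np.random.default_rng(1)
fused, spans = fuse_confidences(rng.random((8, 4)), rng.random((8, 4)))
```

Note that early time points admit fewer valid spans, since a nomination cannot start before time 0.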
2. The event naming method combining boundary distribution and correction according to claim 1, wherein the boundary recurrent correction network is constructed from two layers of recurrent neural networks: the first-layer recurrent neural network computes the video feature output at each time point from the video features of the sample video; the second-layer recurrent neural network generates the bias information of the predicted event nominations and corrects the event nomination boundaries.
3. The event naming method combining boundary distribution and correction according to claim 2, wherein generating the bias information of the predicted event nominations and correcting the event nomination boundaries specifically comprises:
calculating the center-coordinate offset Δc of an event nomination and its scale change factor Δl from the predicted event nomination, with the specific formulas:
Δc = (G_c - P_c) / P_c;
Δl = log(G_l / P_l);
wherein G_c denotes the actual event nomination center coordinate and P_c the predicted event nomination center coordinate; G_l denotes the actual event nomination scale and P_l the predicted event nomination scale;
taking the center-coordinate offset Δc and the scale change factor Δl of the event nomination as supervision signals, and training the second-layer recurrent neural network with an L1-norm loss on these signals to obtain the predicted bias information of the event nomination, denoted Δc' and Δl';
correcting the predicted event nomination boundary with the bias information Δc' and Δl' to obtain the corrected event nomination center position P'_c and scale P'_l, specifically:
P'_c = P_c × (1 + Δc');
P'_l = P_l × exp(Δl');
and obtaining the corrected start-stop times from the corrected center position P'_c and scale P'_l, specifically:
P'_start = P'_c - P'_l / 2;
P'_end = P'_c + P'_l / 2;
wherein P'_start denotes the corrected event nomination start time and P'_end the corrected event nomination end time, completing the correction of the predicted event nomination boundary.
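Inverting the offset parameterisation Δc = (G_c - P_c)/P_c and Δl = log(G_l/P_l) gives the corrected segment; this sketch again assumes (start, end) pairs:

```python
import math

def correct_nomination(pred, dc, dl):
    """Apply predicted bias information (dc, dl) to a predicted
    nomination (start, end), returning the corrected (start, end)."""
    p_c = (pred[0] + pred[1]) / 2.0
    p_l = pred[1] - pred[0]
    c = p_c * (1.0 + dc)            # P'_c = P_c * (1 + Δc')
    l = p_l * math.exp(dl)          # P'_l = P_l * exp(Δl')
    return c - l / 2.0, c + l / 2.0

# applying the exact offsets between two segments recovers the target
start, end = correct_nomination((12.0, 18.0), 0.0, math.log(10.0 / 6.0))
```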
4. The event naming method combining boundary distribution and correction according to claim 3, wherein the event nomination network loss function is formed by the weighted superposition of a start-point distribution network loss sub-function loss_s(c_s, t, X, y_s), an end-point distribution network loss sub-function loss_e(c_e, t, X, y_e) and a boundary recurrent correction network loss sub-function loss_reg(t_i);
the boundary recurrent correction network loss sub-function loss_reg(t_i), an L1-norm loss on the bias information, is:
loss_reg(t_i) = Σ_k ( |Δc_k - Δc'_k| + |Δl_k - Δl'_k| );
wherein X represents the entire data set; y_t^k indicates whether the k-th event nomination at time point t is a real event (the supervision signal), and c_s^k, c_e^k are the confidences of that event nomination under the start-point and end-point distribution networks;
K represents the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the supervision signals for correcting the k-th event nomination at the t_i-th time point of the video, and Δc'_k and Δl'_k are the bias information predicted for the same event nomination at the same time point;
the loss function loss(c, t, X, y) is therefore specifically:
loss(c, t, X, y) = α*loss_s(c_s, t, X, y_s) + β*loss_e(c_e, t, X, y_e) + γ*loss_reg(t_i);
wherein α, β and γ are the weight coefficients of the three loss sub-functions.
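The weighted superposition can be sketched as below. The L1 regression term follows claim 3; binary cross-entropy is only an *assumed* form for the two distribution-network terms, since the patent text does not spell them out here:

```python
import numpy as np

def bce(conf, labels, eps=1e-7):
    """Assumed binary cross-entropy over per-time-point confidences."""
    conf = np.clip(conf, eps, 1.0 - eps)
    return -np.mean(labels * np.log(conf) + (1 - labels) * np.log(1 - conf))

def l1_reg(dc, dl, dc_pred, dl_pred):
    """L1 loss between offset supervision signals and predicted bias."""
    return np.sum(np.abs(dc - dc_pred) + np.abs(dl - dl_pred))

def total_loss(cs, ys, ce, ye, reg_args, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted superposition of the three loss sub-functions."""
    return alpha * bce(cs, ys) + beta * bce(ce, ye) + gamma * l1_reg(*reg_args)
```

In practice α, β and γ would be tuned so that no single term dominates training.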
5. The event naming method combining boundary distribution and correction according to claim 4, wherein all the recurrent neural networks in the event nomination network are trained and updated with the loss function, completing the training update of the event nomination network and obtaining the trained and updated event nomination network.
6. The event naming method combining boundary distribution and correction according to claim 5, wherein the specific process of predicting video events with the trained and updated event nomination network is:
s1: acquiring the video features of a sample video through a three-dimensional convolutional network;
s2: processing the video features with the trained and updated recurrent neural networks to obtain the video feature output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
s3: the start-point distribution network and the end-point distribution network each output a set of confidences at every time point, and the sum of the two networks' confidences for the corresponding event nomination is taken as the final event confidence, completing the prediction of event nominations;
s4: the boundary recurrent correction network generates the bias information of the predicted event nominations;
s5: the event confidences are sorted in descending order, the top 1000 event nominations are taken, and each is corrected according to its corresponding bias information to obtain the final predicted event nominations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910245568.2A CN110059584B (en) | 2019-03-28 | 2019-03-28 | Event naming method combining boundary distribution and correction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059584A CN110059584A (en) | 2019-07-26 |
CN110059584B true CN110059584B (en) | 2023-06-02 |
Family
ID=67317857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910245568.2A Active CN110059584B (en) | 2019-03-28 | 2019-03-28 | Event naming method combining boundary distribution and correction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059584B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114445757A (en) * | 2022-02-24 | 2022-05-06 | 腾讯科技(深圳)有限公司 | Nomination obtaining method, network training method, device, storage medium and equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9881380B2 (en) * | 2016-02-16 | 2018-01-30 | Disney Enterprises, Inc. | Methods and systems of performing video object segmentation |
WO2018125580A1 (en) * | 2016-12-30 | 2018-07-05 | Konica Minolta Laboratory U.S.A., Inc. | Gland segmentation with deeply-supervised multi-level deconvolution networks |
CN109101859A (en) * | 2017-06-21 | 2018-12-28 | 北京大学深圳研究生院 | The method for punishing pedestrian in detection image using Gauss |
CN108875624B (en) * | 2018-06-13 | 2022-03-25 | 华南理工大学 | Face detection method based on multi-scale cascade dense connection neural network |
CN108805083B (en) * | 2018-06-13 | 2022-03-01 | 中国科学技术大学 | Single-stage video behavior detection method |
CN109271876B (en) * | 2018-08-24 | 2021-10-15 | 南京理工大学 | Video motion detection method based on time evolution modeling and multi-example learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||