CN110059584B - Event naming method combining boundary distribution and correction

Event naming method combining boundary distribution and correction

Info

Publication number
CN110059584B
CN110059584B (application CN201910245568.2A)
Authority
CN
China
Prior art keywords
event
nomination
network
distribution network
video
Prior art date
Legal status
Active
Application number
CN201910245568.2A
Other languages
Chinese (zh)
Other versions
CN110059584A (en)
Inventor
田茜
郑慧诚
王腾
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201910245568.2A
Publication of CN110059584A
Application granted
Publication of CN110059584B
Legal status: Active

Classifications

    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06V20/41 — Scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 — Scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y04S10/50 — Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

The invention provides an event naming method combining boundary distribution and correction, which forms an event nomination network by constructing a starting point distribution network, an end point distribution network and a boundary circulation correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and uses the trained and updated event nomination network to predict event nominations for a video. The starting point distribution network and the end point distribution network are used for predicting event nominations; the boundary circulation correction network is used for generating offset information for the predicted event nominations and correcting the nomination boundaries. By incorporating the distribution of event start and end points in real videos, the method generates event nominations that fit the real event distribution, and by correcting the nomination boundaries with a cyclic correction network it obtains event nominations that better match real events and have more accurate boundaries.

Description

Event naming method combining boundary distribution and correction
Technical Field
The invention relates to the technical field of computer vision, in particular to an event naming method combining boundary distribution and correction.
Background
With the rapid development of the internet and portable devices, shooting video has become easier and more convenient, and large numbers of videos, varying widely in content and duration, are uploaded to the internet. Most video-based computer vision algorithms, such as action recognition, operate on short videos obtained by trimming long videos to some extent. Trimming long videos, however, incurs substantial labor and time costs, so processing and analyzing untrimmed long videos has become necessary to meet real-life demands.
Dense video description [1] is a recently proposed task on untrimmed long videos, which aims to describe each of the multiple events occurring in a video in natural language. It can be divided into two parts: first, localizing the events in the video, i.e., finding the start and end times of all events, which amounts to extracting event nominations; and second, describing each localized event in natural language. Its potential applications are very broad, including early childhood education, daily assistance for the blind, movie subtitling, and video retrieval and classification.
Robust dense video description relies on high-quality event nominations: the generated nominations must not only cover the time spans of all possible events but also have boundaries that match the real events in the video. Compared with image-based tasks such as object detection, high-quality event nomination requires extracting not only object information from the video but also the temporal motion of objects, i.e., the relevant dynamic information. In real life, the conditions under which videos are produced are unconstrained: multiple events in a video may overlap in time, and shooting angle, distance, and so on vary considerably, all of which makes event nomination very challenging.
Event nomination in video is somewhat analogous to object detection in images, and much research on event nomination has been inspired by object detection. In event nomination, however, one must attend not only to the appearance features of the video but also to dynamic temporal features, and the time spans of events often vary widely. Shou et al. [2] apply sliding windows of different scales to the video feature sequence and use [3] to extract features within each window to predict event nominations. This method, however, can only produce event nominations of a limited set of preset lengths and cannot flexibly adapt to the actual length of an event; moreover, the repeated scanning of the same frames by sliding windows of different scales introduces redundant computation. Gao et al. [4] further regress the boundaries of sliding-window-based event nominations to obtain more flexible and accurate boundaries. Chao et al. [5] extend the Faster R-CNN [6] framework into a two-stream network applied to video, and apply dilated convolution in the temporal dimension to enlarge the receptive field, thereby obtaining event nominations with larger time spans. Exploiting the memory of recurrent neural networks over video features, [7] feeds the video features within a sliding window into a recurrent neural network to predict the start and end times and the confidences of event nominations at different scales. Sliding windows, however, typically entail a large number of repeated operations; [8] avoids this by predicting event nominations of several different scales at each node of a recurrent neural network. The event nomination boundaries predicted by this method are fixed, though: the accuracy of the boundaries is limited by the feature extraction span, and it is difficult to fit the real event boundaries precisely.
Disclosure of Invention
The invention provides an event naming method combining boundary distribution and correction, in order to overcome the technical defects of existing recurrent-neural-network-based event nomination methods, namely fixed nomination boundaries whose accuracy is limited by the feature extraction span, making it difficult to fit real event boundaries.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an event naming method combining boundary distribution and correction, wherein an event nomination network is formed by constructing a starting point distribution network, an end point distribution network and a boundary circulation correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and the trained and updated event nomination network is used to predict event nominations for a video;
the starting point distribution network and the end point distribution network are used for predicting event nominations;
the boundary circulation correction network is used for generating offset information for the predicted event nominations and correcting the event nomination boundaries.
The construction process of the starting point distribution network and the ending point distribution network comprises the following steps:
normalizing the video lengths in the existing data set and determining the relative positions of event start and end points within each video;
counting the relative positions of all event start and end points across the videos in the data set, and taking the probability distributions w_s0 and w_e0 of all event start and end points along the video timeline, where w_s0 and w_e0 denote the event start point and end point probability distributions respectively, to obtain the starting point distribution network and the end point distribution network.
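The following Python sketch illustrates how such priors could be estimated; the function name, binning scheme, and annotation layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def estimate_boundary_priors(annotations, num_bins=100):
    """Estimate start/end point priors w_s0, w_e0 from a data set.

    annotations: list of (video_duration, [(event_start, event_end), ...]),
    with all times in seconds. Returns two length-num_bins arrays summing to 1.
    """
    start_hist = np.zeros(num_bins)
    end_hist = np.zeros(num_bins)
    for duration, events in annotations:
        for s, e in events:
            # Normalize event boundaries to relative positions in [0, 1).
            start_hist[min(int(s / duration * num_bins), num_bins - 1)] += 1
            end_hist[min(int(e / duration * num_bins), num_bins - 1)] += 1
    # Turn the counts into probability distributions over the timeline.
    return start_hist / start_hist.sum(), end_hist / end_hist.sum()
```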
The process of event nomination prediction by the starting point distribution network and the ending point distribution network specifically comprises the following steps:
acquiring the video features of a sample video through a three-dimensional convolutional network, and computing over the acquired video features with the recurrent neural networks of the starting point distribution network and the end point distribution network to obtain the video features output at each time point in the two networks;
outputting K confidences at each time point in the starting point distribution network and the end point distribution network, the confidences representing the likelihoods of K event nominations of fixed lengths, the K candidate intervals being:

[t − k, t + 1], k ∈ [0, K];

where t and k satisfy t ≥ k, and t ranges with the length of the video; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of the corresponding event nomination in the starting point distribution network and the end point distribution network is taken as the final confidence of the event nomination, thereby completing the prediction of event nominations.
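As an illustration of this scoring scheme, the sketch below assumes the two networks' outputs are given as (T, K) confidence arrays; all names are illustrative.

```python
import numpy as np

def enumerate_nominations(conf_start, conf_end):
    """Combine the two networks' confidences into scored nominations.

    conf_start, conf_end: arrays of shape (T, K); entry [t, k] scores the
    candidate interval [t - k, t + 1], valid only where t >= k.
    Returns a list of (start, end, confidence) tuples.
    """
    T, K = conf_start.shape
    nominations = []
    for t in range(T):
        for k in range(min(t + 1, K)):
            # Final confidence is the sum of the start- and end-network scores.
            nominations.append((t - k, t + 1,
                                float(conf_start[t, k] + conf_end[t, k])))
    return nominations
```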
The boundary circulation correction network is constructed from two layers of recurrent neural networks: the first-layer recurrent neural network computes the video features output at each time point from the video features of the sample video, while the second-layer recurrent neural network generates offset information for the predicted event nominations and corrects the nomination boundaries.
The process of generating the offset information of the predicted event nominations and correcting the nomination boundaries is specifically:
calculating the event nomination center coordinate offset Δc and the event nomination scale change factor Δl from the predicted event nominations, with the specific formulas:

Δc = (G_c − P_c) / P_c;

Δl = log(G_l / P_l);

where G_c denotes the real event nomination center coordinate, P_c the predicted event nomination center coordinate, G_l the real event nomination scale, and P_l the predicted event nomination scale;
taking the event nomination center coordinate offset Δc and scale change factor Δl as supervision signals, and training the second-layer recurrent neural network against these supervision signals with an L1-norm loss to obtain the predicted offset information of the event nominations, denoted Δc' and Δl';
correcting the predicted event nomination boundaries with the offset information Δc' and Δl' to obtain the corrected event nomination center position P_c' and scale P_l', specifically:

P_c' = P_c / (1 + Δc');

P_l' = P_l · exp(Δl');

and computing the corrected start and end times from the corrected center position P_c' and scale P_l', specifically:

P'_start = P_c' − P_l' / 2;

P'_end = P_c' + P_l' / 2;

where P'_start denotes the corrected event nomination start time and P'_end the corrected event nomination end time, completing the correction of the predicted event nomination boundaries.
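The correction step can be sketched as follows. The targets Δc and Δl follow the definitions above; the scale update P_l' = P_l · exp(Δl') and the start/end conversion are reconstructed from those definitions, since the corresponding formulas appear only as images in the original.

```python
import math

def regression_targets(G_c, G_l, P_c, P_l):
    """Supervision signals Δc and Δl, as defined in the text."""
    delta_c = (G_c - P_c) / P_c      # center coordinate offset
    delta_l = math.log(G_l / P_l)    # log-scale change factor
    return delta_c, delta_l

def apply_correction(P_c, P_l, delta_c_pred, delta_l_pred):
    """Correct a predicted nomination using the predicted offsets Δc', Δl'."""
    P_c_new = P_c / (1.0 + delta_c_pred)    # center update, as stated in the text
    P_l_new = P_l * math.exp(delta_l_pred)  # scale update (assumed inverse of Δl)
    # Convert the corrected center/scale back to start and end times.
    return P_c_new - P_l_new / 2.0, P_c_new + P_l_new / 2.0
```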
The event nomination network loss function is formed by the weighted superposition of a starting point distribution network loss sub-function, an end point distribution network loss sub-function and a boundary circulation correction network loss sub-function; wherein the starting point distribution network loss sub-function loss_s(c_s, t, X, y_s) is:

[formula rendered as an image in the original]

the end point distribution network loss sub-function loss_e(c_e, t, X, y_e) is:

[formula rendered as an image in the original]

and the boundary circulation correction network loss sub-function loss_reg(t_i) is:

[formula rendered as an image in the original]

where X denotes the entire data set; y^k_{t_i} is the supervision signal indicating whether the k-th event nomination at the t_i-th time point is a real event, and c^k_{t_i} is the confidence of that event nomination under the starting point distribution network and the end point distribution network; K denotes the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the supervision signals for correcting the k-th event nomination at the t_i-th time point of the video, and Δc'_k and Δl'_k are the offset information predicted for the same event nomination at the same time point;
therefore, the loss function loss(c, t, X, y) is specifically:

loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);

where α, β and γ are the weight coefficients of the three loss sub-functions.
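A sketch of this weighted loss follows. The binary cross-entropy form of loss_s and loss_e is an assumption, since the exact sub-loss formulas appear only as images in the original; the L1 regression term follows the text above.

```python
import torch
import torch.nn.functional as F

def nomination_loss(c_s, c_e, y_s, y_e, offsets_pred, offsets_true,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted superposition of the three sub-losses (hypothetical forms).

    c_s, c_e: confidences from the start/end networks, shape (T, K), in [0, 1].
    y_s, y_e: binary supervision signals, same shape.
    offsets_pred, offsets_true: predicted (Δc', Δl') and target (Δc, Δl)
    values, shape (T, K, 2).
    """
    loss_s = F.binary_cross_entropy(c_s, y_s)      # start-network term (assumed BCE)
    loss_e = F.binary_cross_entropy(c_e, y_e)      # end-network term (assumed BCE)
    loss_reg = F.l1_loss(offsets_pred, offsets_true)  # L1 norm, per the patent text
    return alpha * loss_s + beta * loss_e + gamma * loss_reg
```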
All the recurrent neural networks in the event nomination network are trained and updated with this loss function, completing the training and updating of the event nomination network and yielding the trained and updated event nomination network.
The specific process of predicting event nominations for a video with the trained and updated event nomination network is as follows:
S1: acquire the video features of the sample video through a three-dimensional convolutional network;
S2: compute over the video features with the trained recurrent neural networks to obtain the video features output at each time point of the starting point distribution network, the end point distribution network and the boundary circulation correction network, respectively;
S3: output several confidences at each time point in the starting point distribution network and the end point distribution network, and take the sum of the confidences of the corresponding event nomination in the two networks as the final confidence of the event nomination, completing the prediction of event nominations;
S4: generate the offset information of the predicted event nominations with the boundary circulation correction network;
S5: sort the event nominations by confidence in descending order, take the top 1000, and correct them according to the corresponding offset information to obtain the final predicted event nominations.
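Putting the pieces together, the prediction procedure S1-S5 might look as follows; names are illustrative, and apply_correction is the sketch given earlier.

```python
def predict_nominations(nominations, offsets, top_n=1000):
    """Rank nominations by confidence, keep the top N, correct the boundaries.

    nominations: list of (start, end, confidence) from the two distribution
    networks; offsets: a parallel list of predicted (Δc', Δl') pairs from the
    boundary circulation correction network.
    """
    ranked = sorted(zip(nominations, offsets),
                    key=lambda pair: pair[0][2], reverse=True)[:top_n]
    results = []
    for (start, end, conf), (dc, dl) in ranked:
        # Convert each interval to center/scale, correct it, convert back.
        center, length = (start + end) / 2.0, end - start
        new_start, new_end = apply_correction(center, length, dc, dl)
        results.append((new_start, new_end, conf))
    return results
```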
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The event naming method combining boundary distribution and correction provided by the invention incorporates the distribution of event start and end points in real videos to generate event nominations that fit the real event distribution, and corrects the nomination boundaries with a cyclic correction network, obtaining event nominations that better match real events and have more accurate boundaries.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, an event naming method combining boundary distribution and correction forms an event nomination network by constructing a starting point distribution network, an end point distribution network and a boundary circulation correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and uses the trained and updated event nomination network to predict event nominations for a video;
the starting point distribution network and the end point distribution network are used for predicting event nominations;
the boundary circulation correction network is used for generating offset information for the predicted event nominations and correcting the event nomination boundaries.
More specifically, the construction process of the starting point distribution network and the ending point distribution network comprises the following steps:
normalizing the video lengths in the existing data set and determining the relative positions of event start and end points within each video;
counting the relative positions of all event start and end points across the videos in the data set, and taking the probability distributions w_s0 and w_e0 of all event start and end points along the video timeline, where w_s0 and w_e0 denote the event start point and end point probability distributions respectively, to obtain the starting point distribution network and the end point distribution network.
In a specific implementation, when compiling the data set statistics, event nominations whose temporal intersection-over-union (tIoU) with a real event exceeds a threshold σ are taken as positive samples, and the start and end point distributions of all positive samples are counted.
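An illustrative implementation of this positive-sample selection, assuming intervals are (start, end) pairs; the default σ = 0.5 is an assumption, as the patent leaves the threshold open.

```python
def t_iou(a, b):
    """Temporal intersection-over-union of two intervals a = (s, e), b = (s, e)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def positive_samples(candidates, ground_truth, sigma=0.5):
    """Keep candidate nominations whose tIoU with some real event exceeds sigma."""
    return [c for c in candidates
            if any(t_iou(c, g) > sigma for g in ground_truth)]
```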
More specifically, the process of event nomination prediction by the starting point distribution network and the end point distribution network is as follows:
acquiring the video features of a sample video through a three-dimensional convolutional network, and computing over the acquired video features with the recurrent neural networks of the starting point distribution network and the end point distribution network to obtain the video features output at each time point in the two networks;
outputting K confidences at each time point in the starting point distribution network and the end point distribution network, the confidences representing the likelihoods of K event nominations of fixed lengths, the K candidate intervals being:

[t − k, t + 1], k ∈ [0, K];

where t and k satisfy t ≥ k, and t ranges with the length of the video; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of the corresponding event nomination in the starting point distribution network and the end point distribution network is taken as the final confidence of the event nomination, thereby completing the prediction of event nominations.
More specifically, the boundary circulation correction network is constructed from two layers of recurrent neural networks: the first-layer recurrent neural network computes the video features output at each time point from the video features of the sample video, while the second-layer recurrent neural network generates offset information for the predicted event nominations and corrects the nomination boundaries.
More specifically, the process of generating the offset information of the predicted event nominations and correcting the nomination boundaries is:
calculating the event nomination center coordinate offset Δc and the event nomination scale change factor Δl from the predicted event nominations, with the specific formulas:

Δc = (G_c − P_c) / P_c;

Δl = log(G_l / P_l);

where G_c denotes the real event nomination center coordinate, P_c the predicted event nomination center coordinate, G_l the real event nomination scale, and P_l the predicted event nomination scale;
taking the event nomination center coordinate offset Δc and scale change factor Δl as supervision signals, and training the second-layer recurrent neural network against these supervision signals with an L1-norm loss to obtain the predicted offset information of the event nominations, denoted Δc' and Δl';
correcting the predicted event nomination boundaries with the offset information Δc' and Δl' to obtain the corrected event nomination center position P_c' and scale P_l', specifically:

P_c' = P_c / (1 + Δc');

P_l' = P_l · exp(Δl');

and computing the corrected start and end times from the corrected center position P_c' and scale P_l', specifically:

P'_start = P_c' − P_l' / 2;

P'_end = P_c' + P_l' / 2;

where P'_start denotes the corrected event nomination start time and P'_end the corrected event nomination end time, completing the correction of the predicted event nomination boundaries.
More specifically, the event nomination network loss function is formed by the weighted superposition of a starting point distribution network loss sub-function, an end point distribution network loss sub-function and a boundary circulation correction network loss sub-function; wherein the starting point distribution network loss sub-function loss_s(c_s, t, X, y_s) is:

[formula rendered as an image in the original]

the end point distribution network loss sub-function loss_e(c_e, t, X, y_e) is:

[formula rendered as an image in the original]

and the boundary circulation correction network loss sub-function loss_reg(t_i) is:

[formula rendered as an image in the original]

where X denotes the entire data set; y^k_{t_i} is the supervision signal indicating whether the k-th event nomination at the t_i-th time point is a real event, and c^k_{t_i} is the confidence of that event nomination under the starting point distribution network and the end point distribution network; K denotes the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the supervision signals for correcting the k-th event nomination at the t_i-th time point of the video, and Δc'_k and Δl'_k are the offset information predicted for the same event nomination at the same time point;
therefore, the loss function loss(c, t, X, y) is specifically:

loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);

where α, β and γ are the weight coefficients of the three loss sub-functions.
More specifically, all the recurrent neural networks in the event nomination network are trained and updated with the loss function, completing the training and updating of the event nomination network and yielding the trained and updated event nomination network.
More specifically, the specific process of predicting event nominations for a video with the trained and updated event nomination network is as follows:
S1: acquire the video features of the sample video through a three-dimensional convolutional network;
S2: compute over the video features with the trained recurrent neural networks to obtain the video features output at each time point of the starting point distribution network, the end point distribution network and the boundary circulation correction network, respectively;
S3: output several confidences at each time point in the starting point distribution network and the end point distribution network, and take the sum of the confidences of the corresponding event nomination in the two networks as the final confidence of the event nomination, completing the prediction of event nominations;
S4: generate the offset information of the predicted event nominations with the boundary circulation correction network;
S5: sort the event nominations by confidence in descending order, take the top 1000, and correct them according to the corresponding offset information to obtain the final predicted event nominations.
Example 2
Building on Embodiment 1, more specifically, training and validation are performed on ActivityNet, which contains 20,000 untrimmed videos totaling 849 hours, annotated with roughly 100,000 descriptive sentences; in this data set each video contains multiple events together with their descriptive sentences, and the events within a video differ in start/end times and durations. ActivityNet has three parts: a training set, a validation set and a test set, with 10,024, 4,926 and 5,044 videos respectively; the experiments in this embodiment are mainly conducted on the training and validation sets.
When extracting features with the three-dimensional convolutional network [9], one video feature is extracted every 64 frames, and the feature dimension is compressed to 500 by principal component analysis. The recurrent neural network used is a long short-term memory network (LSTM) with dimension 512, and K in the model is set to 256. In the language model that generates sentences, each sentence is capped at 32 words, and words occurring fewer than 3 times are removed when building the vocabulary. The event nomination network is first trained until stable and then trained jointly with the language model, with the learning rate set to 5e-5.
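For reference, the experimental settings above can be summarized in a configuration sketch (the key names are illustrative):

```python
# A hypothetical configuration object mirroring the settings described above.
CONFIG = {
    "frames_per_feature": 64,   # one C3D feature extracted every 64 frames
    "pca_dim": 500,             # feature dimension after PCA compression
    "lstm_dim": 512,            # LSTM hidden dimension
    "K": 256,                   # nominations scored per time point
    "max_sentence_len": 32,     # word cap per generated sentence
    "min_word_count": 3,        # rarer words dropped from the vocabulary
    "learning_rate": 5e-5,      # joint training learning rate
}
```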
In a specific implementation, two metrics are generally used to evaluate the quality of event nominations: recall and precision. Recall evaluates how many real events are covered by the predicted event nominations, and precision evaluates how many of the predicted event nominations are correct. In addition, there is a comprehensive metric for event nomination, the f1 score, which combines precision and recall and is computed from them as follows:

f1 = 2 × precision × recall / (precision + recall);
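A direct implementation of this definition:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as defined above."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Example: f1_score(0.5, 0.7) == 2 * 0.35 / 1.2, approximately 0.583
```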
The comparison on ActivityNet between the event naming method combining boundary distribution and correction provided by the invention and existing methods is shown in Table 1:

TABLE 1

Method | Recall@1000 | Precision@1000 | f1 score@1000
SST [13] | 0.716 | 0.533 | 0.571
Start and stop point modeling | 0.731 | 0.530 | 0.573
Boundary regression | 0.704 | 0.561 | 0.590
Start-stop distribution + boundary regression | 0.716 | 0.560 | 0.592
As shown in Table 1, @1000 denotes the top 1000 event nominations by confidence. Comparing the method of the invention with existing work, the method builds on SST [13], which likewise predicts event nominations with a recurrent neural network. SST, however, does not exploit statistics such as the distribution of nomination start and end points, and its predicted nomination boundaries are fixed. By counting the event start and end points, the recall is clearly improved; after further combining boundary regression on the event nominations, precision and the f1 score improve substantially while recall remains essentially unchanged.
In a specific implementation, when event nominations are evaluated on the dense video description task, the main metrics are BLEU-1, BLEU-2, BLEU-3, BLEU-4, Meteor, Rouge-L and CIDEr-D, which measure the similarity between the descriptions generated from the event nominations and the ground-truth descriptions. Among these, Meteor correlates best with human judgment, so we focus mainly on the network's performance on that metric.
As shown in Table 2, compared with existing methods on the dense video description task, the method of the invention performs better on most metrics, especially Meteor, demonstrating the effectiveness of the event nomination network.

Table 2. Experimental results on the ActivityNet Captions validation set

[Table 2 is rendered as images in the original and is not reproduced here.]
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.
[1] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in Proc. IEEE International Conference on Computer Vision, 2017, pp. 706-715.
[2] Z. Shou, D. Wang, and S. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049-1058.
[3] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[4] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia, "TURN TAP: Temporal unit regression network for temporal action proposals," in Proc. IEEE International Conference on Computer Vision, 2017, pp. 3648-3656.
[5] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, "Rethinking the faster R-CNN architecture for temporal action localization," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1130-1139.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[7] V. Escorcia, F. Caba, J. C. Niebles, and B. Ghanem, "DAPs: Deep action proposals for action understanding," in Proc. European Conference on Computer Vision, 2016, pp. 768-784.
[8] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, "SST: Single-stream temporal action proposals," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6373-6382.
[9] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[10] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 706-715.
[11] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, "Jointly localizing and describing events for dense video captioning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7492-7500.
[12] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, "Bidirectional attentive fusion with context gating for dense video captioning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7190-7198.
[13] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, "SST: Single-stream temporal action proposals," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6373-6382.

Claims (6)

1. An event naming method combining boundary distribution and correction, characterized in that: an event nomination network is formed by constructing a starting point distribution network, an end point distribution network and a boundary circulation correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and the trained and updated event nomination network is used to predict event nominations for a video;
the starting point distribution network and the end point distribution network are used for predicting event nominations;
the boundary circulation correction network is used for generating offset information for the predicted event nominations and correcting the event nomination boundaries;
the construction process of the starting point distribution network and the ending point distribution network comprises the following steps:
normalizing the video lengths in the existing data set and determining the relative positions of event start and end points within each video;
counting the relative positions of all event start and end points across the videos in the data set, and taking the probability distributions w_s0 and w_e0 of all event start and end points along the video timeline, where w_s0 and w_e0 denote the event start point and end point probability distributions respectively, to obtain the starting point distribution network and the end point distribution network;
the process of predicting event nomination by the starting point distribution network and the ending point distribution network specifically comprises the following steps:
acquiring the video features of a sample video through a three-dimensional convolutional network, and computing over the acquired video features with the recurrent neural networks of the starting point distribution network and the end point distribution network to obtain the video features output at each time point in the two networks;
outputting K confidences at each time point in the starting point distribution network and the end point distribution network, the confidences representing the likelihoods of K event nominations of fixed lengths, the K candidate intervals being:

[t − k, t + 1], k ∈ [0, K];

where t and k satisfy t ≥ k, and t ranges with the length of the video; the higher the confidence, the more likely the interval is an event nomination; and the sum of the confidences of the corresponding event nomination in the starting point distribution network and the end point distribution network is taken as the final confidence of the event nomination, so as to complete the prediction of event nominations.
2. The event naming method combining boundary distribution and correction according to claim 1, wherein the boundary circulation correction network is constructed from two layers of recurrent neural networks: the first-layer recurrent neural network computes the video features output at each time point from the video features of the sample video, and the second-layer recurrent neural network generates offset information for the predicted event nominations and corrects the nomination boundaries.
3. The event naming method combining boundary distribution and correction according to claim 2, wherein generating the offset information of the predicted event nominations and correcting the event nomination boundaries specifically comprises:
calculating the event nomination center coordinate offset Δc and the event nomination scale change factor Δl from the predicted event nominations, with the specific formulas:

Δc = (G_c − P_c) / P_c;

Δl = log(G_l / P_l);

where G_c denotes the real event nomination center coordinate, P_c the predicted event nomination center coordinate, G_l the real event nomination scale, and P_l the predicted event nomination scale;
taking the event nomination center coordinate offset Δc and scale change factor Δl as supervision signals, and training the second-layer recurrent neural network against these supervision signals with an L1-norm loss to obtain the predicted offset information of the event nominations, denoted Δc' and Δl';
correcting the predicted event nomination boundaries with the offset information Δc' and Δl' to obtain the corrected event nomination center position P_c' and scale P_l', specifically:

P_c' = P_c / (1 + Δc');

P_l' = P_l · exp(Δl');

and computing the corrected start and end times from the corrected center position P_c' and scale P_l', specifically:

P'_start = P_c' − P_l' / 2;

P'_end = P_c' + P_l' / 2;

where P'_start denotes the corrected event nomination start time and P'_end the corrected event nomination end time, completing the correction of the predicted event nomination boundaries.
4. The event naming method combining boundary distribution and correction according to claim 3, wherein the event nomination network loss function is formed by the weighted superposition of a starting point distribution network loss sub-function, an end point distribution network loss sub-function and a boundary circulation correction network loss sub-function; wherein the starting point distribution network loss sub-function loss_s(c_s, t, X, y_s) is:

[formula rendered as an image in the original]

the end point distribution network loss sub-function loss_e(c_e, t, X, y_e) is:

[formula rendered as an image in the original]

and the boundary circulation correction network loss sub-function loss_reg(t_i) is:

[formula rendered as an image in the original]

where X denotes the entire data set; y^k_{t_i} is the supervision signal indicating whether the k-th event nomination at the t_i-th time point is a real event, and c^k_{t_i} is the confidence of that event nomination under the starting point distribution network and the end point distribution network; K denotes the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the supervision signals for correcting the k-th event nomination at the t_i-th time point of the video, and Δc'_k and Δl'_k are the offset information predicted for the same event nomination at the same time point;
therefore, the loss function loss(c, t, X, y) is specifically:

loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);

where α, β and γ are the weight coefficients of the three loss sub-functions.
5. The event naming method combining boundary distribution and correction according to claim 4, wherein all the recurrent neural networks in the event nomination network are trained and updated with the loss function, completing the training and updating of the event nomination network and yielding the trained and updated event nomination network.
6. The event naming method combining boundary distribution and correction according to claim 5, wherein the specific process of predicting event nominations for a video with the trained and updated event nomination network is:
S1: acquire the video features of the sample video through a three-dimensional convolutional network;
S2: compute over the video features with the trained recurrent neural networks to obtain the video features output at each time point of the starting point distribution network, the end point distribution network and the boundary circulation correction network, respectively;
S3: output several confidences at each time point in the starting point distribution network and the end point distribution network, and take the sum of the confidences of the corresponding event nomination in the two networks as the final confidence of the event nomination, completing the prediction of event nominations;
S4: generate the offset information of the predicted event nominations with the boundary circulation correction network;
S5: sort the event nominations by confidence in descending order, take the top 1000, and correct them according to the corresponding offset information to obtain the final predicted event nominations.
CN201910245568.2A 2019-03-28 2019-03-28 Event naming method combining boundary distribution and correction Active CN110059584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910245568.2A CN110059584B (en) 2019-03-28 2019-03-28 Event naming method combining boundary distribution and correction

Publications (2)

Publication Number Publication Date
CN110059584A CN110059584A (en) 2019-07-26
CN110059584B true CN110059584B (en) 2023-06-02

Family

ID=67317857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910245568.2A Active CN110059584B (en) 2019-03-28 2019-03-28 Event naming method combining boundary distribution and correction

Country Status (1)

Country Link
CN (1) CN110059584B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445757A (en) * 2022-02-24 2022-05-06 腾讯科技(深圳)有限公司 Nomination obtaining method, network training method, device, storage medium and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881380B2 (en) * 2016-02-16 2018-01-30 Disney Enterprises, Inc. Methods and systems of performing video object segmentation
WO2018125580A1 (en) * 2016-12-30 2018-07-05 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN109101859A (en) * 2017-06-21 2018-12-28 北京大学深圳研究生院 The method for punishing pedestrian in detection image using Gauss
CN108875624B (en) * 2018-06-13 2022-03-25 华南理工大学 Face detection method based on multi-scale cascade dense connection neural network
CN108805083B (en) * 2018-06-13 2022-03-01 中国科学技术大学 Single-stage video behavior detection method
CN109271876B (en) * 2018-08-24 2021-10-15 南京理工大学 Video motion detection method based on time evolution modeling and multi-example learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant