CN110348345A - Weakly supervised temporal action localization method based on action consistency - Google Patents
- Publication number
- CN110348345A (application number CN201910575033.1A)
- Authority
- CN
- China
- Prior art keywords
- segment
- action
- consistency
- rgb
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention belongs to the field of machine vision and discloses a weakly supervised temporal action localization method based on action consistency, comprising: splitting a video into RGB frames and optical flow, which are processed separately. For each modality, hypothesized action segments of different lengths are first proposed at every time point of the video, and the segments are then regressed with convolutional neural networks according to the action consistency and classification accuracy of the video. The action segments obtained from the two modalities are combined by a screening-and-fusion module, which filters out the final action localization result. Given only the video-level class label, the present invention can localize the action segments of that class within the video.
Description
Technical field
The invention belongs to the technical field of computer vision and relates to weakly supervised temporal action localization methods, in particular to a weakly supervised temporal action localization method based on action consistency.
Background art
Temporal action localization is an important computer vision problem, with essential applications in video understanding tasks such as event detection, video summarization, and video question answering.
At present, most temporal action localization methods require precise temporal annotations, which consume considerable manpower and resources; moreover, such annotations may be inaccurate because action boundaries are inherently ambiguous. In addition, current temporal action localization methods do not process RGB and optical flow separately, ignoring the characteristics of each modality; the final segment score is obtained from classification alone, which ignores the differences between RGB and optical flow, depends heavily on the classification network, and rarely yields an optimal result.
In summary, a new weakly supervised temporal action localization method is needed.
Summary of the invention
The purpose of the present invention is to provide a weakly supervised temporal action localization method based on action consistency, to solve one or more of the above technical problems. In the present invention, the video is split into RGB and optical flow, which are processed separately; candidate action segments are proposed for each modality and then screened and fused to produce the final result, yielding better localization results.
In order to achieve the above objectives, the invention adopts the following technical scheme.
A weakly supervised temporal action localization method based on action consistency comprises the following steps:
Step 1: divide the video to be processed into multiple non-overlapping segments and obtain the RGB feature and optical-flow feature of each segment.
Step 2: perform action-segment regression separately on the RGB features and optical-flow features obtained in step 1 to obtain RGB action segments and optical-flow action segments. The regression proceeds as follows: for each time point of the video to be processed, enumerate hypothesized action segments of different preset lengths; regress the action segments of each length with a predetermined regression network; train the regression networks with the action-consistency loss function; and obtain the regressed action segments.
Step 3: evaluate the confidence of the RGB action segments and optical-flow action segments obtained in step 2 with the action-consistency loss function; use non-maximum suppression to filter out action segments whose overlap exceeds a threshold.
Step 4: after the regression networks are trained, screen and fuse the RGB action segments and optical-flow action segments through a parameter-free fusion module to obtain the final localization result.
A further improvement of the present invention is that step 1 specifically includes: dividing the video to be processed into multiple non-overlapping segments; sampling each segment evenly; extracting features of the sampled frames with a convolutional neural network; and taking the extracted features as the representation of the segment. Features are extracted separately for RGB and optical flow.
A further improvement of the present invention is that in step 2, the action-segment regression specifically includes: for a hypothesized action segment of length P, the boundaries are regressed from the outputs of the regression network, where x_s is the index of the start boundary, x_e is the index of the end boundary, φ_s is the result regressed at the start-boundary position, φ_e is the result regressed at the end-boundary position, and P is the length of the action segment.
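The boundary update can be illustrated with a minimal sketch. The exact regression formula is not reproduced in this text (it appeared as an image in the original patent), so the sketch assumes a common form in which the regressed offsets φ_s, φ_e are scaled by the hypothesized length P and the result is clamped to the video extent; that scaling is an assumption, not the patent's confirmed rule.

```python
def regress_boundaries(x_s: int, x_e: int, phi_s: float, phi_e: float,
                       P: int, T: int):
    """Refine a hypothesized segment [x_s, x_e] of length P.
    phi_s / phi_e are the regression-network outputs at the start and
    end positions; offsets are scaled by P (assumed form) and the
    refined boundaries are clamped to the video extent [0, T-1]."""
    new_s = min(max(int(round(x_s + P * phi_s)), 0), T - 1)
    new_e = min(max(int(round(x_e + P * phi_e)), 0), T - 1)
    return new_s, new_e

# A segment [10, 25] (P=16) nudged right at the start, left at the end:
print(regress_boundaries(10, 25, 0.125, -0.125, 16, 100))  # (12, 23)
```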
A further improvement of the present invention is that in step 3, the action-consistency loss function consists of two parts.
One part, L_c, characterizes the cosine similarity between the features of the action segment and the features of its context; the segment and its context are represented by features derived from F(u), the feature representation of the video at time point u.
The other part, L_s, is characterized by classification confidence, where S(k, u) denotes the classification confidence of segment u for class k. For an action segment [x_s, x_e], the segment is extended to [X_s, X_e], with X_s < x_s and X_e > x_e; the extended margins serve as the context of the segment.
The expression of the action-consistency loss function is:

L = α·L_c + (1 − α)·(L_s − 1)

where α is a hyperparameter with 0 < α < 1.
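The loss above can be sketched as follows. Only the combination L = α·L_c + (1 − α)·(L_s − 1) and the nature of the two parts (cosine similarity with context; classification confidence) come from the text; the per-term averaging used here, and taking L_s as the mean class-k confidence over the context, are assumptions filling in formulas that are not reproduced.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_loss(F: np.ndarray, S: np.ndarray, k: int,
                     x_s: int, x_e: int, X_s: int, X_e: int,
                     alpha: float = 0.6) -> float:
    """Action-consistency loss L = alpha*L_c + (1-alpha)*(L_s - 1).
    F: (T, D) per-position features; S: (T, K) per-position class scores.
    L_c: cosine similarity of segment vs. context features (averaging is
    an assumption).  L_s: mean class-k confidence over the context
    (assumed form)."""
    ctx = list(range(X_s, x_s)) + list(range(x_e + 1, X_e + 1))
    seg = list(range(x_s, x_e + 1))
    f_seg, f_ctx = F[seg].mean(axis=0), F[ctx].mean(axis=0)
    L_c = cosine(f_seg, f_ctx)       # low when segment differs from context
    L_s = float(S[ctx, k].mean())    # low when context lacks class-k evidence
    return alpha * L_c + (1 - alpha) * (L_s - 1.0)
```

Minimizing L pushes the segment's features away from its context and pushes class-k confidence out of the context, which is the "action consistency" intuition.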
A further improvement of the present invention is that in step 4, after the regression networks are trained, the action segments obtained for RGB and for optical flow are screened and fused as follows:
let the action segments obtained from RGB and from optical flow number N_r and N_f, respectively;
for each RGB action segment p_r, compute its maximum IoU with the optical-flow segments:

I(p_r) = max over j = 1..N_f of IoU(p_r, p_f,j)

The final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) less than a preset threshold.
A further improvement of the present invention is that the final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) < 0.4.
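The parameter-free screening-and-fusion step can be sketched directly; the only assumption is representing a temporal segment as a (start, end) pair.

```python
def t_iou(a, b) -> float:
    """Temporal IoU of two segments given as (start, end) pairs."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def screen_and_fuse(rgb_segs, flow_segs, thresh: float = 0.4):
    """Keep every optical-flow segment, plus each RGB segment whose
    maximum IoU with any flow segment is below thresh, i.e. RGB
    proposals that flow did not already cover."""
    kept_rgb = [p for p in rgb_segs
                if max((t_iou(p, q) for q in flow_segs), default=0.0) < thresh]
    return flow_segs + kept_rgb

rgb = [(0, 10), (50, 60)]
flow = [(1, 11), (80, 90)]
print(screen_and_fuse(rgb, flow))  # [(1, 11), (80, 90), (50, 60)]
```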
A further improvement of the present invention is that α is set as 0.6.
A further improvement of the present invention is that in step 2, the RGB feature R_s obtained in step 1 is input to multiple regression networks; each regression network consists of a 3-layer 1D convolutional neural network and is assigned one segment length P.
The last layer of the regression network has 2 convolution kernels of size 1, whose outputs regress the start boundary and the end boundary, respectively.
A further improvement of the present invention is that the first two layers of the regression network consist of dilated convolutions.
A further improvement of the present invention is that the dilation rates of the first two dilated-convolution layers of the regression network are chosen so that the receptive field of the network equals P.
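The receptive-field condition can be checked with a short sketch. The receptive field of stacked stride-1 1D convolutions is RF = 1 + Σ(kᵢ − 1)·dᵢ; with kernels (3, 3, 1) only the first two layers contribute. The exact dilation rates are not reproduced in this text, so using one equal rate d for both layers (giving 1 + 4d = P, i.e. d = (P − 1)/4) is an assumed rule.

```python
def receptive_field(kernels, dilations) -> int:
    """Receptive field of stacked stride-1 1D convolutions:
    RF = 1 + sum((k - 1) * d) over the layers."""
    return 1 + sum((k - 1) * d for k, d in zip(kernels, dilations))

def dilation_for_length(P: int) -> int:
    """Assumed rule: equal dilation d for the two size-3 layers so that
    the 3-layer network (kernels 3, 3, 1) has receptive field P:
    1 + 2d + 2d = P  =>  d = (P - 1) // 4."""
    return max(1, (P - 1) // 4)

# For a hypothesized length P = 9: d = 2 and RF = 1 + 4 + 4 = 9.
d = dilation_for_length(9)
print(d, receptive_field([3, 3, 1], [d, d, 1]))  # 2 9
```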
Compared with the prior art, the invention has the following advantages:
In the present invention, the video is split into RGB and optical flow, which are processed separately; candidate action segments are proposed for each modality and then screened and fused to produce the final result. This reduces the interference that may arise in conventional methods from directly processing the concatenated RGB and optical-flow features, and yields better localization results. The representations of the two modalities differ, and processing them separately better exploits the characteristics of each; the invention designs a screening-and-fusion module to combine the action localization results of the two modalities. Based on the consistency of video actions, the method proposes a loss function built on cosine similarity that combines action-segment features with action-segment classification confidence, which to some extent avoids the limitation of conventional methods that assess action segments by classification confidence alone. The method can localize the action segments in a video knowing only the action classes present in it.
Further, to avoid overfitting, the first two layers of the regression network consist of dilated convolutions; to guarantee that enough contextual information is input to the network, the receptive field of the regression network is configured to equal P, which is done by setting the dilation rates of the first two dilated-convolution layers accordingly.
Description of the drawings
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of a weakly supervised temporal action localization method based on action consistency according to an embodiment of the present invention;
Fig. 2 is a diagram comparing the test results of the embodiment of the present invention with historical experimental results on the THUMOS14 dataset;
Fig. 3 is a diagram comparing the test results of the embodiment of the present invention with historical experimental results on the ActivityNet dataset.
Specific embodiment
To keep the purpose, technical effect and technical solution of the embodiment of the present invention clearer, implement below with reference to the present invention
Attached drawing in example, technical scheme in the embodiment of the invention is clearly and completely described;Obviously, described embodiment
It is a part of the embodiment of the present invention.Based on embodiment disclosed by the invention, those of ordinary skill in the art are not making creation
Property labour under the premise of other embodiments obtained, all should belong to the scope of protection of the invention.
Referring to Fig. 1, a weakly supervised temporal action localization method based on action consistency of the present invention includes the following steps:
Step 1: feature extraction. The video is divided into multiple non-overlapping 15-frame segments; 3 frames are taken from each segment, and their features are extracted with a convolutional neural network to serve as the representation of the segment. Features are extracted separately for RGB and optical flow as the input of the next step.
Step 2: action-segment regression. This step applies the same processing to RGB and optical flow separately. For each time point of the video, hypothesized action segments of different lengths are enumerated, and all action segments of the same length are regressed with the same convolutional neural network.
Specifically, for a hypothesized action segment of length P, let the indices of the start and end boundaries be x_s and x_e respectively, and let the results the network regresses at those positions be φ_s and φ_e respectively; the boundaries are then updated from these regression outputs, where P is the length of the action segment.
Step 3: action-segment evaluation. The confidence of the action segments obtained in step 2 is evaluated with the action-consistency loss function, which is also used to train the regression networks. Specifically, the loss function consists of two parts: one part measures the cosine similarity between the features of an action segment and the features of its context, and the other part measures the difference in classification confidence between the segment and its context.
Step 4: action-segment screening and fusion. After the regression networks are trained, the action segments obtained separately from RGB and from optical flow are screened and fused by a parameter-free fusion module to obtain the final result.
Step 3 specifically includes:
The loss function is calculated as follows. First, for an action segment [x_s, x_e], we extend it to [X_s, X_e], with X_s < x_s and X_e > x_e; the extended margins serve as the context of the segment. The loss function of the invention consists of two parts. One part, L_c, characterizes the cosine similarity between the features of the action segment and the features of its context, where F(u) is the feature representation of the video at time point u.
The other part, L_s, is characterized by classification confidence, where S(k, u) denotes the classification confidence of segment u for class k.
The final loss function is:

L = α·L_c + (1 − α)·(L_s − 1)

where α is a hyperparameter with 0 < α < 1.
Step 4 specifically includes: after the regression networks are trained, the action segments obtained from RGB and from optical flow undergo the following screening-and-fusion process. Let the sets of action segments obtained from RGB and from optical flow contain N_r and N_f segments, respectively.
For each RGB action segment p_r, compute its maximum IoU with the optical-flow segments:

I(p_r) = max over j = 1..N_f of IoU(p_r, p_f,j)

The final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) < 0.4.
In summary, the present invention processes RGB and optical flow separately, reducing the interference that may arise in conventional methods from directly processing the concatenated RGB and optical-flow features. The representations of the two modalities differ, and processing them separately better exploits the characteristics of each. Meanwhile, the invention designs a screening-and-fusion module to combine the action localization results of the two modalities. The designed action-consistency loss function combines action-segment features with action-segment classification confidence, which to some extent avoids the limitation of conventional methods that assess action segments by classification confidence alone.
Embodiment
Referring to Fig. 1, a weakly supervised temporal action localization method based on action consistency according to an embodiment of the present invention specifically includes the following steps:
Step 1: RGB and optical flow are processed separately as follows. The video is divided into a set of non-overlapping 15-frame segments; for each segment, 3 frames are taken at random as representative frames, and a Temporal Segment Network extracts features from the 3 frames, which are averaged to give the feature of the segment.
Step 2: taking RGB as an example (optical flow is processed identically), the RGB feature R_s obtained in step 1 is input to multiple regression networks. Each regression network consists of a 3-layer 1D convolutional neural network and is assigned one segment length P. To avoid overfitting, the first two layers consist of dilated convolutions with 256 convolution kernels of size 3; the last layer has 2 convolution kernels of size 1, whose outputs regress the start boundary and the end boundary, respectively. To guarantee that enough contextual information is input to the network, the receptive field of the regression network is configured to equal P by setting the dilation rates of the first two dilated-convolution layers accordingly.
For a video of length T, at each temporal position we first initialize a hypothesized action segment [x_s,i, x_e,i] with x_e,i − x_s,i = P, and then update its boundaries from the regression outputs of the network.
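The enumeration over a length-T video can be sketched as follows. The patent states that one hypothesis per (time point, length) pair is initialized with x_e − x_s = P; centering the hypothesis on the time point and clipping to the video extent are assumptions.

```python
def enumerate_hypotheses(T: int, lengths=(8, 16, 32)):
    """For every time point of a length-T video and every preset length P,
    propose an initial segment [x_s, x_e] with x_e - x_s = P, clipped to
    the video extent.  Centering on the time point is an assumption; the
    patent only states one segment per (point, P) pair."""
    hyps = []
    for t in range(T):
        for P in lengths:
            x_s = max(0, t - P // 2)
            x_e = min(T - 1, x_s + P)
            hyps.append((x_s, x_e, P))
    return hyps

hyps = enumerate_hypotheses(100)
print(len(hyps))   # 300
print(hyps[0])     # (0, 8, 8)
```

Each hypothesis is then refined by the regression network assigned to its length P.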
Step 3: for an action segment [x_s, x_e], we extend it to [X_s, X_e], with X_s < x_s and X_e > x_e; the extended margins serve as the context of the segment. To assess action segments, the invention defines an action-consistency loss function consisting of two parts. One part, L_c, characterizes the cosine similarity between the features of the action segment and the features of its context, where F(u) is the feature representation of the video at time point u.
The other part, L_s, is characterized by classification confidence, where S(k, u) denotes the classification confidence of segment u for class k.
The final loss function is:

L = α·L_c + (1 − α)·(L_s − 1)
where α is a hyperparameter, set to 0.6 in practice. When training the regression networks, the loss value of a network is the average of the loss values over all action segments. At test time, redundant action segments are removed by non-maximum suppression with an IoU threshold of 0.4.
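The non-maximum suppression step above is the standard greedy procedure on temporal segments; a minimal sketch (segments as (start, end) pairs, higher score = more confident):

```python
def temporal_nms(segs, scores, iou_thresh: float = 0.4):
    """Greedy NMS on temporal segments: keep the highest-scoring segment,
    drop any remaining segment whose IoU with it exceeds iou_thresh,
    then repeat on the survivors."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(segs)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(segs[best], segs[i]) <= iou_thresh]
    return [segs[i] for i in keep]

segs = [(0, 10), (1, 11), (40, 50)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segs, scores))  # [(0, 10), (40, 50)]
```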
Step 4: after the regression networks are trained, the action segments obtained from RGB and from optical flow undergo the following screening-and-fusion process. Let the sets of remaining action segments obtained from RGB and from optical flow contain N_r and N_f segments, respectively. For each RGB action segment p_r, compute its maximum IoU with the optical-flow segments:

I(p_r) = max over j = 1..N_f of IoU(p_r, p_f,j)

The final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) < 0.4. Referring to Fig. 2 and Fig. 3, the experiments compare the improvement of the present invention against historical experimental data.
Fig. 2 shows the test results of the present invention and of historical experimental data on the THUMOS14 dataset: at all IoU thresholds, the mAP measured for the present invention exceeds the historical experimental data.
Fig. 3 shows the test results of the present invention and of historical experimental data on the ActivityNet dataset: at all IoU thresholds and on average, the mAP measured for the present invention exceeds the historical experimental data.
In conclusion the invention discloses a kind of Weakly supervised timing operating position fixing method based on continuity of movement, that is, exist
The movement segment for belonging to the category in the given other situation of video class in positioning video, belongs to field of machine vision.Of the invention
Main thought are as follows: video is divided into RGB frame and light stream and is handled respectively, for each movement mode, first against every on video
A time point proposes the movement segment of the different hypothesis of length, is then made according to the continuity of movement of video and classification accuracy
Segment is acted with convolutional neural networks recurrence.The different movement segments that mode obtain are acted for two, pass through characteristic
Module is combined, and filters out final operating position fixing result.
The above embodiments are merely illustrative of the technical scheme of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art can still modify or equivalently replace specific embodiments of the invention; any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the pending claims of the invention.
Claims (10)
1. A weakly supervised temporal action localization method based on action consistency, characterized by comprising the following steps:
Step 1: dividing the video to be processed into multiple non-overlapping segments and obtaining the RGB feature and optical-flow feature of each segment;
Step 2: performing action-segment regression separately on the RGB features and optical-flow features obtained in step 1 to obtain RGB action segments and optical-flow action segments; the action-segment regression comprising: for each time point of the video to be processed, enumerating hypothesized action segments of different preset lengths; regressing the action segments of different lengths with predetermined regression networks; training the regression networks with the action-consistency loss function; and obtaining the action segments;
Step 3: evaluating the confidence of the RGB action segments and optical-flow action segments obtained in step 2 with the action-consistency loss function; using non-maximum suppression to filter out action segments whose overlap exceeds a threshold;
Step 4: after the regression networks are trained, screening and fusing the RGB action segments and optical-flow action segments through a parameter-free fusion module to obtain the final localization result.
2. The weakly supervised temporal action localization method based on action consistency according to claim 1, characterized in that step 1 specifically includes: dividing the video to be processed into multiple non-overlapping segments; sampling each segment evenly; extracting features of the sampled frames with a convolutional neural network; and taking the extracted features as the representation of the segment; wherein features are extracted separately for RGB and optical flow.
3. The weakly supervised temporal action localization method based on action consistency according to claim 1, characterized in that in step 2, the action-segment regression specifically includes: for a hypothesized action segment of length P, regressing the boundaries from the regression-network outputs, where x_s is the index of the start boundary, x_e is the index of the end boundary, φ_s is the result regressed at the start-boundary position, φ_e is the result regressed at the end-boundary position, and P is the length of the action segment.
4. The weakly supervised temporal action localization method based on action consistency according to claim 3, characterized in that in step 2 and step 3, the action-consistency loss function consists of two parts:
one part, L_c, characterizes the cosine similarity between the features of the action segment and the features of its context, where F(u) is the feature representation of the video at time point u;
the other part, L_s, is characterized by classification confidence, where S(k, u) denotes the classification confidence of segment u for class k; for an action segment [x_s, x_e], the segment is extended to [X_s, X_e], with X_s < x_s and X_e > x_e, and the extended margins serve as the context of the segment;
the expression of the action-consistency loss function is:
L = α·L_c + (1 − α)·(L_s − 1)
where α is a hyperparameter with 0 < α < 1.
5. The weakly supervised temporal action localization method based on action consistency according to claim 4, characterized in that in step 4, after the regression networks are trained, the action segments obtained for RGB and for optical flow are screened and fused as follows:
let the action segments obtained from RGB and from optical flow number N_r and N_f, respectively;
for each RGB action segment p_r, compute its maximum IoU with the optical-flow segments:
I(p_r) = max over j = 1..N_f of IoU(p_r, p_f,j)
the final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) less than a preset threshold.
6. The weakly supervised temporal action localization method based on action consistency according to claim 5, characterized in that the final screening-and-fusion result is the combination of all optical-flow action segments and the RGB action segments with I(p_r) < 0.4.
7. The weakly supervised temporal action localization method based on action consistency according to any one of claims 4 to 6, characterized in that α is set to 0.6.
8. The weakly supervised temporal action localization method based on action consistency according to claim 1, characterized in that in step 2, the RGB feature R_s obtained in step 1 is input to multiple regression networks; each regression network consists of a 3-layer 1D convolutional neural network and is assigned one segment length P;
the last layer of the regression network has 2 convolution kernels of size 1, whose outputs regress the start boundary and the end boundary, respectively.
9. The weakly supervised temporal action localization method based on action consistency according to claim 8, characterized in that the first two layers of the regression network consist of dilated convolutions.
10. The weakly supervised temporal action localization method based on action consistency according to claim 9, characterized in that the dilation rates of the first two dilated-convolution layers of the regression network are chosen so that the receptive field of the network equals P.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910575033.1A CN110348345B (en) | 2019-06-28 | 2019-06-28 | Weak supervision time sequence action positioning method based on action consistency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110348345A true CN110348345A (en) | 2019-10-18 |
CN110348345B CN110348345B (en) | 2021-08-13 |
Family
ID=68177039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910575033.1A Active CN110348345B (en) | 2019-06-28 | 2019-06-28 | Weak supervision time sequence action positioning method based on action consistency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110348345B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914644A (en) * | 2020-06-30 | 2020-11-10 | 西安交通大学 | Dual-mode cooperation based weak supervision time sequence action positioning method and system |
CN111914778A (en) * | 2020-08-07 | 2020-11-10 | 重庆大学 | Video behavior positioning method based on weak supervised learning |
CN112001420A (en) * | 2020-07-24 | 2020-11-27 | 武汉安视感知科技有限公司 | Intelligent timing and counting method and device for drill pipe of mine worker and storage device |
WO2021098402A1 (en) * | 2019-11-20 | 2021-05-27 | 腾讯科技(深圳)有限公司 | Action recognition method and apparatus, computer storage medium, and computer device |
CN115080750A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Weak supervision text classification method, system and device based on fusion prompt sequence |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217214A (en) * | 2014-08-21 | 2014-12-17 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Configurable convolutional neural network based red green blue-distance (RGB-D) figure behavior identification method |
EP3147577A1 (en) * | 2015-09-23 | 2017-03-29 | Stn B.V. | Device for and method of fluid flow monitoring |
CN108399380A (en) * | 2018-02-12 | 2018-08-14 | 北京工业大学 | A kind of video actions detection method based on Three dimensional convolution and Faster RCNN |
CN108573246A (en) * | 2018-05-08 | 2018-09-25 | 北京工业大学 | A kind of sequential action identification method based on deep learning |
CN108664931A (en) * | 2018-05-11 | 2018-10-16 | 中国科学技术大学 | A kind of multistage video actions detection method |
CN108805083A (en) * | 2018-06-13 | 2018-11-13 | 中国科学技术大学 | The video behavior detection method of single phase |
CN109409307A (en) * | 2018-11-02 | 2019-03-01 | 深圳龙岗智能视听研究院 | A kind of Online Video behavioral value system and method based on space-time contextual analysis |
CN109558811A (en) * | 2018-11-12 | 2019-04-02 | 中山大学 | A kind of action identification method based on sport foreground concern and non-supervisory key-frame extraction |
CN109784269A (en) * | 2019-01-11 | 2019-05-21 | 中国石油大学(华东) | One kind is based on the united human action detection of space-time and localization method |
Non-Patent Citations (5)
Title |
---|
Haonan Qiu et al.: "Precise Temporal Action Localization by Evolving Temporal Proposals", arXiv * |
Yu-Wei Chao et al.: "Rethinking the Faster R-CNN Architecture for Temporal Action Localization", arXiv * |
Zheng Shou et al.: "AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos", arXiv * |
Zheng Shou et al.: "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs", 2016 IEEE Conference on Computer Vision and Pattern Recognition * |
Shi Xiangbin et al.: "Weakly Supervised Action Localization Based on Action Template Matching", http://kns.cnki.net/kcms/detail/51.1307.tp.20190408.1444.012.html * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021098402A1 (en) * | 2019-11-20 | 2021-05-27 | 腾讯科技(深圳)有限公司 | Action recognition method and apparatus, computer storage medium, and computer device |
US11928893B2 (en) | 2019-11-20 | 2024-03-12 | Tencent Technology (Shenzhen) Company Limited | Action recognition method and apparatus, computer storage medium, and computer device |
CN111914644A (en) * | 2020-06-30 | 2020-11-10 | 西安交通大学 | Weakly-supervised temporal action localization method and system based on dual-modality cooperation |
CN112001420A (en) * | 2020-07-24 | 2020-11-27 | 武汉安视感知科技有限公司 | Intelligent timing and counting method, device, and storage device for miners' drill pipes |
CN111914778A (en) * | 2020-08-07 | 2020-11-10 | 重庆大学 | Video behavior localization method based on weakly supervised learning |
CN111914778B (en) * | 2020-08-07 | 2023-12-26 | 重庆大学 | Video behavior localization method based on weakly supervised learning |
CN115080750A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Weakly-supervised text classification method, system and device based on fused prompt sequences |
CN115080750B (en) * | 2022-08-16 | 2022-11-08 | 之江实验室 | Weakly-supervised text classification method, system and device based on fused prompt sequences |
Also Published As
Publication number | Publication date |
---|---|
CN110348345B (en) | 2021-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110348345A (en) | Weakly-supervised temporal action localization method based on action continuity | |
CN106920229B (en) | Automatic detection method and system for blurred image regions | |
CN110070029B (en) | Gait recognition method and device | |
CN100367294C (en) | Method for segmenting human skin regions from color digital images and videos | |
CN109508671B (en) | Video abnormal event detection system and method based on weakly supervised learning | |
CN109903237B (en) | Multi-scale face image deblurring method based on low and high frequency separation | |
CN109410184B (en) | Live-broadcast pornographic image detection method based on dense adversarial network semi-supervised learning | |
CN109472193A (en) | Face detection method and device | |
CN110414367B (en) | Time sequence behavior detection method based on GAN and SSN | |
CN111368758A (en) | Face blur detection method and device, computer equipment and storage medium | |
CN112766218B (en) | Cross-domain pedestrian re-identification method and device based on asymmetric joint teaching network | |
CN110942456B (en) | Tampered image detection method, device, equipment and storage medium | |
CN109740553B (en) | Image semantic segmentation data screening method and system based on recognition | |
CN113591674B (en) | Edge environment behavior recognition system for real-time video stream | |
CN108319672A (en) | Mobile terminal malicious information filtering method and system based on cloud computing | |
CN111476160A (en) | Loss function optimization method, model training method, target detection method, and medium | |
CN115049954A (en) | Target identification method, device, electronic equipment and medium | |
CN111723852A (en) | Robust training method for target detection network | |
CN109389116A (en) | Character detection method and device | |
CN110458203B (en) | Advertisement image material detection method | |
CN111163332A (en) | Video pornography detection method, terminal and medium | |
CN112508135B (en) | Model training method, pedestrian attribute prediction method, device and equipment | |
CN112396126B (en) | Target detection method and system based on detection trunk and local feature optimization | |
CN111860100B (en) | Pedestrian number determining method and device, electronic equipment and readable storage medium | |
CN114694090A (en) | Campus abnormal behavior detection method based on improved PBAS algorithm and YOLOv5 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||