WO2018086231A1 - Method and system for video sequence alignment - Google Patents

Method and system for video sequence alignment Download PDF

Info

Publication number
WO2018086231A1
WO2018086231A1 (PCT/CN2016/113542)
Authority
WO
WIPO (PCT)
Prior art keywords
video
sequence
scene
sub
scene category
Prior art date
Application number
PCT/CN2016/113542
Other languages
French (fr)
Chinese (zh)
Inventor
雷延强
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司 filed Critical 广州视源电子科技股份有限公司
Publication of WO2018086231A1 publication Critical patent/WO2018086231A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a method and system for video sequence alignment. The method comprises the following steps: extracting, from a video sequence to be aligned, a video clip without a scene change; dividing respective video frames in the video clip into a plurality of sub-blocks, and generating a video clip sequence according to the sub-blocks of the respective video frames; inputting the video clip sequence to a pre-trained scene category classifier to calculate probability values that the video clip sequence belongs to respective scene categories, and setting the scene category associated with a maximum probability value to be a first scene category to which the video clip belongs; and aligning the video clip with a video clip belonging to the first scene category in a pre-stored original video sequence.

Description

Video sequence alignment method and system

Technical field

The present invention relates to the field of signal detection technologies, and in particular to a video sequence alignment method and system.

Background art

A display device is a device that can output image or touch information. To ensure that a display device works properly, it is usually necessary to test some of its performance parameters. Taking a television as an example, the sensitivity of the TV motherboard is an important performance parameter of the TV.
The existing scheme for detecting the sensitivity of a TV motherboard is as follows: using the original video signal as a reference, the video signal to be detected is aligned with the original video signal, the signal strength of the aligned video signal is adjusted to the critical strength at which the output of the display device changes from showing no mosaic effect to showing a mosaic effect, and the performance parameters of the display device are determined from that signal strength.

However, this approach requires considerable time for video signal alignment, resulting in low signal processing efficiency.
Summary of the invention

Based on this, it is necessary to provide a video sequence alignment method and system that address the problem of low signal processing efficiency.

A video sequence alignment method includes the following steps:

capturing, from a video sequence to be aligned, a video segment without scene switching;

dividing each video frame in the video segment into several sub-blocks, and generating a video segment sequence from the sub-blocks of the video frames;

inputting the video segment sequence into a pre-trained scene category classifier, calculating the probability that the video segment sequence belongs to each scene category, and setting the scene category with the largest probability value as the first scene category to which the video segment belongs;

aligning the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.

A video sequence alignment system includes:

a video capture module, configured to capture, from a video sequence to be aligned, a video segment without scene switching;

a sequence generation module, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the video frames;

a calculation module, configured to input the video segment sequence into a pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs;

an alignment module, configured to align the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
In the above video sequence alignment method and system, a video segment without scene switching is captured from the video sequence to be aligned, each video frame in the segment is divided into several sub-blocks, a video segment sequence is generated from those sub-blocks, the probability that the sequence belongs to each scene category is calculated, the scene category with the largest probability is taken as the first scene category of the segment, and the segment is aligned with the video segments of the first scene category in the pre-stored original video sequence. By first performing a coarse alignment to find the video segments of the first scene category in the original video sequence, and then finely aligning the sequence to be aligned with those segments, the time spent on video alignment is effectively reduced and the efficiency of video alignment is improved.
Description of the drawings

FIG. 1 is a flow chart of a video sequence alignment method according to an embodiment;

FIG. 2 is a schematic diagram of an original video sequence classified by scene according to an embodiment;

FIG. 3 is a schematic structural diagram of a deep convolutional network according to an embodiment;

FIG. 4 is a schematic structural diagram of a video sequence alignment system according to an embodiment.
Detailed description

The technical solution of the present invention is described below with reference to the accompanying drawings.

As shown in FIG. 1, the present invention provides a video sequence alignment method, which may include the following steps:

S1: capture a video segment without scene switching from the video sequence to be aligned.
The length of the video segment should satisfy a certain time cost constraint, which characterizes the time the alignment operation is allowed to take. Generally, the longer the video segment, the longer the alignment process takes. To satisfy this constraint, a short segment is typically captured (for example, a segment one second long). Setting a time cost constraint improves the real-time behaviour of the alignment result, shortens the user's waiting time, and improves the user experience.
After a video segment is captured, it must be checked; if it does not meet the condition, a new segment is captured. The basic principle of the check is to keep the change between successive frames of the captured segment small, with no scene switching. The accumulated inter-frame error can be used as the criterion; the segment is considered free of scene switching when:
$$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

where f(z_i) is the feature of the i-th video frame (for example, a per-region color histogram), f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function (for example, the L2 distance), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
If the above condition is not satisfied, the video segment must be captured again. In general, a segment within one second easily satisfies the condition, so the capture is rarely repeated. The absence of scene switching means that the video content is essentially consistent, which is favourable for classification.
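As a rough illustration only, the check above can be sketched in Python. The per-region color histogram (a 4×4 grid with 8 bins per channel) and the L2 norm are assumptions standing in for the feature f(·) and the distance metric, which the text leaves open:

```python
import numpy as np

def frame_feature(frame, bins=8, grid=(4, 4)):
    """Per-region color histogram of an RGB frame (H x W x 3, uint8).
    The 4x4 grid and 8 bins per channel are assumptions; the text only
    says 'color histogram of the sub-region'."""
    h, w, _ = frame.shape
    gh, gw = grid
    feats = []
    for r in range(gh):
        for c in range(gw):
            block = frame[r * h // gh:(r + 1) * h // gh,
                          c * w // gw:(c + 1) * w // gw]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / max(block.size // 3, 1))
    return np.concatenate(feats)

def has_no_scene_switch(frames, T):
    """Accumulated inter-frame error sum_i ||f(z_i) - f(z_{i-1})||_2 < T."""
    feats = [frame_feature(f) for f in frames]
    err = sum(np.linalg.norm(feats[i] - feats[i - 1])
              for i in range(1, len(feats)))
    return err < T
```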
S2: divide each video frame in the video segment into several sub-blocks, and generate a video segment sequence from the sub-blocks of the frames.

Assume the video segment captured in step S1 is Z = [z_0, z_1, ..., z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each video frame is divided into K sub-blocks, this step generates a video segment sequence consisting of the sub-blocks z_i^k (i = 0, 1, ..., n; k = 1, ..., K), kept in temporal order, where z_i^k denotes the k-th sub-block of the i-th frame.
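A minimal sketch of this division is shown below, assuming a fixed rectangular grid; the text does not fix K or the grid shape, so the 2×2 grid here is purely illustrative:

```python
def split_into_subblocks(frame, grid=(2, 2)):
    """Divide one video frame into K = grid[0] * grid[1] sub-blocks."""
    h, w = frame.shape[:2]
    gh, gw = grid
    return [frame[r * h // gh:(r + 1) * h // gh,
                  c * w // gw:(c + 1) * w // gw]
            for r in range(gh) for c in range(gw)]

def build_segment_sequence(frames, grid=(2, 2)):
    """Video segment sequence: for every frame z_i, the list of its
    sub-blocks z_i^1 ... z_i^K, kept in temporal order."""
    return [split_into_subblocks(f, grid) for f in frames]
```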
S3: input the video segment sequence into the pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs.

The probability value can be calculated according to the following formula:
$$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, K is the total number of sub-blocks in one video frame of the sequence, and ∏ denotes a product.
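The coarse classification can then be sketched as below. `subblock_classifier` is a placeholder for the trained scene category classifier described in the following steps; summing log-probabilities is an implementation convenience that is equivalent to the product above but numerically safer:

```python
import numpy as np

def coarse_scene_category(segment_sequence, subblock_classifier, num_categories):
    """Pick the first scene category: argmax_j prod_i prod_k p(Y_j | z_i^k).
    `subblock_classifier` is a hypothetical callable returning a probability
    vector of length `num_categories` for one sub-block image."""
    log_p = np.zeros(num_categories)
    for subblocks in segment_sequence:          # one frame z_i
        for block in subblocks:                 # one sub-block z_i^k
            p = np.asarray(subblock_classifier(block))
            log_p += np.log(np.clip(p, 1e-12, 1.0))
    return int(np.argmax(log_p)), log_p
```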
The scene category classifier can be trained in advance, before the alignment operation is performed. Training the scene classifier may include the following steps:

Step 1: obtain video sequence samples and divide them by scene into multiple scene categories.

Within a video sequence, adjacent images are extremely similar as long as the scene does not switch. The video sequence samples can therefore be divided by scene into relatively coarse categories while preserving their temporal order. During coarse positioning it is only necessary to determine which category the current video segment most resembles. The classification is described as follows:
Let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample. It is divided into multiple categories by scene, as shown in FIG. 2, where Y_l is the l-th video segment in the sample and each video segment includes several video frames.
Scene boundaries can be annotated in advance and the scenes divided according to the annotations (an original video sequence is typically 20-30 minutes long, so the amount of annotation is small and it is a one-off task), or the scenes can be classified automatically using the typical accumulated inter-frame error:

$$d(Y)=\sum_{i}\left\|f(y_i)-f(y_{i-1})\right\|$$

where f(y_i) is the feature representation of the i-th video frame (for example, a per-region color histogram) and ||·|| is a distance metric function. If d(Y) is smaller than a set threshold, the current adjacent images are assigned to the same category; the division process is then repeated on the remaining, not yet divided part of the sequence.
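A sketch of the automatic division is given below. The feature function and threshold are assumptions, and resetting the accumulated error at each scene boundary is one plausible reading of the repeated division described above:

```python
import numpy as np

def split_sample_by_scene(frames, threshold, feature_fn):
    """Group consecutive frames of the sample sequence Y into scene
    categories: a new category is opened once the accumulated inter-frame
    error d(Y) of the current group reaches `threshold`.  `feature_fn`
    stands in for f(.), e.g. a per-region color histogram."""
    categories, current = [], [frames[0]]
    acc, prev = 0.0, feature_fn(frames[0])
    for frame in frames[1:]:
        cur = feature_fn(frame)
        acc += np.linalg.norm(cur - prev)
        if acc < threshold:
            current.append(frame)
        else:
            categories.append(current)   # close the current scene category
            current, acc = [frame], 0.0
        prev = cur
    categories.append(current)
    return categories
```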
Step 2: divide the video sequence samples of each scene category into several sample sub-blocks, where the samples include non-overlapping sub-blocks.

For every image sample in every scene category, a non-overlapping sub-block division is performed (overlapping divisions are also possible, but they should include the non-overlapping division as a special case) to build finer small images (for example 256×256; if the image size is not an integer multiple of 256, the right-most division may overlap), which are then used to train the deep convolutional network. As a general learning strategy, the more samples the better; the non-overlapping division uses the minimum number of samples, with no overlap between them, while an overlapping division must include the non-overlapping case as a special instance or generality is lost. The benefits are: 1) the number of samples increases, which helps train the deep convolutional network; 2) the sample images become smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers the complexity. For example, each original sample image y_i, after sub-block division, yields K+1 sub-block images.
Step 3: train the deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
The deep convolutional network is trained with the collected scene category sample images (that is, the sub-block images) and their labels (the scene category of each sub-block image) to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is passed through a ReLU (Rectified Linear Units) activation function for a non-linear transformation and then through a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
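A sketch of such a network in PyTorch is shown below; the text names only the layer types (five convolutions with ReLU and pooling, two fully-connected layers, softmax), so the channel counts, kernel sizes, and the 256×256 input size are assumptions:

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Sketch of the classifier described above: five convolutional layers,
    each followed by ReLU and pooling, then two fully-connected layers and
    a softmax over scene categories.  Channel counts, kernel sizes and the
    256x256 input size are assumptions, not values from the text."""
    def __init__(self, num_categories, in_size=256):
        super().__init__()
        chans = [3, 32, 64, 128, 128, 256]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        feat_size = in_size // 2 ** 5            # five 2x poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(chans[-1] * feat_size * feat_size, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_categories))

    def forward(self, x):
        # Softmax turns the logits into p(Y_j | sub-block image).
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

During training one would usually feed the logits (without the final softmax) to a standard cross-entropy loss; the softmax output corresponds to the per-sub-block probability p(Y_j/z_i^k) used at inference time.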
S4: align the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence.

Step S3 has already coarsely located the category Y_J = [y_u, y_{u+1}, ..., y_v] to which the current video segment Z = [z_0, z_1, ..., z_n] belongs. This step precisely locates the position of the current segment within it. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], and the exact alignment is then computed as

$$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
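A direct sketch of this search is given below; it assumes the per-frame distance d(·, ·) is computed on pre-extracted frame features (for example the same histograms used for the scene-switch check), which the text does not mandate:

```python
import numpy as np

def fine_align(segment_feats, reference_feats, u, v, n):
    """Find Q = argmin_{q in [u-n, v]} sum_i d(z_i, y_{q+i}).
    `segment_feats` holds features of z_0..z_n, `reference_feats` features
    of the whole original sequence; [u, v] is the coarsely located category
    and the search range is padded by n frames as described above."""
    best_q, best_cost = None, float("inf")
    for q in range(u - n, v + 1):
        if q < 0 or q + n >= len(reference_feats):
            continue                       # stay inside the stored sequence
        cost = sum(np.linalg.norm(segment_feats[i] - reference_feats[q + i])
                   for i in range(n + 1))
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q, best_cost
```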
The above video sequence alignment method uses a coarse-to-fine search strategy: a coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the sequence to be aligned is then finely aligned with those segments, which effectively reduces the time spent on video alignment and improves the efficiency of video alignment.
As shown in FIG. 4, the present invention provides a video sequence alignment system, which may include:
a video capture module 10, configured to capture a video segment without scene switching from the video sequence to be aligned.

The length of the video segment should satisfy a certain time cost constraint, which characterizes the time the alignment operation is allowed to take. Generally, the longer the video segment, the longer the alignment process takes. To satisfy this constraint, a short segment is typically captured (for example, a segment one second long). Setting the time cost constraint improves the real-time behaviour of the alignment result, shortens the user's waiting time, and improves the user experience.
After a video segment is captured, it must be checked; if it does not meet the condition, a new segment is captured. The basic principle of the check is to keep the change between successive frames of the captured segment small, with no scene switching. A decision module can be provided that uses the accumulated inter-frame error as the criterion:

$$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

where f(z_i) is the feature of the i-th video frame (for example, a per-region color histogram), f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function (for example, the L2 distance), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned. The absence of scene switching means that the video content is essentially consistent, which is favourable for classification.
If the above condition is not satisfied, the video segment must be captured again. In general, a segment within one second easily satisfies the condition, so the capture is rarely repeated.
a sequence generation module 20, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the frames.

Assume the video segment captured by the video capture module 10 is Z = [z_0, z_1, ..., z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each frame is divided into K sub-blocks, the sequence generation module 20 generates the video segment sequence consisting of the sub-blocks z_i^k (i = 0, 1, ..., n; k = 1, ..., K), kept in temporal order.
a calculation module 30, configured to input the video segment sequence into the pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs.

The probability value can be calculated according to the following formula:

$$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, K is the total number of sub-blocks in one video frame of the sequence, and ∏ denotes a product.
The scene category classifier can be pre-trained before the alignment operation is performed. The video sequence alignment system may further include:

a classification module, configured to obtain video sequence samples and divide them by scene into multiple scene categories.

Within a video sequence, adjacent images are extremely similar as long as the scene does not switch. The video sequence samples can therefore be divided by scene into relatively coarse categories while preserving their temporal order. During coarse positioning it is only necessary to determine which category the current video segment most resembles. The classification is described as follows:
Let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample. It is divided into multiple categories by scene, as shown in FIG. 2, where Y_l is the l-th video segment in the sample and each video segment includes several video frames.
Scene boundaries can be annotated in advance and the scenes divided according to the annotations (an original video sequence is typically 20-30 minutes long, so the amount of annotation is small and it is a one-off task), or the scenes can be classified automatically using the typical accumulated inter-frame error:

$$d(Y)=\sum_{i}\left\|f(y_i)-f(y_{i-1})\right\|$$

where f(y_i) is the feature representation of the i-th video frame (for example, a per-region color histogram) and ||·|| is a distance metric function. If d(Y) is smaller than a set threshold, the current adjacent images are assigned to the same category; the division process is then repeated on the remaining, not yet divided part of the sequence.
a sub-block division module, configured to divide the video sequence samples of each scene category into several sample sub-blocks, where the samples include non-overlapping sub-blocks.

For every image sample in every scene category, a non-overlapping sub-block division is performed (overlapping divisions are also possible, but they should include the non-overlapping division as a special case) to build finer small images (for example 256×256; if the image size is not an integer multiple of 256, the right-most division may overlap), which are then used to train the deep convolutional network. As a general learning strategy, the more samples the better; the non-overlapping division uses the minimum number of samples, with no overlap between them, while an overlapping division must include the non-overlapping case as a special instance or generality is lost. The benefits are: 1) the number of samples increases, which helps train the deep convolutional network; 2) the sample images become smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers the complexity. For example, each original sample image y_i, after sub-block division, yields K+1 sub-block images.
a training module, configured to train the deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.

The deep convolutional network is trained with the collected scene category sample images (that is, the sub-block images) and their labels (the scene category of each sub-block image) to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is passed through a ReLU (Rectified Linear Units) activation function for a non-linear transformation and then through a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
an alignment module 40, configured to align the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence.

The calculation module 30 has already coarsely located the category Y_J = [y_u, y_{u+1}, ..., y_v] to which the current video segment Z = [z_0, z_1, ..., z_n] belongs. The alignment module 40 precisely locates the position of the current segment within it. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], and the exact alignment is then computed as

$$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
The above video sequence alignment system uses a coarse-to-fine search strategy: a coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the sequence to be aligned is then finely aligned with those segments, which effectively reduces the time spent on video alignment and improves the efficiency of video alignment.
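To make the cooperation of the four modules concrete, a hypothetical end-to-end sketch is given below. It simply chains the earlier helper sketches (`has_no_scene_switch`, `build_segment_sequence`, `coarse_scene_category`, `frame_feature`, `fine_align`); the `scene_index` mapping from a category id to the (u, v) frame range of the original sequence is also an assumption about how the pre-stored sequence is organised:

```python
def align_video(frames_to_align, original_feats, scene_index, classifier, T, n):
    """End-to-end sketch of how the four modules cooperate; all helper
    functions are the hypothetical sketches defined earlier."""
    clip = frames_to_align[:n + 1]                      # video capture module
    if not has_no_scene_switch(clip, T):
        raise ValueError("scene switch detected, capture another clip")
    sequence = build_segment_sequence(clip)             # sequence generation module
    category, _ = coarse_scene_category(sequence, classifier,
                                        num_categories=len(scene_index))
    u, v = scene_index[category]                        # calculation module output
    clip_feats = [frame_feature(f) for f in clip]       # alignment module
    return fine_align(clip_feats, original_feats, u, v, n)
```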
The video sequence alignment system of the present invention corresponds one-to-one with the video sequence alignment method of the present invention; the technical features and beneficial effects described in the embodiments of the video sequence alignment method apply equally to the embodiments of the video sequence alignment system, which is hereby stated.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments merely express several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention patent shall be subject to the appended claims.

Claims (10)

  1. A video sequence alignment method, characterized in that it comprises the following steps:

    capturing, from a video sequence to be aligned, a video segment without scene switching;

    dividing each video frame in the video segment into several sub-blocks, and generating a video segment sequence from the sub-blocks of the video frames;

    inputting the video segment sequence into a pre-trained scene category classifier, calculating the probability that the video segment sequence belongs to each scene category, and setting the scene category with the largest probability value as the first scene category to which the video segment belongs;

    aligning the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
  2. The video sequence alignment method according to claim 1, characterized in that, before the video sequence is input into the pre-trained scene category classifier, the method further comprises the following steps:

    obtaining video sequence samples, and dividing the video sequence samples by scene into multiple scene categories;

    dividing the video sequence samples of each scene category into several sample sub-blocks, wherein the video sequence samples include non-overlapping sample sub-blocks;

    training a deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
  3. The video sequence alignment method according to claim 1, characterized by further comprising the following step:

    determining that the video segment has no scene switching if the video segment satisfies the following condition:

    $$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

    where f(z_i) is the feature of the i-th video frame, f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function, T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
  4. The video sequence alignment method according to claim 1, characterized in that the step of calculating the probability that the video segment sequence belongs to each scene category comprises:

    calculating the probability that the video segment sequence belongs to each scene category according to the following formula:

    $$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

    where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, and K is the total number of sub-blocks in one video frame of the video segment sequence.
  5. The video sequence alignment method according to claim 1, characterized in that the step of aligning the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence comprises:

    aligning the video segment with the video segments belonging to the first scene category in the original video sequence according to the following formula:

    $$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

    where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
  6. A video sequence alignment system, characterized by comprising:

    a video capture module, configured to capture, from a video sequence to be aligned, a video segment without scene switching;

    a sequence generation module, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the video frames;

    a calculation module, configured to input the video segment sequence into a pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs;

    an alignment module, configured to align the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
  7. The video sequence alignment system according to claim 6, characterized by further comprising:

    a classification module, configured to obtain video sequence samples and divide the video sequence samples by scene into multiple scene categories;

    a sub-block division module, configured to divide the video sequence samples of each scene category into several sample sub-blocks, wherein the video sequence samples include non-overlapping sample sub-blocks;

    a training module, configured to train a deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
  8. The video sequence alignment system according to claim 6, characterized by further comprising:

    a determination module, configured to determine that the video segment has no scene switching if the video segment satisfies the following condition:

    $$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

    where f(z_i) is the feature of the i-th video frame, f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function, T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
  9. The video sequence alignment system according to claim 6, wherein the calculation module further calculates the probability values of the video segment sequence belonging to each scene category according to the following formula:
    Figure PCTCN2016113542-appb-100008
    where
    Figure PCTCN2016113542-appb-100009
    denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence,
    Figure PCTCN2016113542-appb-100010
    is the probability value that the sub-block
    Figure PCTCN2016113542-appb-100011
    belongs to the j-th scene category, p(Y_j/Z) is the probability value that the video segment sequence belongs to the j-th scene category, and K is the total number of sub-blocks in one video frame of the video segment sequence.
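One plausible reading of the combination in claim 9 is averaging: per-sub-block probabilities from the scene category classifier are pooled over all sub-blocks of all frames to give the segment-level probability of each scene category. The averaging rule and the block_classifier interface below are assumptions; the exact combination is the formula referenced as Figure PCTCN2016113542-appb-100008.

```python
import numpy as np

def segment_scene_probabilities(segment_sequence, block_classifier, num_classes):
    """Pool per-sub-block probability vectors into segment-level probabilities
    p(Y_j/Z) by simple averaging over all n * K sub-blocks."""
    total = np.zeros(num_classes)
    count = 0
    for frame_blocks in segment_sequence:      # one list of K sub-blocks per frame
        for block in frame_blocks:
            total += block_classifier(block)   # probability vector over scene categories
            count += 1
    return total / count
```

The first scene category of claim 6 would then be the index of the largest entry of the returned vector.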
  10. The video sequence alignment system according to claim 6, wherein the alignment module further aligns the video segment with the video segment belonging to the first scene category in the original video sequence according to the following formula:
    Figure PCTCN2016113542-appb-100012
    where Y_J = [y_{u-n}, y_{u-n+1}, …, y_{v+n}];
    in which Q denotes the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, …, y_v] denotes the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, …, n) is the video frame at the i-th instant before y_0, y_{v+i} (i = 1, 2, …, n) is the video frame at the i-th instant after y_n, n is a positive integer, and q ∈ [u-n, v].
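Finally, the alignment of claim 10 (and the corresponding method claim above) can be read as an exhaustive search over candidate offsets q: the captured segment Z is slid over the reference segment Y_J, i.e. the first-scene-category segment extended by n frames on each side, and the offset minimising the summed frame distance is kept. The mean-absolute-difference metric and the brute-force search below are illustrative choices only, not the patent's prescribed ones.

```python
import numpy as np

def frame_distance(a, b):
    """d(., .): mean absolute pixel difference between two frames, used as a
    stand-in for the unspecified distance metric."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))

def best_alignment(segment, reference):
    """Return the offset Q of `segment` (Z) inside `reference` (Y_J) that
    minimises the summed frame distance over all valid offsets."""
    m, r = len(segment), len(reference)
    best_q, best_cost = 0, float("inf")
    for q in range(r - m + 1):
        cost = sum(frame_distance(z, reference[q + i]) for i, z in enumerate(segment))
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q
```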
PCT/CN2016/113542 2016-11-09 2016-12-30 Method and system for video sequence alignment WO2018086231A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610986953.9 2016-11-09
CN201610986953.9A CN106612457B (en) 2016-11-09 2016-11-09 Video sequence alignment schemes and system

Publications (1)

Publication Number Publication Date
WO2018086231A1 true WO2018086231A1 (en) 2018-05-17

Family

ID=58614979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113542 WO2018086231A1 (en) 2016-11-09 2016-12-30 Method and system for video sequence alignment

Country Status (2)

Country Link
CN (1) CN106612457B (en)
WO (1) WO2018086231A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194419A (en) * 2017-05-10 2017-09-22 百度在线网络技术(北京)有限公司 Video classification methods and device, computer equipment and computer-readable recording medium
CN108537134B (en) * 2018-03-16 2020-06-30 北京交通大学 Video semantic scene segmentation and labeling method
CN108682436B (en) * 2018-05-11 2020-06-23 北京海天瑞声科技股份有限公司 Voice alignment method and device
CN110147700B (en) * 2018-05-18 2023-06-27 腾讯科技(深圳)有限公司 Video classification method, device, storage medium and equipment
CN111723617B (en) * 2019-03-20 2023-10-27 顺丰科技有限公司 Method, device, equipment and storage medium for identifying actions
CN110347875B (en) * 2019-07-08 2022-04-15 北京字节跳动网络技术有限公司 Video scene classification method and device, mobile terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310734A1 (en) * 2007-06-18 2008-12-18 The Regents Of The University Of California High speed video action recognition and localization
CN101692269A (en) * 2009-10-16 2010-04-07 北京中星微电子有限公司 Method and device for processing video programs
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN105704485A (en) * 2016-02-02 2016-06-22 广州视源电子科技股份有限公司 Display device performance parameter detection method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680342B2 (en) * 2004-08-16 2010-03-16 Fotonation Vision Limited Indoor/outdoor classification in digital images
JP2008234623A (en) * 2007-02-19 2008-10-02 Seiko Epson Corp Category classification apparatus and method, and program
CN101814147B (en) * 2010-04-12 2012-04-25 中国科学院自动化研究所 Method for realizing classification of scene images
CN103366181A (en) * 2013-06-28 2013-10-23 安科智慧城市技术(中国)有限公司 Method and device for identifying scene integrated by multi-feature vision codebook
CN104881675A (en) * 2015-05-04 2015-09-02 北京奇艺世纪科技有限公司 Video scene identification method and apparatus
CN105227907B (en) * 2015-08-31 2018-07-27 电子科技大学 Unsupervised anomalous event real-time detection method based on video
CN105550699B (en) * 2015-12-08 2019-02-12 北京工业大学 A kind of video identification classification method based on CNN fusion space-time remarkable information
CN105847964A (en) * 2016-03-28 2016-08-10 乐视控股(北京)有限公司 Movie and television program processing method and movie and television program processing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310734A1 (en) * 2007-06-18 2008-12-18 The Regents Of The University Of California High speed video action recognition and localization
CN101692269A (en) * 2009-10-16 2010-04-07 北京中星微电子有限公司 Method and device for processing video programs
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN105704485A (en) * 2016-02-02 2016-06-22 广州视源电子科技股份有限公司 Display device performance parameter detection method and system

Also Published As

Publication number Publication date
CN106612457B (en) 2019-09-03
CN106612457A (en) 2017-05-03

Similar Documents

Publication Publication Date Title
WO2018086231A1 (en) Method and system for video sequence alignment
Masood et al. License plate detection and recognition using deeply learned convolutional neural networks
CN107316007B (en) Monitoring image multi-class object detection and identification method based on deep learning
CN110609920B (en) Pedestrian hybrid search method and system in video monitoring scene
CN104778474B (en) A kind of classifier construction method and object detection method for target detection
US8929595B2 (en) Dictionary creation using image similarity
CN109829467A (en) Image labeling method, electronic device and non-transient computer-readable storage medium
WO2022121766A1 (en) Method and apparatus for detecting free space
CN106778736B (en) Robust license plate recognition method and system
CN109859164A (en) A method of by Quick-type convolutional neural networks to PCBA appearance test
CN112232237B (en) Method, system, computer device and storage medium for monitoring vehicle flow
WO2014082480A1 (en) Method and device for calculating number of pedestrians and crowd movement directions
CN108052931A (en) A kind of license plate recognition result fusion method and device
CN104200218B (en) A kind of across visual angle action identification method and system based on timing information
CN111241987B (en) Multi-target model visual tracking method based on cost-sensitive three-branch decision
CN110222627A (en) A kind of face amended record method
CN106572387A (en) Video sequence alignment method and video sequence alignment system
CN102298695B (en) Visual analyzing and processing method for detecting paper money bundle
CN113269038B (en) Multi-scale-based pedestrian detection method
CN112232236B (en) Pedestrian flow monitoring method, system, computer equipment and storage medium
Lin et al. A traffic sign recognition method based on deep visual feature
CN109325467A (en) A kind of wireless vehicle tracking based on video detection result
CN103310088A (en) Automatic detecting method of abnormal illumination power consumption
CN109325487B (en) Full-category license plate recognition method based on target detection
CN116469164A (en) Human gesture recognition man-machine interaction method and system based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16921379

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.10.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16921379

Country of ref document: EP

Kind code of ref document: A1