WO2018086231A1 - Method and system for video sequence alignment - Google Patents
Method and system for video sequence alignment Download PDFInfo
- Publication number
- WO2018086231A1 (PCT/CN2016/113542)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- sequence
- scene
- sub
- scene category
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44012—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
Definitions
- the present invention relates to the field of signal detection technologies, and in particular, to a video sequence alignment method and system.
- a display device is a device that can output images or touch information. In order to ensure the normal operation of the display device, it is usually necessary to detect some performance parameters of the display device. Taking the TV as an example, the sensitivity of the motherboard of the TV is an important performance parameter of the TV.
- the existing scheme for detecting the sensitivity of the TV motherboard is: using the original video signal as a reference, aligning the video signal to be detected with the original video signal, adjusting the signal strength of the aligned video signal to the critical strength between the absence and the appearance of the mosaic effect in the display device's output, and determining the performance parameters of the display device according to that signal strength.
- a video sequence alignment method includes the following steps:
- capturing a video clip without scene switching from a video sequence to be aligned;
- dividing each video frame in the video clip into a plurality of sub-blocks, and generating a video clip sequence from the sub-blocks of the respective video frames;
- inputting the video clip sequence into a pre-trained scene category classifier, calculating the probability that the sequence belongs to each scene category, and setting the category with the highest probability as the first scene category to which the clip belongs;
- aligning the video clip with the video clip belonging to the first scene category in a pre-stored original video sequence;
- a video sequence alignment system comprising:
- a video capture module configured to capture a video clip without scene switching from a video sequence to be aligned
- a sequence generating module configured to separately divide each video frame in the video segment into a plurality of sub-blocks, and generate a video segment sequence according to the sub-blocks of each video frame;
- a calculation module configured to input the video clip sequence into a pre-trained scene category classifier, calculate the probability that the sequence belongs to each scene category, and set the category with the highest probability as the first scene category to which the video clip belongs;
- an aligning module configured to align the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence.
- in the video sequence alignment method and system, a video clip without scene switching is captured from the video sequence to be aligned, each video frame in the clip is divided into several sub-blocks, and a video clip sequence is generated from those sub-blocks; the probability that the sequence belongs to each scene category is calculated, the category with the highest probability is set as the first scene category to which the clip belongs, and the clip is aligned with the clips of that category in the pre-stored original video sequence. Coarse alignment first locates the clips belonging to the first scene category in the original sequence, and fine alignment then matches the sequence to be aligned against them, which effectively reduces alignment time and improves alignment efficiency.
- FIG. 1 is a flow chart of a video sequence alignment method of an embodiment
- FIG. 2 is a schematic diagram of classification of an original video sequence by scene according to an embodiment
- FIG. 3 is a schematic structural diagram of a deep convolution network of an embodiment
- FIG. 4 is a schematic structural diagram of a video sequence alignment system of an embodiment.
- the present invention provides a video sequence alignment method, which may include the following steps:
- the length of the video sequence should satisfy a certain time cost constraint, and the time cost constraint is used to characterize the time taken for the video sequence alignment operation.
- the longer the length of the video sequence the longer the alignment process takes.
- to satisfy this constraint, a short video clip is generally captured (for example, a clip 1 second in length).
- the basic criterion is that the captured clip should vary as little as possible from frame to frame, with no scene switching.
- the accumulated interframe error can be used as the criterion for evaluation.
- the accumulated inter-frame error is: d(Z) = Σ_{i=1..n} ||f(z_i) − f(z_{i−1})|| &lt; T, where:
- f(z_i) is the feature of the i-th video frame (e.g., a per-region color histogram)
- f(z_{i−1}) is the feature of the (i−1)-th video frame
- ||·|| is a distance metric function (e.g., the L2 distance)
- T is a preset distance threshold
- n is the total number of video segments in the video sequence to be aligned.
- the video clip sequence is input into the pre-trained scene category classifier, the probability that the sequence belongs to each scene category is calculated, and the category with the highest probability is set as the first scene category to which the video clip belongs
- the probability value can be calculated as: p(Y_j | Z) = ∏_{i=0..n} ∏_{k=1..K} p(Y_j | z_i^k), where z_i^k is the k-th sub-block of the i-th frame, Y_j is the j-th scene category, and K is the number of sub-blocks per frame
- the scene category classifier can be pre-trained prior to performing the alignment operation.
- training the scene classifier can include the following steps:
- Step 1 Obtain a video sequence sample, and divide the video sequence sample into multiple scene categories according to a scene;
- the video sequence samples can be divided into relatively coarse categories by scene while preserving temporal order; during coarse positioning it is only necessary to determine which category the current video clip most resembles. The classification works as follows:
- let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample; it is divided into multiple categories by scene, as shown in FIG. 2
- Y_l is the l-th video clip in the video sequence sample, and each clip includes several video frames
- the accumulated inter-frame error is: d(Y) = Σ_i ||f(y_i) − f(y_{i−1})||
- f(y_i) is the feature representation of the i-th video frame (e.g., a per-region color histogram), and ||·|| is a distance metric function; if d(Y) is below a set threshold, the adjacent frames are assigned to the same category, and the division repeats on the remaining sequence
- Step 2 The video sequence samples of each scene category are respectively divided into a plurality of sample sub-blocks; wherein the video sequence samples include non-overlapping sample sub-blocks;
- the sub-block division may be non-overlapping or overlapping, but an overlapping division should include the non-overlapping division as a special case
- the division produces finer small images (e.g., 256×256); if a non-overlapping division does not divide the image evenly, an overlapping division can be used at the rightmost position
- after sub-block division, each original sample image y_i yields K+1 sub-block images
- Step 3 Train the deep convolution network according to the sample sub-block and its associated scene category to obtain a scene classifier.
- the deep convolution network is trained to obtain a classifier, as shown in FIG. 3 .
- the deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is nonlinearly transformed by a ReLU (Rectified Linear Units) activation function and then pooled by a pooling layer, followed by two fully-connected layers, with a Softmax function finally outputting the classification probability (the probability that the input sub-block image belongs to a given scene category)
- the best alignment position is computed as Q = argmin_{q ∈ [u−n, v]} Σ_{i=0..n} d(z_i, y_{q+i}), with Y_J extended to [y_{u−n}, ..., y_{v+n}] to avoid boundary effects, where:
- Q represents the best alignment position of the video clip within the original video sequence
- d( ⁇ ) is the distance metric function
- Z is the video segment
- z i is the ith video frame in Z
- Y_j = [y_u, y_{u+1}, ..., y_v] represents the video clip belonging to the j-th scene category in the original video sequence
- y i is the i-th video frame in Y j
- n is a positive integer
- the above video sequence alignment method uses a coarse-to-fine search strategy: coarse alignment first locates the video clips belonging to the first scene category in the original video sequence, and fine alignment then matches the sequence to be aligned against those clips, which effectively reduces alignment time and improves alignment efficiency.
- the present invention provides a video sequence alignment system, which may include:
- the video capture module 10 is configured to capture a video clip without scene switching from the video sequence to be aligned
- the length of the video sequence should satisfy a certain time cost constraint, and the time cost constraint is used to characterize the time taken for the video sequence alignment operation.
- the longer the length of the video sequence the longer the alignment process takes.
- to satisfy this constraint, a short video clip is generally captured (for example, a clip 1 second in length)
- setting the time-cost constraint improves the real-time responsiveness of the alignment result, shortens the user's waiting time, and improves the user experience
- the basic criterion is that the captured clip should vary as little as possible from frame to frame, with no scene switching
- a decision module can be provided that uses the accumulated inter-frame error as the criterion: d(Z) = Σ_{i=1..n} ||f(z_i) − f(z_{i−1})|| &lt; T, where:
- f(z_i) is the feature of the i-th video frame (e.g., a per-region color histogram)
- f(z_{i−1}) is the feature of the (i−1)-th video frame
- ||·|| is a distance metric function (e.g., the L2 distance)
- T is a preset distance threshold
- n is the total number of video segments in the video sequence to be aligned. No scene switching means that the video content is basically the same, which is good for classification.
- the sequence generating module 20 is configured to separately divide each video frame in the video segment into a plurality of sub-blocks, and generate a video segment sequence according to the sub-blocks of the respective video frames;
- assuming the captured clip is Z = [z_0, z_1, ..., z_n] and each frame includes K sub-blocks, the generated video clip sequence consists of the sub-blocks z_i^k (i = 0, ..., n; k = 1, ..., K)
- the calculating module 30 is configured to input the video clip sequence into a pre-trained scene category classifier, calculate the probability that the sequence belongs to each scene category, and set the category with the highest probability as the first scene category to which the video clip belongs
- the probability value can be calculated as: p(Y_j | Z) = ∏_{i=0..n} ∏_{k=1..K} p(Y_j | z_i^k), where z_i^k is the k-th sub-block of the i-th frame and K is the number of sub-blocks per frame
- the scene category classifier can be pre-trained prior to performing the alignment operation.
- the video sequence alignment system can also include:
- a classification module configured to acquire a video sequence sample, and divide the video sequence sample into multiple scene categories according to a scene
- the video sequence samples can be divided into relatively coarse categories by scene while preserving temporal order; during coarse positioning it is only necessary to determine which category the current video clip most resembles. The classification works as follows:
- let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample; it is divided into multiple categories by scene, as shown in FIG. 2
- Y_l is the l-th video clip in the video sequence sample, and each clip includes several video frames
- the accumulated inter-frame error is: d(Y) = Σ_i ||f(y_i) − f(y_{i−1})||
- f(y_i) is the feature representation of the i-th video frame (e.g., a per-region color histogram), and ||·|| is a distance metric function; if d(Y) is below a set threshold, the adjacent frames are assigned to the same category, and the division repeats on the remaining sequence
- a sub-block division module configured to separately divide a video sequence sample of each scene category into a plurality of sample sub-blocks; wherein the video sequence samples include non-overlapping sample sub-blocks;
- the sub-block division may be non-overlapping or overlapping, but an overlapping division should include the non-overlapping division as a special case
- the division produces finer small images (e.g., 256×256); if a non-overlapping division does not divide the image evenly, an overlapping division can be used at the rightmost position
- after sub-block division, each original sample image y_i yields K+1 sub-block images
- a training module configured to train the deep convolutional network on the sample sub-blocks and their associated scene categories to obtain the scene category classifier.
- the deep convolution network is trained to obtain a classifier, as shown in FIG. 3 .
- the deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is nonlinearly transformed by a ReLU (Rectified Linear Units) activation function and then pooled by a pooling layer, followed by two fully-connected layers, with a Softmax function finally outputting the classification probability (the probability that the input sub-block image belongs to a given scene category)
- the aligning module 40 is configured to align the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence.
- the alignment module 40 precisely locates the position of the current video clip within the video clips of the first scene category.
- the best alignment position is computed as Q = argmin_{q ∈ [u−n, v]} Σ_{i=0..n} d(z_i, y_{q+i}), with Y_J extended to [y_{u−n}, ..., y_{v+n}] to avoid boundary effects, where:
- Q represents the best alignment position of the video clip within the original video sequence
- d( ⁇ ) is the distance metric function
- Z is the video segment
- z i is the ith video frame in Z
- Y_j = [y_u, y_{u+1}, ..., y_v] represents the video clip belonging to the j-th scene category in the original video sequence
- y i is the i-th video frame in Y j
- n is a positive integer
- the video sequence alignment system uses a coarse-to-fine search strategy: coarse alignment first locates the video clips belonging to the first scene category in the original video sequence, and fine alignment then matches the sequence to be aligned against those clips, which effectively reduces alignment time and improves alignment efficiency.
- the video sequence alignment system of the present invention corresponds one-to-one with the video sequence alignment method of the present invention; the technical features and beneficial effects described in the embodiments of the method apply equally to the embodiments of the system.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to a method and system for video sequence alignment. The method comprises the following steps: extracting, from a video sequence to be aligned, a video clip without a scene change; dividing respective video frames in the video clip into a plurality of sub-blocks, and generating a video clip sequence according to the sub-blocks of the respective video frames; inputting the video clip sequence to a pre-trained scene category classifier to calculate probability values that the video clip sequence belongs to respective scene categories, and setting the scene category associated with a maximum probability value to be a first scene category to which the video clip belongs; and aligning the video clip with a video clip belonging to the first scene category in a pre-stored original video sequence.
Description
The present invention relates to the field of signal detection technologies, and in particular, to a video sequence alignment method and system.
A display device is a device that can output images or touch information. To ensure the normal operation of a display device, it is usually necessary to detect some of its performance parameters. Taking a TV as an example, the sensitivity of the TV's motherboard is an important performance parameter of the TV.
The existing scheme for detecting the sensitivity of a TV motherboard is: using the original video signal as a reference, aligning the video signal to be detected with the original video signal, adjusting the signal strength of the aligned video signal to the critical strength between the absence and the appearance of the mosaic effect in the display device's output, and determining the performance parameters of the display device according to that signal strength.
However, this approach requires considerable time for video signal alignment, resulting in low signal processing efficiency.
Summary of the Invention
Based on this, it is necessary to provide a video sequence alignment method and system that address the problem of low signal processing efficiency.
A video sequence alignment method includes the following steps:
capturing a video clip without scene switching from a video sequence to be aligned;
dividing each video frame in the video clip into a plurality of sub-blocks, and generating a video clip sequence from the sub-blocks of the respective video frames;
inputting the video clip sequence into a pre-trained scene category classifier, calculating the probability that the video clip sequence belongs to each scene category, and setting the scene category with the highest probability as the first scene category to which the video clip belongs;
aligning the video clip with the video clip belonging to the first scene category in a pre-stored original video sequence.
A video sequence alignment system comprises:
a video capture module, configured to capture a video clip without scene switching from a video sequence to be aligned;
a sequence generating module, configured to divide each video frame in the video clip into a plurality of sub-blocks and generate a video clip sequence from the sub-blocks of each video frame;
a calculation module, configured to input the video clip sequence into a pre-trained scene category classifier, calculate the probability that the sequence belongs to each scene category, and set the category with the highest probability as the first scene category to which the video clip belongs;
an alignment module, configured to align the video clip with the video clip belonging to the first scene category in a pre-stored original video sequence.
In the above video sequence alignment method and system, a video clip without scene switching is captured from the video sequence to be aligned; each video frame in the clip is divided into several sub-blocks, and a video clip sequence is generated from those sub-blocks; the probability that the sequence belongs to each scene category is calculated, and the category with the highest probability is set as the first scene category to which the clip belongs; the clip is then aligned with the clips of that category in the pre-stored original video sequence. Coarse alignment first locates the clips belonging to the first scene category in the original sequence, and fine alignment then matches the sequence to be aligned against them, which effectively reduces alignment time and improves alignment efficiency.
FIG. 1 is a flow chart of a video sequence alignment method of an embodiment;
FIG. 2 is a schematic diagram of an original video sequence classified by scene according to an embodiment;
FIG. 3 is a schematic structural diagram of a deep convolutional network of an embodiment;
FIG. 4 is a schematic structural diagram of a video sequence alignment system of an embodiment.
The technical solution of the present invention is described below with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a video sequence alignment method, which may include the following steps:
S1: capture a video clip without scene switching from the video sequence to be aligned.
The length of the video sequence should satisfy a time-cost constraint that characterizes the time taken by the alignment operation. In general, the longer the video sequence, the longer the alignment takes. To satisfy this constraint, a short video clip is generally captured (for example, a clip 1 second in length). Setting the time-cost constraint improves the real-time responsiveness of the alignment result, shortens the user's waiting time, and improves the user experience.
After a video clip is captured, it must be checked; if it does not meet the condition, a new clip is captured. The basic criterion is that the clip should vary as little as possible from frame to frame, with no scene switching. The accumulated inter-frame error serves as the criterion:

d(Z) = Σ_{i=1..n} ||f(z_i) − f(z_{i−1})|| &lt; T

where f(z_i) is the feature of the i-th video frame (e.g., a per-region color histogram), f(z_{i−1}) is the feature of the (i−1)-th frame, ||·|| is a distance metric function (e.g., the L2 distance), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
If the condition is not met, the video clip is re-captured. In general, a clip within 1 second easily satisfies the condition, so capture is rarely repeated. The absence of scene switching means the video content is largely consistent, which benefits classification.
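The acceptance test above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame format (flat lists of pixel values), the single global histogram, and the bin count are all illustrative assumptions; the patent uses per-region color histograms.

```python
def color_histogram(frame, bins=4):
    """Toy per-frame feature: a flat histogram of pixel values in [0, 256)."""
    hist = [0] * bins
    step = 256 // bins
    for px in frame:
        hist[min(px // step, bins - 1)] += 1
    return hist

def l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def has_no_scene_switch(frames, threshold):
    """Accept the clip only if the accumulated inter-frame error
    d(Z) = sum_i ||f(z_i) - f(z_{i-1})|| stays below the threshold T."""
    feats = [color_histogram(f) for f in frames]
    d = sum(l2_distance(feats[i], feats[i - 1]) for i in range(1, len(feats)))
    return d < threshold

# A clip whose frames barely change passes; an abrupt content change fails
# and would trigger a re-capture.
steady = [[10, 10, 200, 200]] * 5
cut = [[10, 10, 10, 10]] * 3 + [[250, 250, 250, 250]] * 2
```

A clip that fails the check would simply be discarded and a new one captured, as described above.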
S2: divide each video frame in the video clip into several sub-blocks, and generate a video clip sequence from the sub-blocks of each frame.
Assume the clip captured in step S1 is Z = [z_0, z_1, ..., z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each frame includes K sub-blocks, this step generates the video clip sequence consisting of the sub-blocks z_i^k (i = 0, ..., n; k = 1, ..., K).
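A toy sketch of this sub-block division, assuming frames are small 2-D pixel arrays and K comes from a rows×cols grid of non-overlapping tiles (the grid layout is an assumption; the patent only states that each frame is divided into K sub-blocks):

```python
def split_frame(frame, rows, cols):
    """Divide one video frame (a 2-D list of pixels) into rows*cols
    non-overlapping sub-blocks; K = rows * cols."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // rows, w // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            blocks.append([row[c * bw:(c + 1) * bw]
                           for row in frame[r * bh:(r + 1) * bh]])
    return blocks

def to_segment_sequence(clip, rows=2, cols=2):
    """Z = [z_0, ..., z_n] -> the sequence of sub-blocks z_i^k that is
    fed to the scene-category classifier."""
    return [split_frame(z, rows, cols) for z in clip]

frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
seq = to_segment_sequence([frame])  # one frame, K = 4 sub-blocks
```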
S3: input the video clip sequence into the pre-trained scene category classifier, calculate the probability that the sequence belongs to each scene category, and set the category with the highest probability as the first scene category to which the clip belongs.
The probability value can be calculated according to the following formula:

p(Y_j | Z) = ∏_{i=0..n} ∏_{k=1..K} p(Y_j | z_i^k)

where z_i^k denotes the k-th sub-block in the i-th video frame of the clip sequence, Y_j denotes the video clip belonging to the j-th scene category in the original video sequence, p(Y_j | z_i^k) is the probability that a sub-block belongs to the j-th scene category, p(Y_j | Z) is the probability that the clip sequence belongs to the j-th scene category, K is the total number of sub-blocks in one frame of the clip sequence, and ∏ denotes multiplication.
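The product-and-argmax rule above can be sketched as follows. Summing log-probabilities instead of multiplying raw probabilities is an implementation choice to avoid numerical underflow, not something the patent specifies; the argmax is unchanged.

```python
import math

def classify_clip(subblock_probs):
    """subblock_probs[i][k][j]: classifier output p(Y_j | z_i^k) for frame i,
    sub-block k, scene category j.  The clip-level score is the product over
    all frames and sub-blocks; the argmax is the first scene category."""
    n_cats = len(subblock_probs[0][0])
    scores = [0.0] * n_cats
    for frame in subblock_probs:
        for block in frame:
            for j, p in enumerate(block):
                scores[j] += math.log(p)  # log-product over i and k
    best = max(range(n_cats), key=lambda j: scores[j])
    return best, scores

# Hypothetical outputs: two frames, two sub-blocks each, three categories.
probs = [
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1]],
]
best, _ = classify_clip(probs)  # category 0 dominates every sub-block
```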
The scene category classifier can be trained in advance, before the alignment operation is performed. Training the scene classifier can include the following steps:
Step 1: obtain a video sequence sample, and divide it into multiple scene categories by scene.
In a video sequence, if the scene does not switch, adjacent images are highly similar. The video sequence samples can therefore be divided into relatively coarse categories by scene while preserving temporal order; during coarse positioning it is only necessary to determine which category the current clip most resembles. The classification works as follows:
Let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample. It is divided into multiple categories by scene, as shown in FIG. 2, where Y_l is the l-th video clip in the sample and each clip includes several video frames.
Scene boundaries can be annotated in advance and the division performed from those annotations (an original video sequence is typically 20-30 minutes, so the labeling effort is small and one-off), or scene classification can be performed automatically using the typical accumulated inter-frame error:

d(Y) = Σ_i ||f(y_i) − f(y_{i−1})||

where f(y_i) is the feature representation of the i-th video frame (e.g., a per-region color histogram) and ||·|| is a distance metric function. If d(Y) is below the set threshold, the current adjacent images are assigned to the same category; the division then repeats on the remaining, undivided sequence.
Step 2: divide the video sequence samples of each scene category into several sample sub-blocks, where the samples include non-overlapping sub-blocks.
For each image sample in each scene category, a non-overlapping sub-block division is performed (an overlapping division is also possible, but it should include the non-overlapping division as a special case), producing finer small images (e.g., 256×256; if a non-overlapping division does not divide the image evenly by 256, an overlapping division can be used at the rightmost position) for training the deep convolutional network. As a general learning strategy, the more samples the better: the non-overlapping division uses the minimum number of samples, with no overlap between them, and an overlapping division must include the non-overlapping case or generality is lost. The benefits are: 1) the number of samples increases, which aids training of the deep convolutional network; 2) the sample images become smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers complexity. For example, each original sample image y_i yields K+1 sub-block images after division.
步骤3,根据所述样本子块及其所属的场景类别对深度卷积网络进行训练,得到场景类别分类器。
Step 3: Train the deep convolutional network according to the sample sub-blocks and the scene categories to which they belong, to obtain the scene category classifier.
利用收集的场景类别样本图(即子块图像)及其标注(子块图像对应的场景类别),对深度卷积网络进行训练,得到分类器,如图3所示。本发明采用的深度卷积网络包括五个卷积层(Convolutional Layer),每个卷积层的输出都经过ReLU(Rectified Linear Units)激活函数进行非线性变换,再经过池化层(Pooling Layer)进行池化,再接两个全连接层(Fully-Connected Layer),最后通过Softmax函数输出分类概率(输入子块图像属于某个场景类别的概率)。Using the collected scene category sample images (i.e., the sub-block images) and their annotations (the scene category corresponding to each sub-block image), the deep convolutional network is trained to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is nonlinearly transformed by a ReLU (Rectified Linear Units) activation function and then pooled by a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
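The architecture above (five convolutional layers, each with ReLU and pooling, then two fully-connected layers and a Softmax output) can be illustrated with a minimal sketch. The kernel sizes, strides, and padding below are assumed values, since the text does not specify them; the sketch only shows how a 256×256 input shrinks through the conv/pool stack and how Softmax turns raw class scores into scene-category probabilities.

```python
import math

def relu(x):
    """ReLU activation: elementwise max(0, v)."""
    return [max(0.0, v) for v in x]

def softmax(scores):
    """Softmax over raw class scores -> probability per scene category."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def spatial_size(size, layers):
    """Output side length after a stack of (kernel, stride, padding) stages,
    using the standard formula (size + 2*pad - kernel) // stride + 1."""
    for kernel, stride, pad in layers:
        size = (size + 2 * pad - kernel) // stride + 1
    return size

# Five conv layers (assumed kernel 3, stride 1, pad 1), each followed by
# an assumed 2x2 pooling stage that halves the spatial size:
stages = []
for _ in range(5):
    stages.append((3, 1, 1))  # convolution keeps the spatial size
    stages.append((2, 2, 0))  # pooling halves it
```

Under these assumptions a 256×256 sub-block image shrinks to an 8×8 feature map before the two fully-connected layers, which is why smaller sub-block inputs reduce the size of the fully-connected part of the network.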
S4,将所述视频片段与预存的原始视频序列中属于所述第一场景类别的视频片段进行对齐。S4. Align the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence.
步骤S3已经粗略定位当前视频片段Z=[z_0,z_1,...z_n]属于哪个类别Y_J=[y_u,y_u+1,...y_v]。本步骤即将在Y_J中精确地定位当前视频片段所属位置。为了防止边界问题,可以将Y_J左右扩展为Y_J=[y_{u-n},y_{u-n+1},...y_{v+n}],则精确对齐的计算方式为:Step S3 has roughly located which category Y_J = [y_u, y_u+1, ... y_v] the current video segment Z = [z_0, z_1, ... z_n] belongs to. This step precisely locates the position of the current video segment within Y_J. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ... y_{v+n}]; the precise alignment is then computed as:
其中,YJ=[yu-n,yu-n+1,...yv+n];Where Y J =[y un , y u-n+1 ,...y v+n ];
式中,Q表示所述视频片段与原始视频序列的最佳对齐位置,d(·)为距离度量函数,Z为所述视频片段,z_i为Z中的第i个视频帧,Y_j=[y_u,y_u+1,...y_v]表示所述原始视频序列中属于第j场景类别的视频片段,y_i为Y_j中的第i个视频帧,y_{u-i}(i=1,2,...,n)为y_0前i时刻的视频帧,y_{v+i}(i=1,2,...,n)为y_n后i时刻的视频帧,n为正整数,q∈[u-n,v]。Where Q represents the best alignment position of the video segment in the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_u+1, ... y_v] represents the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) is the video frame at time i before y_0, y_{v+i} (i = 1, 2, ..., n) is the video frame at time i after y_n, n is a positive integer, and q ∈ [u-n, v].
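The fine alignment step can be sketched as a sliding-window search: slide the captured clip across the boundary-extended candidate segment and pick the offset q that minimizes the summed frame distance. This is an illustrative toy version; the scalar "frame features" and the absolute-difference distance stand in for the real feature representation and metric d(·).

```python
def align(clip, segment):
    """Return the offset q within `segment` minimizing
    sum_i d(clip[i], segment[q + i]) -- the best alignment position Q."""
    best_q, best_cost = 0, float("inf")
    for q in range(len(segment) - len(clip) + 1):
        # accumulated distance between the clip and this candidate window
        cost = sum(abs(z - y) for z, y in zip(clip, segment[q:q + len(clip)]))
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q
```

Because the search is restricted to the coarsely located (and boundary-extended) segment rather than the full original sequence, the number of candidate offsets is small, which is the source of the method's speed-up.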
上述视频序列对齐方法,采用由粗到精的搜索策略,通过先进行粗对齐找到原始视频序列中属于所述第一场景类别的视频片段,再将待对齐的视频序列与所述第一场景类别的视频片段进行精对齐,有效减少了视频对齐的时间,提高了视频对齐的效率。The above video sequence alignment method uses a coarse-to-fine search strategy: coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the video sequence to be aligned is then finely aligned with those video segments, which effectively reduces the time required for video alignment and improves its efficiency.
如图2所示,本发明提供一种视频序列对齐系统,可包括:As shown in FIG. 2, the present invention provides a video sequence alignment system, which may include:
视频抓取模块10,用于从待对齐的视频序列中抓取无场景切换的视频片段;The video capture module 10 is configured to capture a video clip without scene switching from the video sequence to be aligned;
其中,所述视频序列的长度应满足一定的时间代价约束条件,所述时间代价约束条件用于表征视频序列对齐操作花费的时间。一般来说,视频序列的长度越长,对齐过程花费的时间越长。为了满足上述约束条件,一般抓取一段较短的视频片段(例如长度为1秒的视频片段)。通过设置时间代价约束条件,能够提高对齐结果的实时性,缩短用户等待时间,提高用户体验。The length of the video sequence should satisfy a certain time cost constraint, which characterizes the time taken by the video sequence alignment operation. In general, the longer the video sequence, the longer the alignment process takes. To satisfy this constraint, a relatively short video segment is generally captured (for example, a segment one second in length). Setting the time cost constraint improves the real-time performance of the alignment result, shortens the user's waiting time, and improves the user experience.
抓取视频片段后,需要对抓取的视频片段进行判断,若不符合条件,则重新抓取。判断的基本原理是:尽量保持获取的视频片段前后变化小,无场景切换等。可设置一判定模块,采用累积的帧间误差作为评判标准进行判断,累积的帧间误差为:After capturing a video segment, the captured segment must be checked; if it does not meet the conditions, it is re-captured. The basic principle of the check is to keep the frame-to-frame variation of the captured segment as small as possible, with no scene switching. A decision module can be provided that uses the accumulated inter-frame error as the criterion; the accumulated inter-frame error is:
式中,f(z_i)为第i个视频帧的特征(例如分区域的颜色直方图),f(z_{i-1})为第i-1个视频帧的特征,||·||为距离度量函数(例如,L2距离度量函数),T为预设的距离阈值,n为所述待对齐的视频序列中的视频片段的总数。无场景切换表示视频内容基本一致,有利于分类。Where f(z_i) is the feature of the i-th video frame (e.g., a region-wise color histogram), f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function (for example, the L2 distance metric), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned. No scene switching means the video content is essentially consistent, which benefits classification.
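The no-scene-switch test above can be sketched as a simple acceptance check: the captured clip passes only if the accumulated distance between adjacent frame features stays below the preset threshold T. The frame features and threshold value here are illustrative stand-ins.

```python
def no_scene_switch(features, T):
    """True if sum_i ||f(z_i) - f(z_{i-1})|| < T for the captured clip,
    i.e. the clip contains no scene switch and may be used for alignment."""
    total = 0.0
    for prev, cur in zip(features, features[1:]):
        # L2 distance between adjacent frame features
        total += sum((a - b) ** 2 for a, b in zip(prev, cur)) ** 0.5
    return total < T
```

A capture loop would then simply re-grab a segment until this check passes; as the text notes, a one-second clip usually passes on the first attempt.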
若不满足上述条件,则需要重新抓取视频片段。一般来说,1秒内的视频片段很容易满足上述条件,因此不会过多地重复采集。If the above conditions are not met, you will need to recapture the video clip. In general, video clips within 1 second can easily satisfy the above conditions, so the acquisition is not repeated too much.
序列生成模块20,用于分别将所述视频片段中的各个视频帧划分为若干个子块,根据各个视频帧的子块生成视频片段序列;The sequence generating module 20 is configured to separately divide each video frame in the video segment into a plurality of sub-blocks, and generate a video segment sequence according to the sub-blocks of the respective video frames;
假设视频抓取模块10中抓取到的视频片段为Z=[z_0,z_1,...z_n],其中,z_i(i=1,2,...,n)为第i个视频帧。若每个视频帧中包括K个子块,在序列生成模块20中,可以生成如下视频片段序列:Assume the video segment captured by the video capture module 10 is Z = [z_0, z_1, ... z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each video frame includes K sub-blocks, the sequence generation module 20 can generate the following video segment sequence:
计算模块30,用于将所述视频片段序列输入至预先训练的场景类别分类器,分别计算所述视频片段序列属于各个场景类别的概率值,将概率值最大的场景类别设为所述视频片段所属的第一场景类别;The calculating module 30 is configured to input the video segment sequence into the pre-trained scene category classifier, calculate the probability values of the video segment sequence belonging to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs;
其中,所述概率值可以根据如下公式计算:Wherein, the probability value can be calculated according to the following formula:
式中,z_i^k表示所述视频片段序列的第i个视频帧中的第k个子块,Y_j表示所述原始视频序列中属于第j场景类别的视频片段,p(Y_j/z_i^k)为所述视频片段序列中的子块属于第j场景类别的概率值,p(Y_j/Z)为所述视频片段序列属于第j场景类别的概率值,K为所述视频片段序列的一个视频帧中子块的总数,∏表示乘法操作。Where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability value that a sub-block of the video segment sequence belongs to the j-th scene category, p(Y_j/Z) is the probability value that the video segment sequence belongs to the j-th scene category, K is the total number of sub-blocks in one video frame of the video segment sequence, and ∏ denotes the product operation.
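The coarse classification step can be sketched as follows: the clip-level probability of each scene category is the product of the per-sub-block classifier outputs, and the category with the largest product wins. In practice the per-sub-block probabilities would come from the trained CNN; the hand-written values here are stand-ins for illustration.

```python
def classify_clip(subblock_probs):
    """subblock_probs: one entry per sub-block (over all frames), each a
    list of per-category probabilities. Returns (best category index,
    its product score), i.e. argmax_j of the product of p(Y_j | z_i^k)."""
    n_cats = len(subblock_probs[0])
    scores = [1.0] * n_cats
    for probs in subblock_probs:
        for j in range(n_cats):
            scores[j] *= probs[j]  # product over all sub-blocks
    best = max(range(n_cats), key=scores.__getitem__)
    return best, scores[best]
```

One practical note on this design: products of many probabilities underflow quickly, so an implementation would typically sum log-probabilities instead; the argmax is unchanged either way.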
所述场景类别分类器可以在执行对齐操作之前预先训练。所述视频序列对齐系统还可包括:The scene category classifier can be pre-trained prior to performing the alignment operation. The video sequence alignment system can also include:
分类模块,用于获取视频序列样本,将所述视频序列样本按场景划分为多个场景类别;a classification module, configured to acquire a video sequence sample, and divide the video sequence sample into multiple scene categories according to a scene;
视频序列中,若场景不进行切换,则相邻图像相似度极高。因此可以将视频序列样本按场景划分为较粗的类别,并保持时间先后关系。在粗定位时,只需确定当前视频片段与哪一个类别最为相似即可。具体分类描述如下:In a video sequence, if the scene does not switch, adjacent images are extremely similar. The video sequence samples can therefore be divided by scene into relatively coarse categories while preserving their temporal order. For coarse localization, it suffices to determine which category the current video segment is most similar to. The specific classification is described as follows:
设视频序列样本为Y=[y_1,y_2,...y_m],m为视频序列样本中的视频帧的总数。按场景划分为多个类别,如图2所示。图2中,Y_l为视频序列样本中的第l个视频片段,每个视频片段包括若干个视频帧。Let the video sequence sample be Y = [y_1, y_2, ... y_m], where m is the total number of video frames in the sample. It is divided by scene into multiple categories, as shown in FIG. 2. In FIG. 2, Y_l is the l-th video segment in the video sequence sample, and each video segment includes several video frames.
可以预先在场景边界做标注,根据该标注信息进行场景划分(一般原始视频序列20-30分钟,标注量不大,并且是一次性工作),也可以采用典型的帧间累积误差自动进行场景分类。累积的帧间误差为:Scene boundaries can be annotated in advance and the scene division performed according to that annotation information (the original video sequence is typically 20-30 minutes long, so the labeling effort is small and is a one-time task); alternatively, the typical accumulated inter-frame error can be used to classify scenes automatically. The accumulated inter-frame error is:
其中,f(y_i)表示第i个视频帧的特征表示(例如分区域的颜色直方图),||·||是距离度量函数。若d(Y)小于设置的阈值,则将当前相邻图像划分为同一个类别;后续未划分的序列则重复上述划分过程即可。Where f(y_i) represents the feature representation of the i-th video frame (e.g., a region-wise color histogram), and ||·|| is a distance metric function. If d(Y) is less than the set threshold, the current adjacent images are assigned to the same category; the remaining undivided sequence is then processed by repeating the above division.
子块划分模块,用于分别将各个场景类别的视频序列样本划分为若干个样本子块;其中,所述视频序列样本中包括非重叠的样本子块;a sub-block division module, configured to separately divide a video sequence sample of each scene category into a plurality of sample sub-blocks; wherein the video sequence samples include non-overlapping sample sub-blocks;
针对每个场景类别中每个图像样本,进行非重叠的子块划分(也可以带重叠,但应包括非重叠划分的特例),构建出更为精细的小图(例如256*256;若非重叠划分不能整除256时,可以在最右侧划分时采用重叠划分),用于训练深度卷积网络。一般学习策略中,用的样本越多越好;非重叠划分下采用的样本是最少的,不能再少,并且它们之间没有重叠;而带重叠的划分需要包括非重叠的特例,否则失去一般性。这样做的好处有:1)样本数量增多,有利于深度卷积网络训练;2)样本图像尺寸变小,可以有效地减少深度神经网络中全连接层的数量,降低复杂度。例如:每个原始样本图像y_i,经过子块划分后,可以得到K+1个子块图像。For each image sample in each scene category, non-overlapping sub-block partitioning is performed (overlapping partitioning may also be used, but it should include the non-overlapping case as a special case) to construct finer sub-images (e.g., 256*256; when the image size is not evenly divisible by 256 under non-overlapping partitioning, overlapping partitioning can be used for the rightmost division) for training the deep convolutional network. As a general learning strategy, the more samples used the better; non-overlapping partitioning uses the minimum possible number of samples, with no overlap between them, while overlapping partitioning must include the non-overlapping special case, otherwise generality is lost. The advantages are: 1) the number of samples increases, which benefits the training of the deep convolutional network; 2) the sample image size becomes smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers its complexity. For example, each original sample image y_i yields K+1 sub-block images after sub-block partitioning.
训练模块,用于根据所述样本子块及其所属的场景类别对深度卷积网络进行训练,得到场景类别分类器。A training module, configured to train the deep convolutional network according to the sample sub-blocks and the scene categories to which they belong, to obtain the scene category classifier.
利用收集的场景类别样本图(即子块图像)及其标注(子块图像对应的场景类别),对深度卷积网络进行训练,得到分类器,如图3所示。本发明采用的深度卷积网络包括五个卷积层(Convolutional Layer),每个卷积层的输出都经过ReLU(Rectified Linear Units)激活函数进行非线性变换,再经过池化层(Pooling Layer)进行池化,再接两个全连接层(Fully-Connected Layer),最后通过Softmax函数输出分类概率(输入子块图像属于某个场景类别的概率)。Using the collected scene category sample images (i.e., the sub-block images) and their annotations (the scene category corresponding to each sub-block image), the deep convolutional network is trained to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is nonlinearly transformed by a ReLU (Rectified Linear Units) activation function and then pooled by a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
对齐模块40,用于将所述视频片段与预存的原始视频序列中属于所述第一场景类别的视频片段进行对齐。The aligning module 40 is configured to align the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence.
计算模块30已经粗略定位当前视频片段Z=[z_0,z_1,...z_n]属于哪个类别Y_J=[y_u,y_u+1,...y_v]。对齐模块40即将在Y_J中精确地定位当前视频片段所属位置。为了防止边界问题,可以将Y_J左右扩展为Y_J=[y_{u-n},y_{u-n+1},...y_{v+n}],则精确对齐的计算方式为:The calculating module 30 has roughly located which category Y_J = [y_u, y_u+1, ... y_v] the current video segment Z = [z_0, z_1, ... z_n] belongs to. The alignment module 40 then precisely locates the position of the current video segment within Y_J. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ... y_{v+n}]; the precise alignment is then computed as:
其中,YJ=[yu-n,yu-n+1,...yv+n];Where Y J =[y un , y u-n+1 ,...y v+n ];
式中,Q表示所述视频片段与原始视频序列的最佳对齐位置,d(·)为距离度量函数,Z为所述视频片段,z_i为Z中的第i个视频帧,Y_j=[y_u,y_u+1,...y_v]表示所述原始视频序列中属于第j场景类别的视频片段,y_i为Y_j中的第i个视频帧,y_{u-i}(i=1,2,...,n)为y_0前i时刻的视频帧,y_{v+i}(i=1,2,...,n)为y_n后i时刻的视频帧,n为正整数,q∈[u-n,v]。Where Q represents the best alignment position of the video segment in the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_u+1, ... y_v] represents the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) is the video frame at time i before y_0, y_{v+i} (i = 1, 2, ..., n) is the video frame at time i after y_n, n is a positive integer, and q ∈ [u-n, v].
上述视频序列对齐系统,采用由粗到精的搜索策略,通过先进行粗对齐找到原始视频序列中属于所述第一场景类别的视频片段,再将待对齐的视频序列与所述第一场景类别的视频片段进行精对齐,有效减少了视频对齐的时间,提高了视频对齐的效率。The above video sequence alignment system uses a coarse-to-fine search strategy: coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the video sequence to be aligned is then finely aligned with those video segments, which effectively reduces the time required for video alignment and improves its efficiency.
本发明的视频序列对齐系统与本发明的视频序列对齐方法一一对应,在上述视频序列对齐方法的实施例阐述的技术特征及其有益效果均适用于视频序列对齐系统的实施例中,特此声明。The video sequence alignment system of the present invention corresponds one-to-one with the video sequence alignment method of the present invention; the technical features and beneficial effects described in the embodiments of the video sequence alignment method above also apply to the embodiments of the video sequence alignment system, which is hereby stated.
以上所述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
以上所述实施例仅表达了本发明的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干变形和改进,这些都属于本发明的保护范围。因此,本发明专利的保护范围应以所附权利要求为准。
The above-described embodiments are merely illustrative of several embodiments of the present invention, and the description thereof is more specific and detailed, but is not to be construed as limiting the scope of the invention. It should be noted that a number of variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the scope of the invention should be determined by the appended claims.
Claims (10)
- 一种视频序列对齐方法,其特征在于,包括以下步骤:A video sequence alignment method, comprising the steps of:从待对齐的视频序列中抓取无场景切换的视频片段;Grab a video clip without scene switching from the video sequence to be aligned;分别将所述视频片段中的各个视频帧划分为若干个子块,根据各个视频帧的子块生成视频片段序列;Separating each video frame in the video segment into a plurality of sub-blocks, and generating a video segment sequence according to the sub-blocks of the respective video frames;将所述视频片段序列输入至预先训练的场景类别分类器,分别计算所述视频片段序列属于各个场景类别的概率值,将概率值最大的场景类别设为所述视频片段所属的第一场景类别;Inputting the video segment sequence into a pre-trained scene class classifier, respectively calculating probability values of the video segment sequence belonging to each scene category, and setting a scene category having the largest probability value to the first scene category to which the video segment belongs ;将所述视频片段与预存的原始视频序列中属于所述第一场景类别的视频片段进行对齐。Aligning the video clip with a video clip belonging to the first scene category in a pre-stored original video sequence.
- 根据权利要求1所述的视频序列对齐方法,其特征在于,在将所述视频序列输入至预先训练的场景类别分类器之前,还包括以下步骤:The video sequence alignment method according to claim 1, wherein before the video sequence is input to the pre-trained scene classifier, the method further comprises the following steps:获取视频序列样本,将所述视频序列样本按场景划分为多个场景类别;Obtaining a video sequence sample, and dividing the video sequence sample into multiple scene categories according to a scene;分别将各个场景类别的视频序列样本划分为若干个样本子块;其中,所述视频序列样本中包括非重叠的样本子块;Separating the video sequence samples of the respective scene categories into a plurality of sample sub-blocks, wherein the video sequence samples include non-overlapping sample sub-blocks;根据所述样本子块及其所属的场景类别对深度卷积网络进行训练,得到场景类别分类器。The deep convolution network is trained according to the sample sub-block and its associated scene category to obtain a scene classifier.
- 根据权利要求1所述的视频序列对齐方法,其特征在于,还包括以下步骤:The video sequence alignment method according to claim 1, further comprising the steps of:若所述视频片段满足如下条件,判定所述视频片段无场景切换:If the video segment satisfies the following condition, it is determined that the video segment has no scene switching:式中,f(zi)为第i个视频帧的特征,f(zi-1)为第i-1个视频帧的特征,||·||为距离度量函数,T为预设的距离阈值,n为所述待对齐的视频序列中的视频片段的总数。Where f(z i ) is the characteristic of the ith video frame, f(z i-1 ) is the feature of the i-1th video frame, ||·|| is the distance metric function, and T is the preset The distance threshold, n is the total number of video segments in the video sequence to be aligned.
- 根据权利要求1所述的视频序列对齐方法,其特征在于,分别计算所述视频片段序列属于各个场景类别的概率值的步骤包括:The video sequence alignment method according to claim 1, wherein the step of respectively calculating the probability values of the video segment sequence belonging to each scene category comprises:根据如下公式计算所述视频片段序列属于各个场景类别的概率值:The probability values of the video segment sequence belonging to each scene category are calculated according to the following formula:式中,表示所述视频片段序列的第i个视频帧中的第k个子块,Yj表示所述原始视频序列中属于第j场景类别的视频片段,为所述视频片段序列中的子块属于第j场景类别的概率值,p(Yj/Z)为所述视频片段序列属于第j场景类别的概率值,K为所述视频片段序列的一个视频帧中子块的总数。In the formula, Representing a kth sub-block in the i-th video frame of the sequence of video segments, Y j representing a video segment belonging to the j-th scene category in the original video sequence, a sub-block in the sequence of video segments A probability value belonging to the j-th scene category, p(Y j /Z) is a probability value of the video segment sequence belonging to the j-th scene category, and K is a total number of sub-blocks in one video frame of the video segment sequence.
- 根据权利要求1所述的视频序列对齐方法,其特征在于,将所述视频片段与预存的原始视频序列中属于所述第一场景类别的视频片段进行对齐的步骤包括:The video sequence alignment method according to claim 1, wherein the step of aligning the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence comprises:根据如下公式将所述视频片段与原始视频序列中属于所述第一场景类别的视频片段进行对齐:Aligning the video clip with a video clip belonging to the first scene category in the original video sequence according to the following formula:其中,YJ=[yu-n,yu-n+1,...yv+n];Where Y J =[y un , y u-n+1 ,...y v+n ];式中,Q表示所述视频片段与原始视频序列的最佳对齐位置,d(·)为距离度量函数,Z为所述视频片段,zi为Z中的第i个视频帧,Yj=[yu,yu+1,...yv]表示所述原始视频序列中属于第j场景类别的视频片段,yi为Yj中的第i个视频帧,yu-i(i=1,2,…,n)为y0前i时刻的视频帧,yv+i(i=1,2,…,n)为yn后i时刻的视频帧,n为正整数,q∈[u-n,v]。Where Q represents the best alignment position of the video segment with the original video sequence, d(·) is the distance metric function, Z is the video segment, z i is the ith video frame in Z, Y j = [y u , y u+1 , .y v ] represents a video segment belonging to the jth scene category in the original video sequence, y i is the i-th video frame in Y j , y ui (i=1) , 2,...,n) is the video frame at the time i before y 0 , y v+i (i=1,2,...,n) is the video frame at time i after y n , n is a positive integer, q∈[ Un, v].
- 一种视频序列对齐系统,其特征在于,包括:A video sequence alignment system, comprising:视频抓取模块,用于从待对齐的视频序列中抓取无场景切换的视频片段;a video capture module, configured to capture a video clip without scene switching from a video sequence to be aligned;序列生成模块,用于分别将所述视频片段中的各个视频帧划分为若干个子块,根据各个视频帧的子块生成视频片段序列;a sequence generating module, configured to separately divide each video frame in the video segment into a plurality of sub-blocks, and generate a video segment sequence according to the sub-blocks of each video frame;计算模块,用于将所述视频片段序列输入至预先训练的场景类别分类器,分别计算所述视频片段序列属于各个场景类别的概率值,将概率值最大的场景类别设为所述视频片段所属的第一场景类别;a calculation module, configured to input the sequence of the video segments into a pre-trained scene classifier, calculate a probability value of the video segment sequence belonging to each scene category, and set a scene category with the highest probability value as the video segment belongs to First scene category;对齐模块,用于将所述视频片段与预存的原始视频序列中属于所述第一场景类别的视频片段进行对齐。And an aligning module, configured to align the video segment with a video segment belonging to the first scene category in a pre-stored original video sequence.
- 根据权利要求6所述的视频序列对齐系统,其特征在于,还包括:The video sequence alignment system according to claim 6, further comprising:分类模块,用于获取视频序列样本,将所述视频序列样本按场景划分为多个场景类别;a classification module, configured to acquire a video sequence sample, and divide the video sequence sample into multiple scene categories according to a scene;子块划分模块,用于分别将各个场景类别的视频序列样本划分为若干个样本子块;其 中,所述视频序列样本中包括非重叠的样本子块;a sub-block division module, configured to separately divide a video sequence sample of each scene category into a plurality of sample sub-blocks; The non-overlapping sample sub-blocks are included in the video sequence sample;训练模块,用于根据所述样本子块及其所属的场景类别对深度卷积网络进行训练,得到场景类别分类器。The training module is configured to train the deep convolution network according to the sample sub-block and the scene category to which it belongs, to obtain a scene category classifier.
- 根据权利要求6所述的视频序列对齐系统,其特征在于,还包括:The video sequence alignment system according to claim 6, further comprising:判定模块,用于若所述视频片段满足如下条件,判定所述视频片段无场景切换:a determining module, configured to determine that the video segment has no scene switching if the video segment meets the following conditions:式中,f(zi)为第i个视频帧的特征,f(zi-1)为第i-1个视频帧的特征,||·||为距离度量函数,T为预设的距离阈值,n为所述待对齐的视频序列中的视频片段的总数。Where f(z i ) is the characteristic of the ith video frame, f(z i-1 ) is the feature of the i-1th video frame, ||·|| is the distance metric function, and T is the preset The distance threshold, n is the total number of video segments in the video sequence to be aligned.
- 根据权利要求6所述的视频序列对齐系统,其特征在于,所述计算模块进一步根据如下公式计算所述视频片段序列属于各个场景类别的概率值:The video sequence alignment system according to claim 6, wherein the calculation module further calculates a probability value of the video segment sequence belonging to each scene category according to the following formula:式中,表示所述视频片段序列的第i个视频帧中的第k个子块,Yj表示所述原始视频序列中属于第j场景类别的视频片段,为所述视频片段序列中的子块属于第j场景类别的概率值,p(Yj/Z)为所述视频片段序列属于第j场景类别的概率值,K为所述视频片段序列的一个视频帧中子块的总数。In the formula, Representing a kth sub-block in the i-th video frame of the sequence of video segments, Y j representing a video segment belonging to the j-th scene category in the original video sequence, a sub-block in the sequence of video segments A probability value belonging to the j-th scene category, p(Y j /Z) is a probability value of the video segment sequence belonging to the j-th scene category, and K is a total number of sub-blocks in one video frame of the video segment sequence.
- 根据权利要求6所述的视频序列对齐系统,其特征在于,所述对齐模块进一步根据如下公式将所述视频片段与原始视频序列中属于所述第一场景类别的视频片段进行对齐:The video sequence alignment system according to claim 6, wherein the alignment module further aligns the video segment with a video segment belonging to the first scene category in an original video sequence according to the following formula:其中,YJ=[yu-n,yu-n+1,...yv+n];Where Y J =[y un , y u-n+1 ,...y v+n ];式中,Q表示所述视频片段与原始视频序列的最佳对齐位置,d(·)为距离度量函数,Z为所述视频片段,zi为Z中的第i个视频帧,Yj=[yu,yu+1,...yv]表示所述原始视频序列中属于第j场景类别的视频片段,yi为Yj中的第i个视频帧,yu-i(i=1,2,…,n)为y0前i时刻的视频帧,yv+i(i=1,2,…,n)为yn后i时刻的视频帧,n为正整数,q∈[u-n,v]。 Where Q represents the best alignment position of the video segment with the original video sequence, d(·) is the distance metric function, Z is the video segment, z i is the ith video frame in Z, Y j = [y u , y u+1 , .y v ] represents a video segment belonging to the jth scene category in the original video sequence, y i is the i-th video frame in Y j , y ui (i=1) , 2,...,n) is the video frame at the time i before y 0 , y v+i (i=1,2,...,n) is the video frame at time i after y n , n is a positive integer, q∈[ Un, v].
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610986953.9 | 2016-11-09 | ||
CN201610986953.9A CN106612457B (en) | 2016-11-09 | 2016-11-09 | Video sequence alignment schemes and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018086231A1 true WO2018086231A1 (en) | 2018-05-17 |
Family
ID=58614979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/113542 WO2018086231A1 (en) | 2016-11-09 | 2016-12-30 | Method and system for video sequence alignment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106612457B (en) |
WO (1) | WO2018086231A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107194419A (en) * | 2017-05-10 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Video classification methods and device, computer equipment and computer-readable recording medium |
CN108537134B (en) * | 2018-03-16 | 2020-06-30 | 北京交通大学 | Video semantic scene segmentation and labeling method |
CN108682436B (en) * | 2018-05-11 | 2020-06-23 | 北京海天瑞声科技股份有限公司 | Voice alignment method and device |
CN110147700B (en) * | 2018-05-18 | 2023-06-27 | 腾讯科技(深圳)有限公司 | Video classification method, device, storage medium and equipment |
CN111723617B (en) * | 2019-03-20 | 2023-10-27 | 顺丰科技有限公司 | Method, device, equipment and storage medium for identifying actions |
CN110347875B (en) * | 2019-07-08 | 2022-04-15 | 北京字节跳动网络技术有限公司 | Video scene classification method and device, mobile terminal and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080310734A1 (en) * | 2007-06-18 | 2008-12-18 | The Regents Of The University Of California | High speed video action recognition and localization |
CN101692269A (en) * | 2009-10-16 | 2010-04-07 | 北京中星微电子有限公司 | Method and device for processing video programs |
CN105184271A (en) * | 2015-09-18 | 2015-12-23 | 苏州派瑞雷尔智能科技有限公司 | Automatic vehicle detection method based on deep learning |
CN105704485A (en) * | 2016-02-02 | 2016-06-22 | 广州视源电子科技股份有限公司 | Display device performance parameter detection method and system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680342B2 (en) * | 2004-08-16 | 2010-03-16 | Fotonation Vision Limited | Indoor/outdoor classification in digital images |
JP2008234623A (en) * | 2007-02-19 | 2008-10-02 | Seiko Epson Corp | Category classification apparatus and method, and program |
CN101814147B (en) * | 2010-04-12 | 2012-04-25 | 中国科学院自动化研究所 | Method for realizing classification of scene images |
CN103366181A (en) * | 2013-06-28 | 2013-10-23 | 安科智慧城市技术(中国)有限公司 | Method and device for identifying scene integrated by multi-feature vision codebook |
CN104881675A (en) * | 2015-05-04 | 2015-09-02 | 北京奇艺世纪科技有限公司 | Video scene identification method and apparatus |
CN105227907B (en) * | 2015-08-31 | 2018-07-27 | 电子科技大学 | Unsupervised anomalous event real-time detection method based on video |
CN105550699B (en) * | 2015-12-08 | 2019-02-12 | 北京工业大学 | A kind of video identification classification method based on CNN fusion space-time remarkable information |
CN105847964A (en) * | 2016-03-28 | 2016-08-10 | 乐视控股(北京)有限公司 | Movie and television program processing method and movie and television program processing system |
-
2016
- 2016-11-09 CN CN201610986953.9A patent/CN106612457B/en active Active
- 2016-12-30 WO PCT/CN2016/113542 patent/WO2018086231A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080310734A1 (en) * | 2007-06-18 | 2008-12-18 | The Regents Of The University Of California | High speed video action recognition and localization |
CN101692269A (en) * | 2009-10-16 | 2010-04-07 | 北京中星微电子有限公司 | Method and device for processing video programs |
CN105184271A (en) * | 2015-09-18 | 2015-12-23 | 苏州派瑞雷尔智能科技有限公司 | Automatic vehicle detection method based on deep learning |
CN105704485A (en) * | 2016-02-02 | 2016-06-22 | 广州视源电子科技股份有限公司 | Display device performance parameter detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN106612457B (en) | 2019-09-03 |
CN106612457A (en) | 2017-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018086231A1 (en) | Method and system for video sequence alignment | |
Masood et al. | License plate detection and recognition using deeply learned convolutional neural networks | |
CN107316007B (en) | Monitoring image multi-class object detection and identification method based on deep learning | |
CN110609920B (en) | Pedestrian hybrid search method and system in video monitoring scene | |
CN104778474B (en) | A kind of classifier construction method and object detection method for target detection | |
US8929595B2 (en) | Dictionary creation using image similarity | |
CN109829467A (en) | Image labeling method, electronic device and non-transient computer-readable storage medium | |
WO2022121766A1 (en) | Method and apparatus for detecting free space | |
CN106778736B (en) | Robust license plate recognition method and system | |
CN109859164A (en) | A method of by Quick-type convolutional neural networks to PCBA appearance test | |
CN112232237B (en) | Method, system, computer device and storage medium for monitoring vehicle flow | |
WO2014082480A1 (en) | Method and device for calculating number of pedestrians and crowd movement directions | |
CN108052931A (en) | A kind of license plate recognition result fusion method and device | |
CN104200218B (en) | A kind of across visual angle action identification method and system based on timing information | |
CN111241987B (en) | Multi-target model visual tracking method based on cost-sensitive three-branch decision | |
CN110222627A (en) | A kind of face amended record method | |
CN106572387A (en) | Video sequence alignment method and video sequence alignment system | |
CN102298695B (en) | Visual analyzing and processing method for detecting paper money bundle | |
CN113269038B (en) | Multi-scale-based pedestrian detection method | |
CN112232236B (en) | Pedestrian flow monitoring method, system, computer equipment and storage medium | |
Lin et al. | A traffic sign recognition method based on deep visual feature | |
CN109325467A (en) | A kind of wireless vehicle tracking based on video detection result | |
CN103310088A (en) | Automatic detecting method of abnormal illumination power consumption | |
CN109325487B (en) | Full-category license plate recognition method based on target detection | |
CN116469164A (en) | Human gesture recognition man-machine interaction method and system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16921379 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.10.2019) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16921379 Country of ref document: EP Kind code of ref document: A1 |