WO2018086231A1 - Method and system for video sequence alignment - Google Patents

Method and system for video sequence alignment Download PDF

Info

Publication number
WO2018086231A1
WO2018086231A1 (PCT/CN2016/113542)
Authority
WO
WIPO (PCT)
Prior art keywords
video
sequence
scene
sub
scene category
Prior art date
Application number
PCT/CN2016/113542
Other languages
French (fr)
Chinese (zh)
Inventor
雷延强
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司 filed Critical 广州视源电子科技股份有限公司
Publication of WO2018086231A1 publication Critical patent/WO2018086231A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a method and system for video sequence alignment. The method comprises the following steps: extracting, from a video sequence to be aligned, a video clip without a scene change; dividing respective video frames in the video clip into a plurality of sub-blocks, and generating a video clip sequence according to the sub-blocks of the respective video frames; inputting the video clip sequence to a pre-trained scene category classifier to calculate probability values that the video clip sequence belongs to respective scene categories, and setting the scene category associated with a maximum probability value to be a first scene category to which the video clip belongs; and aligning the video clip with a video clip belonging to the first scene category in a pre-stored original video sequence.

Description

Video sequence alignment method and system

Technical field

The present invention relates to the field of signal detection technologies, and in particular to a video sequence alignment method and system.

Background art

A display device is a device that can output image or touch information. To ensure that a display device works properly, it is usually necessary to test some of its performance parameters. Taking a television as an example, the sensitivity of the TV motherboard is an important performance parameter of the TV.
The existing scheme for detecting the sensitivity of a TV motherboard is as follows: using the original video signal as a reference, the video signal to be detected is aligned with the original video signal, the signal strength of the aligned video signal is adjusted to the critical strength at which the output of the display device changes from showing no mosaic effect to showing a mosaic effect, and the performance parameters of the display device are determined from that signal strength.

However, this approach requires considerable time for video signal alignment, resulting in low signal processing efficiency.
Summary of the invention

Based on this, it is necessary to provide a video sequence alignment method and system that address the problem of low signal processing efficiency.

A video sequence alignment method includes the following steps:

capturing, from a video sequence to be aligned, a video segment without scene switching;

dividing each video frame in the video segment into several sub-blocks, and generating a video segment sequence from the sub-blocks of the video frames;

inputting the video segment sequence into a pre-trained scene category classifier, calculating the probability that the video segment sequence belongs to each scene category, and setting the scene category with the largest probability value as the first scene category to which the video segment belongs;

aligning the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.

A video sequence alignment system includes:

a video capture module, configured to capture, from a video sequence to be aligned, a video segment without scene switching;

a sequence generation module, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the video frames;

a calculation module, configured to input the video segment sequence into a pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs;

an alignment module, configured to align the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
In the above video sequence alignment method and system, a video segment without scene switching is captured from the video sequence to be aligned, each video frame in the segment is divided into several sub-blocks, a video segment sequence is generated from those sub-blocks, the probability that the sequence belongs to each scene category is calculated, the scene category with the largest probability is taken as the first scene category of the segment, and the segment is aligned with the video segments of the first scene category in the pre-stored original video sequence. By first performing a coarse alignment to find the video segments of the first scene category in the original video sequence, and then finely aligning the sequence to be aligned with those segments, the time spent on video alignment is effectively reduced and the efficiency of video alignment is improved.
Description of the drawings

FIG. 1 is a flow chart of a video sequence alignment method according to an embodiment;

FIG. 2 is a schematic diagram of an original video sequence classified by scene according to an embodiment;

FIG. 3 is a schematic structural diagram of a deep convolutional network according to an embodiment;

FIG. 4 is a schematic structural diagram of a video sequence alignment system according to an embodiment.
Detailed description

The technical solution of the present invention is described below with reference to the accompanying drawings.

As shown in FIG. 1, the present invention provides a video sequence alignment method, which may include the following steps:

S1: capture a video segment without scene switching from the video sequence to be aligned.
The length of the video segment should satisfy a certain time cost constraint, which characterizes the time the alignment operation is allowed to take. Generally, the longer the video segment, the longer the alignment process takes. To satisfy this constraint, a short segment is typically captured (for example, a segment one second long). Setting a time cost constraint improves the real-time behaviour of the alignment result, shortens the user's waiting time, and improves the user experience.
After a video segment is captured, it must be checked; if it does not meet the condition, a new segment is captured. The basic principle of the check is to keep the change between successive frames of the captured segment small, with no scene switching. The accumulated inter-frame error can be used as the criterion; the segment is considered free of scene switching when:
$$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

where f(z_i) is the feature of the i-th video frame (for example, a per-region color histogram), f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function (for example, the L2 distance), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
If the above condition is not satisfied, the video segment must be captured again. In general, a segment within one second easily satisfies the condition, so the capture is rarely repeated. The absence of scene switching means that the video content is essentially consistent, which is favourable for classification.
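As a rough illustration only, the check above can be sketched in Python. The per-region color histogram (a 4×4 grid with 8 bins per channel) and the L2 norm are assumptions standing in for the feature f(·) and the distance metric, which the text leaves open:

```python
import numpy as np

def frame_feature(frame, bins=8, grid=(4, 4)):
    """Per-region color histogram of an RGB frame (H x W x 3, uint8).
    The 4x4 grid and 8 bins per channel are assumptions; the text only
    says 'color histogram of the sub-region'."""
    h, w, _ = frame.shape
    gh, gw = grid
    feats = []
    for r in range(gh):
        for c in range(gw):
            block = frame[r * h // gh:(r + 1) * h // gh,
                          c * w // gw:(c + 1) * w // gw]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / max(block.size // 3, 1))
    return np.concatenate(feats)

def has_no_scene_switch(frames, T):
    """Accumulated inter-frame error sum_i ||f(z_i) - f(z_{i-1})||_2 < T."""
    feats = [frame_feature(f) for f in frames]
    err = sum(np.linalg.norm(feats[i] - feats[i - 1])
              for i in range(1, len(feats)))
    return err < T
```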
S2: divide each video frame in the video segment into several sub-blocks, and generate a video segment sequence from the sub-blocks of the frames.

Assume the video segment captured in step S1 is Z = [z_0, z_1, ..., z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each video frame is divided into K sub-blocks, this step generates a video segment sequence consisting of the sub-blocks z_i^k (i = 0, 1, ..., n; k = 1, ..., K), kept in temporal order, where z_i^k denotes the k-th sub-block of the i-th frame.
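A minimal sketch of this division is shown below, assuming a fixed rectangular grid; the text does not fix K or the grid shape, so the 2×2 grid here is purely illustrative:

```python
def split_into_subblocks(frame, grid=(2, 2)):
    """Divide one video frame into K = grid[0] * grid[1] sub-blocks."""
    h, w = frame.shape[:2]
    gh, gw = grid
    return [frame[r * h // gh:(r + 1) * h // gh,
                  c * w // gw:(c + 1) * w // gw]
            for r in range(gh) for c in range(gw)]

def build_segment_sequence(frames, grid=(2, 2)):
    """Video segment sequence: for every frame z_i, the list of its
    sub-blocks z_i^1 ... z_i^K, kept in temporal order."""
    return [split_into_subblocks(f, grid) for f in frames]
```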
S3: input the video segment sequence into the pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs.

The probability value can be calculated according to the following formula:
$$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, K is the total number of sub-blocks in one video frame of the sequence, and ∏ denotes a product.
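The coarse classification can then be sketched as below. `subblock_classifier` is a placeholder for the trained scene category classifier described in the following steps; summing log-probabilities is an implementation convenience that is equivalent to the product above but numerically safer:

```python
import numpy as np

def coarse_scene_category(segment_sequence, subblock_classifier, num_categories):
    """Pick the first scene category: argmax_j prod_i prod_k p(Y_j | z_i^k).
    `subblock_classifier` is a hypothetical callable returning a probability
    vector of length `num_categories` for one sub-block image."""
    log_p = np.zeros(num_categories)
    for subblocks in segment_sequence:          # one frame z_i
        for block in subblocks:                 # one sub-block z_i^k
            p = np.asarray(subblock_classifier(block))
            log_p += np.log(np.clip(p, 1e-12, 1.0))
    return int(np.argmax(log_p)), log_p
```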
The scene category classifier can be trained in advance, before the alignment operation is performed. Training the scene classifier may include the following steps:

Step 1: obtain video sequence samples and divide them by scene into multiple scene categories.

Within a video sequence, adjacent images are extremely similar as long as the scene does not switch. The video sequence samples can therefore be divided by scene into relatively coarse categories while preserving their temporal order. During coarse positioning it is only necessary to determine which category the current video segment most resembles. The classification is described as follows:
Let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample. It is divided into multiple categories by scene, as shown in FIG. 2, where Y_l is the l-th video segment in the sample and each video segment includes several video frames.
Scene boundaries can be annotated in advance and the scenes divided according to the annotations (an original video sequence is typically 20-30 minutes long, so the amount of annotation is small and it is a one-off task), or the scenes can be classified automatically using the typical accumulated inter-frame error:

$$d(Y)=\sum_{i}\left\|f(y_i)-f(y_{i-1})\right\|$$

where f(y_i) is the feature representation of the i-th video frame (for example, a per-region color histogram) and ||·|| is a distance metric function. If d(Y) is smaller than a set threshold, the current adjacent images are assigned to the same category; the division process is then repeated on the remaining, not yet divided part of the sequence.
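A sketch of the automatic division is given below. The feature function and threshold are assumptions, and resetting the accumulated error at each scene boundary is one plausible reading of the repeated division described above:

```python
import numpy as np

def split_sample_by_scene(frames, threshold, feature_fn):
    """Group consecutive frames of the sample sequence Y into scene
    categories: a new category is opened once the accumulated inter-frame
    error d(Y) of the current group reaches `threshold`.  `feature_fn`
    stands in for f(.), e.g. a per-region color histogram."""
    categories, current = [], [frames[0]]
    acc, prev = 0.0, feature_fn(frames[0])
    for frame in frames[1:]:
        cur = feature_fn(frame)
        acc += np.linalg.norm(cur - prev)
        if acc < threshold:
            current.append(frame)
        else:
            categories.append(current)   # close the current scene category
            current, acc = [frame], 0.0
        prev = cur
    categories.append(current)
    return categories
```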
Step 2: divide the video sequence samples of each scene category into several sample sub-blocks, where the samples include non-overlapping sub-blocks.

For every image sample in every scene category, a non-overlapping sub-block division is performed (overlapping divisions are also possible, but they should include the non-overlapping division as a special case) to build finer small images (for example 256×256; if the image size is not an integer multiple of 256, the right-most division may overlap), which are then used to train the deep convolutional network. As a general learning strategy, the more samples the better; the non-overlapping division uses the minimum number of samples, with no overlap between them, while an overlapping division must include the non-overlapping case as a special instance or generality is lost. The benefits are: 1) the number of samples increases, which helps train the deep convolutional network; 2) the sample images become smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers the complexity. For example, each original sample image y_i, after sub-block division, yields K+1 sub-block images.
Step 3: train the deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
The deep convolutional network is trained with the collected scene category sample images (that is, the sub-block images) and their labels (the scene category of each sub-block image) to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is passed through a ReLU (Rectified Linear Units) activation function for a non-linear transformation and then through a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
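A sketch of such a network in PyTorch is shown below; the text names only the layer types (five convolutions with ReLU and pooling, two fully-connected layers, softmax), so the channel counts, kernel sizes, and the 256×256 input size are assumptions:

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Sketch of the classifier described above: five convolutional layers,
    each followed by ReLU and pooling, then two fully-connected layers and
    a softmax over scene categories.  Channel counts, kernel sizes and the
    256x256 input size are assumptions, not values from the text."""
    def __init__(self, num_categories, in_size=256):
        super().__init__()
        chans = [3, 32, 64, 128, 128, 256]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        feat_size = in_size // 2 ** 5            # five 2x poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(chans[-1] * feat_size * feat_size, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_categories))

    def forward(self, x):
        # Softmax turns the logits into p(Y_j | sub-block image).
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

During training one would usually feed the logits (without the final softmax) to a standard cross-entropy loss; the softmax output corresponds to the per-sub-block probability p(Y_j/z_i^k) used at inference time.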
S4: align the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence.

Step S3 has already coarsely located the category Y_J = [y_u, y_{u+1}, ..., y_v] to which the current video segment Z = [z_0, z_1, ..., z_n] belongs. This step precisely locates the position of the current segment within it. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], and the exact alignment is then computed as

$$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
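A direct sketch of this search is given below; it assumes the per-frame distance d(·, ·) is computed on pre-extracted frame features (for example the same histograms used for the scene-switch check), which the text does not mandate:

```python
import numpy as np

def fine_align(segment_feats, reference_feats, u, v, n):
    """Find Q = argmin_{q in [u-n, v]} sum_i d(z_i, y_{q+i}).
    `segment_feats` holds features of z_0..z_n, `reference_feats` features
    of the whole original sequence; [u, v] is the coarsely located category
    and the search range is padded by n frames as described above."""
    best_q, best_cost = None, float("inf")
    for q in range(u - n, v + 1):
        if q < 0 or q + n >= len(reference_feats):
            continue                       # stay inside the stored sequence
        cost = sum(np.linalg.norm(segment_feats[i] - reference_feats[q + i])
                   for i in range(n + 1))
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q, best_cost
```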
The above video sequence alignment method uses a coarse-to-fine search strategy: a coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the sequence to be aligned is then finely aligned with those segments, which effectively reduces the time spent on video alignment and improves the efficiency of video alignment.
As shown in FIG. 4, the present invention provides a video sequence alignment system, which may include:
a video capture module 10, configured to capture a video segment without scene switching from the video sequence to be aligned.

The length of the video segment should satisfy a certain time cost constraint, which characterizes the time the alignment operation is allowed to take. Generally, the longer the video segment, the longer the alignment process takes. To satisfy this constraint, a short segment is typically captured (for example, a segment one second long). Setting the time cost constraint improves the real-time behaviour of the alignment result, shortens the user's waiting time, and improves the user experience.
After a video segment is captured, it must be checked; if it does not meet the condition, a new segment is captured. The basic principle of the check is to keep the change between successive frames of the captured segment small, with no scene switching. A decision module can be provided that uses the accumulated inter-frame error as the criterion:

$$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

where f(z_i) is the feature of the i-th video frame (for example, a per-region color histogram), f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function (for example, the L2 distance), T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned. The absence of scene switching means that the video content is essentially consistent, which is favourable for classification.
If the above condition is not satisfied, the video segment must be captured again. In general, a segment within one second easily satisfies the condition, so the capture is rarely repeated.
a sequence generation module 20, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the frames.

Assume the video segment captured by the video capture module 10 is Z = [z_0, z_1, ..., z_n], where z_i (i = 1, 2, ..., n) is the i-th video frame. If each frame is divided into K sub-blocks, the sequence generation module 20 generates the video segment sequence consisting of the sub-blocks z_i^k (i = 0, 1, ..., n; k = 1, ..., K), kept in temporal order.
a calculation module 30, configured to input the video segment sequence into the pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs.

The probability value can be calculated according to the following formula:

$$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, K is the total number of sub-blocks in one video frame of the sequence, and ∏ denotes a product.
The scene category classifier can be pre-trained before the alignment operation is performed. The video sequence alignment system may further include:

a classification module, configured to obtain video sequence samples and divide them by scene into multiple scene categories.

Within a video sequence, adjacent images are extremely similar as long as the scene does not switch. The video sequence samples can therefore be divided by scene into relatively coarse categories while preserving their temporal order. During coarse positioning it is only necessary to determine which category the current video segment most resembles. The classification is described as follows:
Let the video sequence sample be Y = [y_1, y_2, ..., y_m], where m is the total number of video frames in the sample. It is divided into multiple categories by scene, as shown in FIG. 2, where Y_l is the l-th video segment in the sample and each video segment includes several video frames.
Scene boundaries can be annotated in advance and the scenes divided according to the annotations (an original video sequence is typically 20-30 minutes long, so the amount of annotation is small and it is a one-off task), or the scenes can be classified automatically using the typical accumulated inter-frame error:

$$d(Y)=\sum_{i}\left\|f(y_i)-f(y_{i-1})\right\|$$

where f(y_i) is the feature representation of the i-th video frame (for example, a per-region color histogram) and ||·|| is a distance metric function. If d(Y) is smaller than a set threshold, the current adjacent images are assigned to the same category; the division process is then repeated on the remaining, not yet divided part of the sequence.
a sub-block division module, configured to divide the video sequence samples of each scene category into several sample sub-blocks, where the samples include non-overlapping sub-blocks.

For every image sample in every scene category, a non-overlapping sub-block division is performed (overlapping divisions are also possible, but they should include the non-overlapping division as a special case) to build finer small images (for example 256×256; if the image size is not an integer multiple of 256, the right-most division may overlap), which are then used to train the deep convolutional network. As a general learning strategy, the more samples the better; the non-overlapping division uses the minimum number of samples, with no overlap between them, while an overlapping division must include the non-overlapping case as a special instance or generality is lost. The benefits are: 1) the number of samples increases, which helps train the deep convolutional network; 2) the sample images become smaller, which effectively reduces the number of fully-connected layers in the deep neural network and lowers the complexity. For example, each original sample image y_i, after sub-block division, yields K+1 sub-block images.
a training module, configured to train the deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.

The deep convolutional network is trained with the collected scene category sample images (that is, the sub-block images) and their labels (the scene category of each sub-block image) to obtain the classifier, as shown in FIG. 3. The deep convolutional network used in the present invention comprises five convolutional layers; the output of each convolutional layer is passed through a ReLU (Rectified Linear Units) activation function for a non-linear transformation and then through a pooling layer, followed by two fully-connected layers, and finally the Softmax function outputs the classification probability (the probability that the input sub-block image belongs to a given scene category).
an alignment module 40, configured to align the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence.

The calculation module 30 has already coarsely located the category Y_J = [y_u, y_{u+1}, ..., y_v] to which the current video segment Z = [z_0, z_1, ..., z_n] belongs. The alignment module 40 precisely locates the position of the current segment within it. To avoid boundary problems, Y_J can be extended on both sides to Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], and the exact alignment is then computed as

$$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
The above video sequence alignment system uses a coarse-to-fine search strategy: a coarse alignment first finds the video segments belonging to the first scene category in the original video sequence, and the sequence to be aligned is then finely aligned with those segments, which effectively reduces the time spent on video alignment and improves the efficiency of video alignment.
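To make the cooperation of the four modules concrete, a hypothetical end-to-end sketch is given below. It simply chains the earlier helper sketches (`has_no_scene_switch`, `build_segment_sequence`, `coarse_scene_category`, `frame_feature`, `fine_align`); the `scene_index` mapping from a category id to the (u, v) frame range of the original sequence is also an assumption about how the pre-stored sequence is organised:

```python
def align_video(frames_to_align, original_feats, scene_index, classifier, T, n):
    """End-to-end sketch of how the four modules cooperate; all helper
    functions are the hypothetical sketches defined earlier."""
    clip = frames_to_align[:n + 1]                      # video capture module
    if not has_no_scene_switch(clip, T):
        raise ValueError("scene switch detected, capture another clip")
    sequence = build_segment_sequence(clip)             # sequence generation module
    category, _ = coarse_scene_category(sequence, classifier,
                                        num_categories=len(scene_index))
    u, v = scene_index[category]                        # calculation module output
    clip_feats = [frame_feature(f) for f in clip]       # alignment module
    return fine_align(clip_feats, original_feats, u, v, n)
```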
The video sequence alignment system of the present invention corresponds one-to-one with the video sequence alignment method of the present invention; the technical features and beneficial effects described in the embodiments of the video sequence alignment method apply equally to the embodiments of the video sequence alignment system, which is hereby stated.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments merely express several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention patent shall be subject to the appended claims.

Claims (10)

  1. A video sequence alignment method, characterized in that it comprises the following steps:

    capturing, from a video sequence to be aligned, a video segment without scene switching;

    dividing each video frame in the video segment into several sub-blocks, and generating a video segment sequence from the sub-blocks of the video frames;

    inputting the video segment sequence into a pre-trained scene category classifier, calculating the probability that the video segment sequence belongs to each scene category, and setting the scene category with the largest probability value as the first scene category to which the video segment belongs;

    aligning the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
  2. The video sequence alignment method according to claim 1, characterized in that, before the video sequence is input into the pre-trained scene category classifier, the method further comprises the following steps:

    obtaining video sequence samples, and dividing the video sequence samples by scene into multiple scene categories;

    dividing the video sequence samples of each scene category into several sample sub-blocks, wherein the video sequence samples include non-overlapping sample sub-blocks;

    training a deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
  3. The video sequence alignment method according to claim 1, characterized by further comprising the following step:

    determining that the video segment has no scene switching if the video segment satisfies the following condition:

    $$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

    where f(z_i) is the feature of the i-th video frame, f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function, T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
  4. The video sequence alignment method according to claim 1, characterized in that the step of calculating the probability that the video segment sequence belongs to each scene category comprises:

    calculating the probability that the video segment sequence belongs to each scene category according to the following formula:

    $$p(Y_j/Z)=\prod_{i}\prod_{k=1}^{K}p\!\left(Y_j/z_i^{k}\right)$$

    where z_i^k denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence, p(Y_j/z_i^k) is the probability that the sub-block z_i^k belongs to the j-th scene category, p(Y_j/Z) is the probability that the video segment sequence belongs to the j-th scene category, and K is the total number of sub-blocks in one video frame of the video segment sequence.
  5. The video sequence alignment method according to claim 1, characterized in that the step of aligning the video segment with the video segments belonging to the first scene category in the pre-stored original video sequence comprises:

    aligning the video segment with the video segments belonging to the first scene category in the original video sequence according to the following formula:

    $$Q=\arg\min_{q\in[u-n,\,v]}\sum_{i=0}^{n} d\!\left(z_i,\;y_{q+i}\right)$$

    where Y_J = [y_{u-n}, y_{u-n+1}, ..., y_{v+n}], Q is the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, ..., y_v] is the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, ..., n) are the frames before the start of Y_j, y_{v+i} (i = 1, 2, ..., n) are the frames after its end, n is a positive integer, and q ∈ [u-n, v].
  6. A video sequence alignment system, characterized by comprising:

    a video capture module, configured to capture, from a video sequence to be aligned, a video segment without scene switching;

    a sequence generation module, configured to divide each video frame in the video segment into several sub-blocks and generate a video segment sequence from the sub-blocks of the video frames;

    a calculation module, configured to input the video segment sequence into a pre-trained scene category classifier, calculate the probability that the video segment sequence belongs to each scene category, and set the scene category with the largest probability value as the first scene category to which the video segment belongs;

    an alignment module, configured to align the video segment with the video segments belonging to the first scene category in a pre-stored original video sequence.
  7. The video sequence alignment system according to claim 6, characterized by further comprising:

    a classification module, configured to obtain video sequence samples and divide the video sequence samples by scene into multiple scene categories;

    a sub-block division module, configured to divide the video sequence samples of each scene category into several sample sub-blocks, wherein the video sequence samples include non-overlapping sample sub-blocks;

    a training module, configured to train a deep convolutional network with the sample sub-blocks and their scene categories to obtain the scene category classifier.
  8. The video sequence alignment system according to claim 6, characterized by further comprising:

    a determination module, configured to determine that the video segment has no scene switching if the video segment satisfies the following condition:

    $$\sum_{i=1}^{n}\left\|f(z_i)-f(z_{i-1})\right\| < T$$

    where f(z_i) is the feature of the i-th video frame, f(z_{i-1}) is the feature of the (i-1)-th video frame, ||·|| is a distance metric function, T is a preset distance threshold, and n is the total number of video segments in the video sequence to be aligned.
  9. The video sequence alignment system according to claim 6, wherein the calculation module further calculates the probability values of the video segment sequence belonging to each scene category according to the following formula:
    Figure PCTCN2016113542-appb-100008
    where
    Figure PCTCN2016113542-appb-100009
    denotes the k-th sub-block in the i-th video frame of the video segment sequence, Y_j denotes the video segment belonging to the j-th scene category in the original video sequence,
    Figure PCTCN2016113542-appb-100010
    is the probability value that the sub-block
    Figure PCTCN2016113542-appb-100011
    belongs to the j-th scene category, p(Y_j/Z) is the probability value that the video segment sequence belongs to the j-th scene category, and K is the total number of sub-blocks in one video frame of the video segment sequence.
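One plausible reading of the combination in claim 9 is averaging: per-sub-block probabilities from the scene category classifier are pooled over all sub-blocks of all frames to give the segment-level probability of each scene category. The averaging rule and the block_classifier interface below are assumptions; the exact combination is the formula referenced as Figure PCTCN2016113542-appb-100008.

```python
import numpy as np

def segment_scene_probabilities(segment_sequence, block_classifier, num_classes):
    """Pool per-sub-block probability vectors into segment-level probabilities
    p(Y_j/Z) by simple averaging over all n * K sub-blocks."""
    total = np.zeros(num_classes)
    count = 0
    for frame_blocks in segment_sequence:      # one list of K sub-blocks per frame
        for block in frame_blocks:
            total += block_classifier(block)   # probability vector over scene categories
            count += 1
    return total / count
```

The first scene category of claim 6 would then be the index of the largest entry of the returned vector.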
  10. The video sequence alignment system according to claim 6, wherein the alignment module further aligns the video segment with the video segment belonging to the first scene category in the original video sequence according to the following formula:
    Figure PCTCN2016113542-appb-100012
    where Y_J = [y_{u-n}, y_{u-n+1}, …, y_{v+n}];
    in which Q denotes the best alignment position of the video segment with the original video sequence, d(·) is a distance metric function, Z is the video segment, z_i is the i-th video frame in Z, Y_j = [y_u, y_{u+1}, …, y_v] denotes the video segment belonging to the j-th scene category in the original video sequence, y_i is the i-th video frame in Y_j, y_{u-i} (i = 1, 2, …, n) is the video frame at the i-th instant before y_0, y_{v+i} (i = 1, 2, …, n) is the video frame at the i-th instant after y_n, n is a positive integer, and q ∈ [u-n, v].
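Finally, the alignment of claim 10 (and the corresponding method claim above) can be read as an exhaustive search over candidate offsets q: the captured segment Z is slid over the reference segment Y_J, i.e. the first-scene-category segment extended by n frames on each side, and the offset minimising the summed frame distance is kept. The mean-absolute-difference metric and the brute-force search below are illustrative choices only, not the patent's prescribed ones.

```python
import numpy as np

def frame_distance(a, b):
    """d(., .): mean absolute pixel difference between two frames, used as a
    stand-in for the unspecified distance metric."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))

def best_alignment(segment, reference):
    """Return the offset Q of `segment` (Z) inside `reference` (Y_J) that
    minimises the summed frame distance over all valid offsets."""
    m, r = len(segment), len(reference)
    best_q, best_cost = 0, float("inf")
    for q in range(r - m + 1):
        cost = sum(frame_distance(z, reference[q + i]) for i, z in enumerate(segment))
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q
```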
PCT/CN2016/113542 2016-11-09 2016-12-30 Method and system for video sequence alignment WO2018086231A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610986953.9 2016-11-09
CN201610986953.9A CN106612457B (en) 2016-11-09 2016-11-09 Video sequence alignment schemes and system

Publications (1)

Publication Number Publication Date
WO2018086231A1 true WO2018086231A1 (en) 2018-05-17

Family

ID=58614979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113542 WO2018086231A1 (en) 2016-11-09 2016-12-30 Method and system for video sequence alignment

Country Status (2)

Country Link
CN (1) CN106612457B (en)
WO (1) WO2018086231A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194419A (en) * 2017-05-10 2017-09-22 百度在线网络技术(北京)有限公司 Video classification methods and device, computer equipment and computer-readable recording medium
CN108537134B (en) * 2018-03-16 2020-06-30 北京交通大学 Video semantic scene segmentation and labeling method
CN108682436B (en) * 2018-05-11 2020-06-23 北京海天瑞声科技股份有限公司 Voice alignment method and device
CN110147700B (en) * 2018-05-18 2023-06-27 腾讯科技(深圳)有限公司 Video classification method, device, storage medium and equipment
CN111723617B (en) * 2019-03-20 2023-10-27 顺丰科技有限公司 Method, device, equipment and storage medium for identifying actions
CN110347875B (en) * 2019-07-08 2022-04-15 北京字节跳动网络技术有限公司 Video scene classification method and device, mobile terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310734A1 (en) * 2007-06-18 2008-12-18 The Regents Of The University Of California High speed video action recognition and localization
CN101692269A (en) * 2009-10-16 2010-04-07 北京中星微电子有限公司 Method and device for processing video programs
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN105704485A (en) * 2016-02-02 2016-06-22 广州视源电子科技股份有限公司 Display device performance parameter detection method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680342B2 (en) * 2004-08-16 2010-03-16 Fotonation Vision Limited Indoor/outdoor classification in digital images
JP2008234623A (en) * 2007-02-19 2008-10-02 Seiko Epson Corp Category classification apparatus and method, and program
CN101814147B (en) * 2010-04-12 2012-04-25 中国科学院自动化研究所 Method for realizing classification of scene images
CN103366181A (en) * 2013-06-28 2013-10-23 安科智慧城市技术(中国)有限公司 Method and device for identifying scene integrated by multi-feature vision codebook
CN104881675A (en) * 2015-05-04 2015-09-02 北京奇艺世纪科技有限公司 Video scene identification method and apparatus
CN105227907B (en) * 2015-08-31 2018-07-27 电子科技大学 Unsupervised anomalous event real-time detection method based on video
CN105550699B (en) * 2015-12-08 2019-02-12 北京工业大学 A kind of video identification classification method based on CNN fusion space-time remarkable information
CN105847964A (en) * 2016-03-28 2016-08-10 乐视控股(北京)有限公司 Movie and television program processing method and movie and television program processing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310734A1 (en) * 2007-06-18 2008-12-18 The Regents Of The University Of California High speed video action recognition and localization
CN101692269A (en) * 2009-10-16 2010-04-07 北京中星微电子有限公司 Method and device for processing video programs
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN105704485A (en) * 2016-02-02 2016-06-22 广州视源电子科技股份有限公司 Display device performance parameter detection method and system

Also Published As

Publication number Publication date
CN106612457B (en) 2019-09-03
CN106612457A (en) 2017-05-03

Similar Documents

Publication Publication Date Title
WO2018086231A1 (en) Method and system for video sequence alignment
Masood et al. License plate detection and recognition using deeply learned convolutional neural networks
CN107316007B (en) Monitoring image multi-class object detection and identification method based on deep learning
CN110609920B (en) Pedestrian hybrid search method and system in video monitoring scene
CN104778474B (en) A kind of classifier construction method and object detection method for target detection
US8929595B2 (en) Dictionary creation using image similarity
CN109829467A (en) Image labeling method, electronic device and non-transient computer-readable storage medium
WO2022121766A1 (en) Method and apparatus for detecting free space
CN106778736B (en) Robust license plate recognition method and system
CN109859164A (en) A method of by Quick-type convolutional neural networks to PCBA appearance test
CN112232237B (en) Method, system, computer device and storage medium for monitoring vehicle flow
WO2014082480A1 (en) Method and device for calculating number of pedestrians and crowd movement directions
CN108052931A (en) A kind of license plate recognition result fusion method and device
CN104200218B (en) A kind of across visual angle action identification method and system based on timing information
CN111241987B (en) Multi-target model visual tracking method based on cost-sensitive three-branch decision
CN110222627A (en) A kind of face amended record method
CN106572387A (en) Video sequence alignment method and video sequence alignment system
CN102298695B (en) Visual analyzing and processing method for detecting paper money bundle
CN113269038B (en) Multi-scale-based pedestrian detection method
CN112232236B (en) Pedestrian flow monitoring method, system, computer equipment and storage medium
Lin et al. A traffic sign recognition method based on deep visual feature
CN109325467A (en) A kind of wireless vehicle tracking based on video detection result
CN103310088A (en) Automatic detecting method of abnormal illumination power consumption
CN109325487B (en) Full-category license plate recognition method based on target detection
CN116469164A (en) Human gesture recognition man-machine interaction method and system based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16921379

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.10.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16921379

Country of ref document: EP

Kind code of ref document: A1