TW201907736A - Method and device for generating video summary

Method and device for generating video summary

Info

Publication number
TW201907736A
Authority
TW
Taiwan
Prior art keywords
frame
scene switching
video
scene
similarity
Prior art date
Application number
TW107103624A
Other languages
Chinese (zh)
Other versions
TWI712316B (en)
Inventor
葛雷鳴
Original Assignee
中國商優酷網絡技術(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 中國商優酷網絡技術(北京)有限公司
Publication of TW201907736A
Application granted
Publication of TWI712316B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8549 Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present application disclose a method and device for generating a video summary, where the video has text description information. The method comprises: extracting a plurality of scene switching frames from the video and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic label corresponding to the video from the text description information; and selecting target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic label, and generating a video summary of the video based on the target frames. With the technical solution provided by the present application, efficiency can be improved while the theme of the video is characterized precisely.

Description

Method and device for generating video summary

The present application relates to the field of Internet technologies, and in particular, to a method and an apparatus for generating a video summary.

Currently, in order to let users learn the content of a video in a short time, video playback platforms usually create a corresponding video summary for each uploaded video. The video summary may be a short video containing a subset of the scenes in the original video. In this way, by watching the video summary, the user can quickly get a rough idea of the original video's content.

At present, one way to create a video summary is manual editing: staff of the video playback platform watch the entire video and then clip out the more important segments to form the video summary of that video. A video summary produced this way can characterize the information contained in the video fairly accurately, but as the number of videos grows rapidly, this approach consumes considerable manpower and the production of video summaries is quite slow.

In view of this, in order to save manpower and improve the efficiency of producing video summaries, video summaries are currently usually produced with image recognition techniques. Specifically, the uploaded video can be sampled at fixed time intervals to extract multiple image frames from the video. The similarity between each pair of adjacent frames can then be computed in turn, and frames whose mutual similarity is relatively low can be retained, thereby ensuring that the retained image frames cover the content of multiple scenes. The finally retained image frames then constitute the video summary of the video.

Although the prior-art method of producing a video summary through image recognition can improve production efficiency, selecting the image frames of the summary by fixed sampling and similarity comparison easily misses key scenes in the video, so the generated video summary cannot accurately reflect the theme of the video.

An object of the embodiments of the present application is to provide a method and a device for generating a video summary that can accurately characterize the theme of the video while improving efficiency.

To achieve the above objective, an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic label corresponding to the video from the text description information; and selecting target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic label, and generating a video summary of the video based on the target frames.

To achieve the above objective, an embodiment of the present application further provides a device for generating a video summary, where the video has text description information. The device includes: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extraction unit, configured to extract a topic label corresponding to the video from the text description information; and a video summary generation unit, configured to select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic label, and to generate a video summary of the video based on the target frames.

As can be seen from the above, the present application first extracts from the video scene switching frames whose similarity satisfies a specified condition and sets a corresponding scene label for each scene switching frame. The topic label of the video can then be determined from the video's text description information; this topic label can accurately characterize the theme of the video. Next, by determining the association between the scene labels and the topic label, the target frames most closely related to the theme can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject matter of the video.

100‧‧‧Scene switching frame extraction unit

200‧‧‧Topic label extraction unit

300‧‧‧Video summary generation unit

S1, S3, S5‧‧‧Steps

In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments described in the present application, and a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of a method for generating a video summary according to an embodiment of the present application; FIG. 2 is a schematic diagram of target frames and scene switching frames according to an embodiment of the present application; FIG. 3 is a schematic diagram of extracting scene switching frames according to an embodiment of the present application; FIG. 4 is a schematic diagram of extracting scene labels according to an embodiment of the present application; FIG. 5 is a functional block diagram of a device for generating a video summary according to an embodiment of the present application.

In order to enable a person skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.

The present application provides a method for generating a video summary, which can be applied to an electronic device having a data processing function. The electronic device may be, for example, a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, or a television with network access. The method may also be applied to software running on such electronic devices, for example software with a video production function or a video playback function. In addition, the method may be applied to a server of a video playback website, such as iQiyi, Sohu Video, or Acfun. The number of servers is not specifically limited in this embodiment: there may be one server, several servers, or a server cluster formed by several servers.

In this embodiment, the video summary may be generated based on a video. The video may be a video stored locally by the user or a video uploaded by the user to a video playback website. The video usually has text description information. The text description information may be the title of the video or a synopsis of the video. The title and the synopsis may be edited in advance by the video creator or the uploader, or may be added by staff reviewing the video; the present application does not limit this. Of course, in practical applications, in addition to the title and the synopsis, the text description information may also include text tags of the video or descriptive phrases extracted from the video's bullet-screen comments.

Referring to FIG. 1 and FIG. 2, the method for generating a video summary provided by the present application may include the following steps.

S1: Extract a plurality of scene switching frames from the video, and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition.

In this embodiment, the video may be a video stored locally or a video stored on another device. Accordingly, the video may be obtained by loading it locally from a specified path or by downloading it according to a Uniform Resource Locator (URL) provided by another device.

In this embodiment, after the video is obtained, each frame of the video may be analyzed to extract a plurality of scene switching frames. In order to obtain the scene switching frame corresponding to each scene of the video, the extraction may be performed by frame-by-frame comparison. Specifically, a reference frame may first be determined in the video, and the similarity between each frame after the reference frame and the reference frame may then be computed in turn.

In this embodiment, the reference frame may be a frame randomly designated within a certain range. For example, the reference frame may be a frame randomly selected from the first two minutes of the video. Of course, in order not to miss any scene in the video, the first frame of the video may be used as the reference frame.

In this embodiment, after the reference frame is determined, each frame after the reference frame may be compared with the reference frame in turn, so as to compute the similarity between each subsequent frame and the reference frame. Specifically, when computing the similarity between a frame and the reference frame, a first feature vector of the reference frame and a second feature vector of the current frame may be extracted respectively.

In this embodiment, the first feature vector and the second feature vector may take various forms. For example, the feature vector of a frame may be constructed from the pixel values of the pixels in that frame. Each frame usually consists of a number of pixels arranged in a certain order, each pixel having its own pixel value, which together form the picture. A pixel value may be a number within a specified range, for example any value from 0 to 255, where the magnitude indicates the depth of the color. In this embodiment, the pixel value of each pixel in a frame may be obtained, and the obtained pixel values may form the feature vector of the frame. For example, for a current frame with 9*9=81 pixels, the pixel values of the pixels may be read in turn and arranged from left to right and top to bottom, forming an 81-dimensional vector that can serve as the feature vector of the current frame.
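
As an illustration only (not part of the patent text), the pixel-value feature described above can be sketched in a few lines of Python; the code assumes a grayscale frame held in a NumPy array, and the 9×9 size is just the example from the paragraph.

```python
import numpy as np

def pixel_feature_vector(frame: np.ndarray) -> np.ndarray:
    """Flatten a frame's pixel values row by row (left to right, top to bottom)
    into a single feature vector, as in the 9x9 example above."""
    return frame.reshape(-1).astype(np.float32)

# Example: a hypothetical 9x9 grayscale frame yields an 81-dimensional vector.
frame = np.random.randint(0, 256, size=(9, 9), dtype=np.uint8)
vec = pixel_feature_vector(frame)
assert vec.shape == (81,)
```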

In this embodiment, the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame. Specifically, the reference frame and each frame after the reference frame may be fed into a convolutional neural network, and the convolutional neural network then outputs the feature vectors corresponding to the reference frame and to each of the other frames.

In this embodiment, in order to accurately characterize the content shown in the reference frame and the current frame, the first feature vector and the second feature vector may also represent scale-invariant features of the reference frame and the current frame respectively. In this way, even if the rotation angle, brightness, or shooting angle of the image changes, the extracted first and second feature vectors can still represent the content of the reference frame and the current frame well. Specifically, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, or color histogram features.
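
For a concrete reference point, the sketch below shows how such scale-invariant and color-histogram features could be obtained with OpenCV; the library choice and function calls (`cv2.SIFT_create`, `cv2.calcHist`) are an assumption of this illustration, not something specified by the application.

```python
import cv2
import numpy as np

def sift_descriptors(frame_bgr: np.ndarray):
    """Scale-invariant (SIFT) descriptors of a frame (OpenCV >= 4.4)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors  # one 128-dim descriptor per keypoint; may be None if no keypoints

def color_histogram(frame_bgr: np.ndarray, bins: int = 32) -> np.ndarray:
    """Concatenated per-channel color histogram, L1-normalised."""
    hists = [cv2.calcHist([frame_bgr], [c], None, [bins], [0, 256]) for c in range(3)]
    hist = np.concatenate(hists).flatten()
    return hist / (hist.sum() + 1e-9)
```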

In this embodiment, after the first feature vector and the second feature vector are determined, the similarity between them can be computed. Specifically, the similarity can be expressed in vector space as the distance between the two vectors: the closer the distance, the more similar the two vectors and the higher the similarity; the farther the distance, the greater the difference between the two vectors and the lower the similarity. Therefore, when computing the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector may be computed, and the reciprocal of that spatial distance may be taken as the similarity between the reference frame and the current frame. In this way, the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current frame are more alike; conversely, the larger the spatial distance, the smaller the corresponding similarity, indicating that the reference frame and the current frame are less alike.
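
A minimal sketch of this reciprocal-distance similarity, assuming Euclidean distance is used as the spatial distance:

```python
import numpy as np

def frame_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Similarity between two frames, taken as the reciprocal of the
    Euclidean distance between their feature vectors."""
    distance = np.linalg.norm(feat_a - feat_b)
    return float("inf") if distance == 0 else 1.0 / distance
```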

In this embodiment, the similarity between each frame after the reference frame and the reference frame can be computed in turn in the above manner. The content shown in two frames with high similarity is usually quite alike, whereas the purpose of a video summary is to show the user the content of the different scenes in the video. Therefore, in this embodiment, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame may be determined as a scene switching frame. The specified threshold may be a preset value that can be adjusted flexibly according to the actual situation. For example, when too many scene switching frames are selected under the threshold, the threshold may be reduced appropriately; conversely, when too few scene switching frames are selected, the threshold may be increased appropriately. In this embodiment, a similarity less than or equal to the specified threshold indicates that the content of the two frames is already clearly different, so the scene shown in the current frame can be considered to have changed relative to the scene shown in the reference frame. At this point, the current frame can be retained as a frame at which the scene switches.

In this embodiment, after the current frame is determined as a scene switching frame, the subsequent scene switching frames can be determined. Specifically, from the reference frame to the current frame, the scene can be considered to have changed once, so the current scene is the content shown by the current frame. On this basis, the current frame may be used as a new reference frame, and the similarity between each frame after the new reference frame and the new reference frame may be computed in turn, so as to determine the next scene switching frame according to the computed similarities. Likewise, when determining the next scene switching frame, the similarity between two frames can still be determined by extracting feature vectors and computing the spatial distance, and the resulting similarity can again be compared against the specified threshold, thereby finding the next scene switching frame at which the scene changes again after the new reference frame.

Referring to FIG. 3, in this embodiment, after the next scene switching frame is determined, that scene switching frame may in turn be used as the new reference frame and the extraction of subsequent scene switching frames continues. In this way, by changing the reference frame in turn, every frame at which the scene changes in the video can be extracted, so that no scene shown in the video is missed and the completeness of the video summary is ensured. In FIG. 3, the rectangles filled with diagonal lines represent scene switching frames, and the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold.
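
Putting the pieces together, the following sketch illustrates the reference-frame update loop described above. The `extract_feature` callable stands in for any of the feature extractors discussed earlier, and the threshold value is illustrative; none of the names are mandated by the patent.

```python
import numpy as np

def extract_scene_switch_frames(frames, extract_feature, threshold: float):
    """Walk through the video frame by frame; whenever the similarity to the
    current reference frame drops to or below the threshold, record the frame
    as a scene switching frame and make it the new reference frame."""
    if not frames:
        return []
    switch_indices = []
    ref_feat = extract_feature(frames[0])          # first frame as initial reference
    for idx in range(1, len(frames)):
        cur_feat = extract_feature(frames[idx])
        distance = np.linalg.norm(ref_feat - cur_feat)
        similarity = float("inf") if distance == 0 else 1.0 / distance
        if similarity <= threshold:                # scene has visibly changed
            switch_indices.append(idx)
            ref_feat = cur_feat                    # current frame becomes the new reference
    return switch_indices
```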

In this embodiment, among the scene switching frames extracted in the above manner, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold. Therefore, "the similarity between two adjacent scene switching frames satisfies a specified condition" can mean that the similarity between two adjacent scene switching frames is less than or equal to the specified threshold.

In this embodiment, after the plurality of scene switching frames are extracted, a scene label may be set for each scene switching frame. The scene label may be a text label that characterizes the content shown in the scene switching frame. For example, if a scene switching frame shows two people fighting, the scene label of that scene switching frame may be "martial arts", "fighting", or "kung fu".

In this embodiment, the content of a scene switching frame may be recognized to determine the scene label corresponding to that frame. Specifically, features of the scene switching frame may be extracted, where the features may include at least one of a color feature, a texture feature, and a shape feature. The color feature may be extracted in different color spaces, for example the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space, or the HSI (Hue, Saturation, Intensity) space. Each color space has several color components; for example, the RGB space has an R component, a G component, and a B component. Different pictures have different color components, so the color components can be used to characterize the features of a scene switching frame.

In addition, the texture feature may be used to describe the material shown in the scene switching frame. Texture features are usually reflected by the distribution of gray levels and correspond to the low-frequency and high-frequency components of the image spectrum. Thus, the low-frequency and high-frequency components of the image contained in the scene switching frame can serve as features of the scene switching frame.

In this embodiment, the shape features may include edge-based shape features and region-based shape features. Specifically, a Fourier-transformed boundary may be used as the edge-based shape feature, and invariant moment descriptors may be used as the region-based shape feature.

Referring to FIG. 4, in this embodiment, after the features of each scene switching frame are extracted, the extracted features may be compared with the feature samples in a feature sample library. The feature sample library may be a sample set summarized from historical image recognition data and may contain feature samples characterizing different kinds of content; a feature sample may likewise be at least one of the color, texture, and shape features described above. For example, the feature sample library may contain feature samples characterizing playing football, feature samples characterizing dancing, feature samples characterizing fighting, and so on. Specifically, each feature sample in the feature sample library may be associated with a text label, and the text label may be used to describe the content corresponding to that feature sample. For example, the text label associated with a feature sample characterizing playing football may be "playing football", and the text label of a feature sample characterizing dancing may be "square dance".

In this embodiment, both the extracted features and the feature samples in the feature sample library can be represented as vectors. Thus, comparing the extracted features with the feature samples in the feature sample library can mean computing the distance between the extracted features and each feature sample: the closer the distance, the more similar the extracted features are to the feature sample. In this way, the target feature sample in the feature sample library that is most similar to the extracted features can be determined, namely the sample with the smallest computed distance to the extracted features. Since the extracted features are most similar to the target feature sample, the content they represent is also most similar. Therefore, the text label associated with the target feature sample can be used as the scene label of the scene switching frame, and a corresponding scene label can be set for each scene switching frame in this way.

As shown in FIG. 4, the distances between the features extracted from a scene switching frame and the feature samples in the feature sample library may be 0.8, 0.5, 0.95, and 0.6 respectively; in that case, the text label of the feature sample at distance 0.5 is used as the scene label of the scene switching frame.
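
The nearest-sample lookup can be sketched as follows, assuming the feature sample library is held as a list of (text label, feature vector) pairs and Euclidean distance is used for the comparison; the 2-D toy vectors are purely hypothetical.

```python
import numpy as np

def scene_label_for(frame_feature: np.ndarray, sample_library) -> str:
    """Return the text label of the feature sample closest to the frame feature.
    sample_library is assumed to be a list of (text_label, feature_vector) pairs."""
    best_label, best_dist = None, float("inf")
    for label, sample_vec in sample_library:
        dist = np.linalg.norm(frame_feature - sample_vec)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Usage with hypothetical 2-D features for brevity:
library = [("playing football", np.array([1.0, 0.0])),
           ("square dance", np.array([0.0, 1.0]))]
print(scene_label_for(np.array([0.1, 0.9]), library))   # -> 'square dance'
```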

S3: Extract a topic label corresponding to the video from the text description information.

In this embodiment, the text description information can indicate the theme of the video fairly precisely, so the topic label corresponding to the video can be extracted from it. Specifically, the video playback website can summarize the text description information of a large number of videos, select the text labels that may serve as video topics, and build the selected text labels into a text label library; the content of the text label library can be updated continuously. In this way, when a topic label is extracted from the text description information, the text description information can be matched against the text labels in the text label library, and the matched text label can be used as the topic label of the video. For example, if the text description information of a video is "A foreign guy dances square dance with Chinese aunties, stunning everyone!", matching this description against the text labels in the text label library may yield the match "square dance", and "square dance" can then serve as the topic label of the video.

It should be noted that, since the text description information of a video is usually fairly long, matching it against the text labels in the text label library may yield at least two results. For example, if the text description information of the video is "A foreign guy dances square dance with Chinese aunties, stunning everyone!", matching it against the text labels in the text label library may yield three matches: "foreign guy", "Chinese auntie", and "square dance". On the one hand, all three matches can be used together as topic labels of the video. On the other hand, when the number of topic labels of the video is limited, suitable topic labels can be selected from the multiple matches. Specifically, in this embodiment, each text label in the text label library may be associated with a statistical count, where the statistical count represents the total number of times the text label has served as a topic label. The larger the count, the more often the corresponding text label has served as a video's topic label, and the more credible it is as a topic label. Therefore, when at least two text labels are matched, the matched text labels can be sorted in descending order of their statistical counts, and the top specified number of text labels in the sorted result can be used as the topic labels of the video, where the specified number is a predefined limit on the number of topic labels of the video. For example, if the number of topic labels of the video is limited to at most 2, the three matches "foreign guy", "Chinese auntie", and "square dance" can be sorted by their statistical counts, and the top two, "Chinese auntie" and "square dance", finally serve as the topic labels of the video.
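
A hedged sketch of this matching-and-ranking step is shown below; the label library, its statistical counts, and the English label strings are hypothetical stand-ins for the examples in the text.

```python
def extract_topic_labels(description: str, label_library: dict, max_labels: int) -> list:
    """Match the text description against a label library and keep the most
    frequently used labels. label_library is assumed to map each candidate
    label to the number of times it has previously served as a topic label."""
    matched = [label for label in label_library if label in description]
    matched.sort(key=lambda label: label_library[label], reverse=True)
    return matched[:max_labels]

# Hypothetical usage mirroring the example above (counts are made up):
library = {"foreign guy": 120, "Chinese auntie": 450, "square dance": 900}
print(extract_topic_labels("A foreign guy dances square dance with a Chinese auntie",
                           library, 2))
# -> ['square dance', 'Chinese auntie']
```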

S5: Select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic label, and generate a video summary of the video based on the target frames.

In this embodiment, considering that many scenes may appear in the video but not every scene switching frame is closely related to the theme of the video, target frames can be selected from the plurality of scene switching frames according to the association between the scene label of each scene switching frame and the topic label, so that the generated video summary accurately reflects the theme of the video.

In this embodiment, the association between a scene label and the topic label may refer to the degree of similarity between the scene label and the topic label: the more similar the scene label is to the topic label, the more relevant the content shown by the scene switching frame is to the theme of the video. Specifically, determining the association between a scene label and the topic label may include computing the similarity between the scene label of each scene switching frame and the topic label. In practice, both the scene label and the topic label may consist of words; when computing their similarity, the scene label and the topic label can each be represented as a word vector. The similarity between the scene label and the topic label can then be expressed through the spatial distance between the two word vectors: the closer the two word vectors, the higher the similarity between the scene label and the topic label; conversely, the farther apart the two word vectors, the lower the similarity. Thus, in practical application scenarios, the reciprocal of the spatial distance between the two word vectors can be taken as the similarity between the scene label and the topic label.

In this embodiment, after the similarity between the scene label and the topic label is computed, a scene switching frame whose computed similarity is greater than a specified similarity threshold may be determined as a target frame. The specified similarity threshold serves as the bar for judging whether a scene switching frame is sufficiently related to the theme: when the similarity is greater than the specified similarity threshold, the current scene switching frame is sufficiently related to the theme of the video and its content can accurately reflect that theme, so the scene switching frame can be determined as a target frame.
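
As one possible reading of this step, the sketch below computes the reciprocal-distance similarity between word vectors and keeps the frames above the threshold; the `word_vectors` mapping (label to embedding) is an assumed input, for example from a pre-trained word-embedding model.

```python
import numpy as np

def label_similarity(scene_label: str, topic_label: str, word_vectors: dict) -> float:
    """Reciprocal of the distance between the two labels' word vectors."""
    dist = np.linalg.norm(word_vectors[scene_label] - word_vectors[topic_label])
    return float("inf") if dist == 0 else 1.0 / dist

def select_target_frames(switch_frames, topic_label, word_vectors, sim_threshold):
    """Keep the scene switching frames whose scene label is sufficiently similar
    to the topic label. Each element of switch_frames is assumed to be a
    (frame, scene_label) pair."""
    return [frame for frame, label in switch_frames
            if label_similarity(label, topic_label, word_vectors) > sim_threshold]
```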

In this embodiment, the target frames selected from the scene switching frames are all closely related to the subject of the video, so a video summary of the video can be generated based on the target frames. Specifically, the video summary may be generated by arranging the target frames in the order in which they appear in the video. Alternatively, considering that the content shown in a video summary does not need to preserve the normal logic between consecutive frames, the target frames may be arranged randomly, and the arranged sequence of target frames may be used as the video summary of the video.

In one embodiment of the present application, considering that the scene label of a scene switching frame is usually set for the overall content of the frame, the scene label cannot accurately reflect the local details within the frame. In order to further improve the relevance of the target frames to the video's theme, in this embodiment the target objects contained in the scene switching frames may be recognized, and the target frames may be selected on the basis of the recognized target objects. Specifically, after the similarity between the scene label of each scene switching frame and the topic label is computed, a weight coefficient may be set for the corresponding scene switching frame according to the computed similarity: the higher the similarity between the scene label and the topic label, the larger the weight coefficient set for the corresponding scene switching frame. The weight coefficient may be a value between 0 and 1. For example, if the topic label of the current video is "square dance", then for two scene switching frames with scene labels "dance" and "kung fu", the weight coefficient set for the frame labeled "dance" may be 0.8, while the weight coefficient set for the frame labeled "kung fu" may be 0.4.

In this embodiment, after weight coefficients are set for the scene switching frames, the target objects contained in each scene switching frame may be recognized. Specifically, an AdaBoost algorithm, an R-CNN (Region-based Convolutional Neural Network) algorithm, or an SSD (Single Shot Detector) algorithm may be used to detect the target objects contained in the scene switching frame. For example, for a scene switching frame whose scene label is "dance", the R-CNN algorithm may recognize that the frame contains the two target objects "woman" and "loudspeaker". After the target objects contained in each scene switching frame are recognized, an association value can be set for the scene switching frame according to the association between the recognized target objects and the topic label. Specifically, the topic label may be associated with at least one object, namely objects that are closely related to the topic label; the objects associated with a topic label may be obtained by analyzing historical data. For example, when the topic label is "beach", the associated objects may include "seawater", "beach", "seagull", "swimsuit", "parasol", and so on. The target objects recognized in the scene switching frame can then be compared with the associated objects, and the number of target objects that appear among the associated objects can be counted. For the topic label "beach", suppose the target objects recognized in the scene switching frame are "parasol", "car", "beach", "trees", and "seawater"; the comparison shows that the target objects appearing among the associated objects are "parasol", "beach", and "seawater", i.e. the count is 3. In this embodiment, the product of this count and a specified value may be used as the association value of the scene switching frame. The specified value may be a preset number; for example, if the specified value is 10, the association value of the scene switching frame in the above example is 30. Thus, the more target objects appear among the associated objects, the more closely the local details of the scene switching frame are related to the video's theme, and the higher the corresponding association value.

In this embodiment, when determining target frames, the judgment may be based on both the overall features and the local features of the scene switching frames. Specifically, the product of each scene switching frame's weight coefficient and its association value may be computed, and the scene switching frames whose product is greater than a specified product threshold may be determined as the target frames. Using the product as the basis for judgment combines the overall features and the local features of the scene switching frame. The specified product threshold serves as the bar for judging whether a scene switching frame is a target frame and can be adjusted flexibly in practical application scenarios.
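
The combination of the global weight coefficient and the local association value can be sketched as follows; the per-object value of 10 follows the "beach" example above, and the dictionary layout of each frame record is an assumption of this illustration.

```python
def association_value(detected_objects, topic_related_objects, unit_value: float = 10.0) -> float:
    """Count how many detected objects also appear in the topic's related-object
    set and multiply by a fixed per-object value (10 in the example above)."""
    hits = sum(1 for obj in detected_objects if obj in topic_related_objects)
    return hits * unit_value

def pick_target_frames(frames, product_threshold: float):
    """Each element of frames is assumed to be a dict with keys 'weight'
    (set from the scene/topic label similarity) and 'assoc' (the association
    value computed from the detected objects)."""
    return [f for f in frames if f["weight"] * f["assoc"] > product_threshold]

# Example from the text: "beach" topic, three matched objects.
beach_objects = {"seawater", "beach", "seagull", "swimsuit", "parasol"}
detected = ["parasol", "car", "beach", "trees", "seawater"]
print(association_value(detected, beach_objects))   # -> 30.0
```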

In one embodiment of the present application, in some scenarios the total number of frames (or the total duration) of the video summary may be limited in advance. In that case, when determining target frames, the pre-specified total number of frames also needs to be taken into account. Specifically, when the total number of scene switching frames is greater than or equal to the specified total number of frames, enough frames can be drawn from the scene switching frames to form the video summary. In this case, based on the product of each scene switching frame's weight coefficient and association value computed in the above embodiment, the scene switching frames can be sorted in descending order of that product, and the top "specified total number" of scene switching frames in the sorted result can be determined as the target frames. For example, suppose the total number of frames in the video summary is limited to 1440, while 2000 scene switching frames are currently extracted from the video. The product of the weight coefficient and the association value of each scene switching frame can be computed in turn, and after sorting by the product in descending order, the top 1440 scene switching frames are used as the target frames, so that these 1440 target frames form a video summary that meets the requirement.

In this embodiment, when the total number of scene switching frames is less than the specified total number of frames, all the currently extracted scene switching frames are not enough to form a video summary that meets the requirement. In this case, a certain number of picture frames from the original video need to be inserted between the extracted scene switching frames so as to reach the total number of frames required by the video summary. Specifically, the original video frames can be inserted between two scene switching frames where the scene jump is large, which helps preserve the continuity of the content. In this embodiment, at least one video frame of the original video may be inserted between two adjacent scene switching frames whose similarity is less than a judgment threshold, where two adjacent scene switching frames whose similarity is less than the judgment threshold can be regarded as two scene switching frames whose content is only weakly related. Frames of the original video can be inserted frame by frame between two weakly related scene switching frames until the total number of frames after insertion equals the specified total number of frames. In this way, the original scene switching frames together with the inserted picture frames can all serve as the target frames and form the video summary of the video.
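
The two cases described above (enough scene switching frames versus too few) might be combined roughly as follows; the data layout, the `pairwise_similarity` callable, and the fill strategy between weakly related neighbours are assumptions of this sketch.

```python
def build_fixed_length_summary(switch_frames, total_frames_required, video_frames,
                               pairwise_similarity, judge_threshold):
    """Assemble the target-frame index list for a fixed-length summary.

    switch_frames: list of dicts with keys 'index' (position in the original
    video), 'weight' (weight coefficient) and 'assoc' (association value).
    video_frames: all frames of the original video.
    pairwise_similarity: callable returning the similarity of two frames.
    """
    if len(switch_frames) >= total_frames_required:
        # Enough switch frames: keep those with the largest weight*assoc product.
        ranked = sorted(switch_frames, key=lambda f: f["weight"] * f["assoc"], reverse=True)
        kept = ranked[:total_frames_required]
        return sorted(f["index"] for f in kept)          # restore play order

    # Too few switch frames: insert original frames between weakly related
    # neighbouring switch frames until the required count is reached.
    ordered = sorted(f["index"] for f in switch_frames)
    result = list(ordered)
    missing = total_frames_required - len(result)
    for a, b in zip(ordered, ordered[1:]):
        if missing <= 0:
            break
        if pairwise_similarity(video_frames[a], video_frames[b]) < judge_threshold:
            for idx in range(a + 1, b):
                if missing <= 0:
                    break
                result.append(idx)
                missing -= 1
    return sorted(result)
```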

In one embodiment of the present application, the number of topic labels extracted from the video's text description information may be at least two. In this case, for each scene switching frame, the similarity between the frame's scene label and each of the topic labels can be computed. For example, if the current topic labels are label 1 and label 2, the similarities between the current scene switching frame and label 1 and label 2 can be computed separately, yielding a first similarity and a second similarity for the current scene switching frame. After the similarities for a scene switching frame are computed, they can be accumulated to obtain the cumulative similarity of that scene switching frame; for example, the sum of the first similarity and the second similarity above can be taken as the cumulative similarity of the current scene switching frame. In this embodiment, after the cumulative similarity of each scene switching frame is computed, the cumulative similarity can likewise be compared against a specified similarity threshold, and the scene switching frames whose cumulative similarity is greater than the specified similarity threshold can be determined as the target frames.
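
A minimal sketch of this cumulative-similarity variant, again assuming a `word_vectors` mapping from labels to embeddings:

```python
import numpy as np

def cumulative_similarity(scene_label, topic_labels, word_vectors) -> float:
    """Sum the reciprocal-distance similarities of one scene label against
    every topic label (at least two topic labels in this case)."""
    total = 0.0
    for topic in topic_labels:
        dist = np.linalg.norm(word_vectors[scene_label] - word_vectors[topic])
        total += float("inf") if dist == 0 else 1.0 / dist
    return total

def select_by_cumulative_similarity(switch_frames, topic_labels, word_vectors, sim_threshold):
    """Keep frames whose cumulative similarity exceeds the threshold.
    Each element of switch_frames is assumed to be a (frame, scene_label) pair."""
    return [frame for frame, label in switch_frames
            if cumulative_similarity(label, topic_labels, word_vectors) > sim_threshold]
```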

Referring to FIG. 5, the present application further provides a device for generating a video summary, where the video has text description information. The device includes: a scene switching frame extraction unit 100, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extraction unit 200, configured to extract a topic label corresponding to the video from the text description information; and a video summary generation unit 300, configured to select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic label, and to generate a video summary of the video based on the target frames.

In this embodiment, the scene switching frame extraction unit 100 includes: a similarity computation module, configured to determine a reference frame in the video and compute in turn the similarity between each frame after the reference frame and the reference frame; a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and a loop execution module, configured to use the current frame as a new reference frame and compute in turn the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the computed similarities.

In this embodiment, the scene switching frame extraction unit 100 includes: a feature extraction module, configured to extract features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature; a comparison module, configured to compare the extracted features with the feature samples in a feature sample library, where each feature sample in the feature sample library is associated with a text label; and a target feature sample determination module, configured to determine the target feature sample in the feature sample library that is most similar to the extracted features and to use the text label associated with the target feature sample as the scene label of the scene switching frame.

In this embodiment, the video summary generation unit 300 includes: a similarity computation module, configured to compute the similarity between the scene label of the scene switching frame and the topic label; a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the computed similarity; an association value setting module, configured to recognize the target objects contained in the scene switching frame and set an association value for the scene switching frame according to the association between the recognized target objects and the topic label; and a target frame determination module, configured to compute the product of the scene switching frame's weight coefficient and association value and to determine the scene switching frames whose product is greater than a specified product threshold as the target frames.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Those skilled in the art will also appreciate that, in addition to implementing the device purely as computer-readable program code, the method steps can be logically programmed so that the device achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a device may therefore be regarded as a hardware component, and the means included within it for implementing various functions may also be regarded as structures within the hardware component. Alternatively, the means for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.

As can be seen from the above, the present application first extracts scene switching frames whose similarity satisfies a specified condition from the video and sets a corresponding scene label for each scene switching frame. The theme label of the video is then determined from its text description information, and this theme label accurately characterizes the subject of the video. Finally, by determining the relevance between the scene labels and the theme label, the target frames most closely related to the theme are retained from the scene switching frames, so that the video summary generated from these target frames accurately represents the subject matter of the video.
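The theme label step mentioned here is spelled out later, in the claims, as matching the description text against a text label library whose labels carry usage counts. A minimal sketch under that reading is shown below; the dictionary layout and the simple substring matching are simplifications made here for illustration.

```python
def extract_theme_labels(text_description, tag_library, top_n=1):
    """Match the video's title/synopsis against a text label library and keep
    the most frequently used matches as theme labels.

    tag_library -- dict mapping text label -> number of times that label has
                   been used as a theme label (illustrative layout)
    """
    matched = [(label, count) for label, count in tag_library.items()
               if label in text_description]
    matched.sort(key=lambda item: item[1], reverse=True)  # most-used labels first
    return [label for label, _ in matched[:top_n]]
```

For example, under these assumptions extract_theme_labels("street basketball highlights", {"basketball": 120, "street dance": 40}) returns ["basketball"].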

In the 1990s, it was easy to tell whether an improvement to a technology was an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer can "integrate" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of fabricating integrated circuit chips by hand, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language known as a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most widely used. Those skilled in the art will also appreciate that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiments can be understood with reference to the description of the corresponding method embodiments.

Although the present application has been described through the embodiments, those of ordinary skill in the art will appreciate that many variations and changes can be made to the present application without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the present application.

Claims (17)

1. A method for generating a video summary, the video having text description information, the method comprising: extracting a plurality of scene switching frames from the video and setting scene labels for the scene switching frames, wherein the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a theme label corresponding to the video from the text description information; and selecting target frames from the plurality of scene switching frames according to the relevance between the scene labels of the scene switching frames and the theme label, and generating a video summary of the video based on the target frames.

2. The method according to claim 1, wherein extracting a plurality of scene switching frames from the video comprises: determining a reference frame in the video and sequentially calculating the similarity between each frame following the reference frame and the reference frame; when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, determining the current frame as a scene switching frame; and taking the current frame as a new reference frame and sequentially calculating the similarity between each frame following the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.

3. The method according to claim 2, wherein the similarity between two adjacent scene switching frames satisfying the specified condition comprises: the similarity between two adjacent scene switching frames being less than or equal to the specified threshold.

4. The method according to claim 2, wherein calculating the similarity between a frame following the reference frame and the reference frame comprises: extracting a first feature vector and a second feature vector of the reference frame and the current frame respectively, wherein the first feature vector and the second feature vector respectively represent scale-invariant features of the reference frame and the current frame; and calculating the spatial distance between the first feature vector and the second feature vector, and taking the reciprocal of the spatial distance as the similarity between the reference frame and the current frame.
5. The method according to claim 1, wherein setting scene labels for the scene switching frames comprises: extracting features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature; comparing the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are associated with text labels; and determining the target feature sample in the feature sample library that is most similar to the extracted features, and taking the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.

6. The method according to claim 1, wherein the text description information includes a title and/or a synopsis of the video; and accordingly, extracting a theme label corresponding to the video from the text description information comprises: matching the text description information against text labels in a text label library, and taking the matched text labels as theme labels of the video.

7. The method according to claim 6, wherein the text labels in the text label library are associated with statistical counts, each statistical count representing the total number of times the text label has been used as a theme label; and accordingly, when at least two text labels are matched, the method further comprises: sorting the matched text labels in descending order of statistical count, and taking the top specified number of text labels in the sorted result as the theme labels of the video.

8. The method according to claim 1, wherein selecting target frames from the plurality of scene switching frames comprises: calculating the similarity between the scene label of a scene switching frame and the theme label, and determining the scene switching frames whose calculated similarity is greater than a specified similarity threshold as the target frames.

9. The method according to claim 8, wherein, after calculating the similarity between the scene label of the scene switching frame and the theme label, the method further comprises: setting a weight coefficient for the corresponding scene switching frame according to the calculated similarity; identifying the target objects contained in the scene switching frame, and setting an association value for the scene switching frame according to the relevance between the identified target objects and the theme label; and calculating the product of the weight coefficient and the association value of the scene switching frame, and determining the scene switching frames whose product is greater than a specified product threshold as the target frames.
10. The method according to claim 9, wherein the theme label is associated with at least one object; and accordingly, setting an association value for the scene switching frame comprises: comparing the target objects identified in the scene switching frame with the at least one object and counting the number of target objects that appear among the at least one object; and taking the product of the counted number and a specified value as the association value of the scene switching frame.

11. The method according to claim 9, wherein the video summary of the video has a specified total number of frames; and accordingly, after calculating the product of the weight coefficient and the association value of the scene switching frame, the method further comprises: when the total number of scene switching frames is greater than or equal to the specified total number of frames, sorting the scene switching frames in descending order of the product, and determining the top scene switching frames in the sorted result, up to the specified total number of frames, as the target frames.

12. The method according to claim 11, wherein the method further comprises: when the total number of scene switching frames is less than the specified total number of frames, inserting at least one video frame of the video between two adjacent scene switching frames whose similarity is less than a determination threshold, so that the total number of scene switching frames after the insertion of the at least one video frame equals the specified total number of frames.

13. The method according to claim 1, wherein, when there are at least two theme labels, selecting target frames from the plurality of scene switching frames comprises: for each scene switching frame, calculating the similarity between the scene label of the scene switching frame and each theme label; accumulating the similarities calculated for the scene switching frame to obtain a cumulative similarity corresponding to the scene switching frame; and determining the scene switching frames whose cumulative similarity is greater than a specified similarity threshold as the target frames.
14. A device for generating a video summary, the video having text description information, the device comprising: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, wherein the similarity between two adjacent scene switching frames satisfies a specified condition; a theme label extraction unit, configured to extract a theme label corresponding to the video from the text description information; and a video summary generation unit, configured to select target frames from the plurality of scene switching frames according to the relevance between the scene labels of the scene switching frames and the theme label, and generate a video summary of the video based on the target frames.

15. The device according to claim 14, wherein the scene switching frame extraction unit comprises: a similarity calculation module, configured to determine a reference frame in the video and sequentially calculate the similarity between each frame following the reference frame and the reference frame; a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and a loop execution module, configured to take the current frame as a new reference frame and sequentially calculate the similarity between each frame following the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.

16. The device according to claim 14, wherein the scene switching frame extraction unit comprises: a feature extraction module, configured to extract features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature; a comparison module, configured to compare the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are associated with text labels; and a target feature sample determination module, configured to determine the target feature sample in the feature sample library that is most similar to the extracted features and take the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
17. The device according to claim 14, wherein the video summary generation unit comprises: a similarity calculation module, configured to calculate the similarity between the scene label of a scene switching frame and the theme label; a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity; an association value setting module, configured to identify the target objects contained in the scene switching frame and set an association value for the scene switching frame according to the relevance between the identified target objects and the theme label; and a target frame determination module, configured to calculate the product of the weight coefficient and the association value of the scene switching frame and determine the scene switching frames whose product is greater than a specified product threshold as the target frames.
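Two of the claim details above lend themselves to short illustrative helpers: the association value of claim 10 (the number of recognised target objects that also appear among the objects associated with the theme label, multiplied by a specified value) and the cumulative similarity of claim 13 (summing a frame's label similarity over several theme labels). The sketch below is one reading of those claims; the label_similarity_fn callable is assumed rather than specified by the application.

```python
def association_value(recognised_objects, theme_objects, specified_value):
    """Claim 10 reading: count the recognised target objects that also appear
    among the objects associated with the theme label, then scale by the
    specified value."""
    hits = sum(1 for obj in recognised_objects if obj in theme_objects)
    return hits * specified_value

def cumulative_similarity(scene_label, theme_labels, label_similarity_fn):
    """Claim 13 reading: accumulate the scene label's similarity to each theme
    label; frames whose cumulative similarity exceeds a threshold are kept as
    target frames."""
    return sum(label_similarity_fn(scene_label, theme) for theme in theme_labels)
```
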
TW107103624A 2017-07-05 2018-02-01 Method and device for generating video summary TWI712316B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
??201710541793.1 2017-07-05
CN201710541793.1A CN109213895A (en) 2017-07-05 2017-07-05 A kind of generation method and device of video frequency abstract
CN201710541793.1 2017-07-05

Publications (2)

Publication Number Publication Date
TW201907736A true TW201907736A (en) 2019-02-16
TWI712316B TWI712316B (en) 2020-12-01

Family

ID=64949707

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107103624A TWI712316B (en) 2017-07-05 2018-02-01 Method and device for generating video summary

Country Status (3)

Country Link
CN (1) CN109213895A (en)
TW (1) TWI712316B (en)
WO (1) WO2019007020A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms
CN110263650B (en) * 2019-05-22 2022-02-22 北京奇艺世纪科技有限公司 Behavior class detection method and device, electronic equipment and computer readable medium
CN110298270B (en) * 2019-06-14 2021-12-31 天津大学 Multi-video abstraction method based on cross-modal importance perception
CN110149531A (en) * 2019-06-17 2019-08-20 北京影谱科技股份有限公司 The method and apparatus of video scene in a kind of identification video data
CN112153462B (en) * 2019-06-26 2023-02-14 腾讯科技(深圳)有限公司 Video processing method, device, terminal and storage medium
CN110297943B (en) * 2019-07-05 2022-07-26 联想(北京)有限公司 Label adding method and device, electronic equipment and storage medium
CN111275097B (en) * 2020-01-17 2021-06-18 北京世纪好未来教育科技有限公司 Video processing method and system, picture processing method and system, equipment and medium
TWI741550B (en) * 2020-03-31 2021-10-01 國立雲林科技大學 Method for bookmark frame generation, and video player device with automatic generation of bookmark and user interface thereof
CN111641868A (en) * 2020-05-27 2020-09-08 维沃移动通信有限公司 Preview video generation method and device and electronic equipment
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006510248A (en) * 2002-12-11 2006-03-23 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and system for obtaining text keywords or phrases for providing content-related links to network-based resources using video content
CN100559376C (en) * 2008-06-30 2009-11-11 腾讯科技(深圳)有限公司 Generate method, system and the equipment of video frequency abstract
US8705933B2 (en) * 2009-09-25 2014-04-22 Sony Corporation Video bookmarking
US8665345B2 (en) * 2011-05-18 2014-03-04 Intellectual Ventures Fund 83 Llc Video summary including a feature of interest
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN103810711A (en) * 2014-03-03 2014-05-21 郑州日兴电子科技有限公司 Keyframe extracting method and system for monitoring system videos
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method
CN106921891B (en) * 2015-12-24 2020-02-11 北京奇虎科技有限公司 Method and device for displaying video characteristic information
CN105868292A (en) * 2016-03-23 2016-08-17 中山大学 Video visualization processing method and system
CN106713964A (en) * 2016-12-05 2017-05-24 乐视控股(北京)有限公司 Method of generating video abstract viewpoint graph and apparatus thereof

Also Published As

Publication number Publication date
TWI712316B (en) 2020-12-01
CN109213895A (en) 2019-01-15
WO2019007020A1 (en) 2019-01-10

Similar Documents

Publication Publication Date Title
TWI712316B (en) Method and device for generating video summary
US10528821B2 (en) Video segmentation techniques
US9271035B2 (en) Detecting key roles and their relationships from video
CN104994426B (en) Program video identification method and system
Mussel Cirne et al. VISCOM: A robust video summarization approach using color co-occurrence matrices
Shekhar et al. Show and recall: Learning what makes videos memorable
Mahapatra et al. Coherency based spatio-temporal saliency detection for video object segmentation
CN117095349A (en) Appearance search system, method, and non-transitory computer readable medium
US20120027295A1 (en) Key frames extraction for video content analysis
Thomas et al. Perceptual video summarization—A new framework for video summarization
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN113766330A (en) Method and device for generating recommendation information based on video
CN110765314A (en) Video semantic structural extraction and labeling method
EP2345978B1 (en) Detection of flash illuminated scenes in video clips and related ranking of video clips
JP2009060413A (en) Method and system for extracting feature of moving image, and method and system for retrieving moving image
Premaratne et al. Structural approach for event resolution in cricket videos
Mishra et al. Real time and non real time video shot boundary detection using dual tree complex wavelet transform
Khan et al. RICAPS: residual inception and cascaded capsule network for broadcast sports video classification
Cirne et al. Summarization of videos by image quality assessment
Choroś et al. Improved method of detecting replay logo in sports videos based on contrast feature and histogram difference
Farhat et al. New approach for automatic view detection system in tennis video
Karthick et al. Automatic genre classification from videos
Ding et al. A keyframe extraction method based on transition detection and image entropy
Glasberg et al. Cartoon-recognition using visual-descriptors and a multilayer-perceptron
Shih et al. Detection of the highlights in baseball video program

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees