TWI712316B - Method and device for generating video summary - Google Patents

Method and device for generating video summary

Info

Publication number
TWI712316B
Authority
TW
Taiwan
Prior art keywords
frame
scene switching
video
scene
similarity
Prior art date
Application number
TW107103624A
Other languages
Chinese (zh)
Other versions
TW201907736A (en)
Inventor
葛雷鳴
Original Assignee
英屬開曼群島商阿里巴巴集團控股有限公司
Priority date
Filing date
Publication date
Application filed by 英屬開曼群島商阿里巴巴集團控股有限公司
Publication of TW201907736A
Application granted
Publication of TWI712316B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Abstract

The embodiments of the present application disclose a method and device for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video and setting a scene label for each scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic label corresponding to the video from the text description information; and, based on the association between the scene labels of the scene switching frames and the topic label, selecting target frames from the plurality of scene switching frames and generating a video summary of the video based on the target frames. The technical solution provided by the present application can accurately characterize the theme of the video while improving efficiency.

Description

Method and device for generating video summary

The present application relates to the field of Internet technology, and in particular to a method and device for generating video summaries.

Currently, in order to let users learn the content of a video in a short time, video playback platforms usually produce a corresponding video summary for each uploaded video. The video summary may be a short video containing a subset of the scenes of the original video, so that a user watching the summary can quickly grasp the general content of the original video.

At present, one way to produce a video summary is manual editing: staff of the video playback platform watch the entire video and then cut out the more important segments to form its summary. A summary produced in this way characterizes the information contained in the video fairly accurately, but as the number of videos grows rapidly, this approach consumes considerable manpower and is also quite slow.

In view of this, to save manpower and improve production efficiency, video summaries are currently usually produced with image recognition technology. Specifically, the uploaded video can be sampled at a fixed time interval to extract multiple image frames. The similarity between adjacent frames is then computed in turn, and pairs of frames with low similarity are retained, so that the retained frames cover the content of multiple scenes. The retained image frames are finally assembled into the video summary.

Although the prior-art method of producing a video summary through image recognition improves production efficiency, selecting the frames of the summary by fixed sampling and similarity comparison easily misses key scenes in the video, so the generated summary cannot accurately reflect the theme of the video.

The purpose of the embodiments of the present application is to provide a method and device for generating a video summary that can accurately characterize the theme of the video while improving efficiency.

To achieve the above objective, an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video and setting a scene label for each scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic label corresponding to the video from the text description information; and, based on the association between the scene labels of the scene switching frames and the topic label, selecting target frames from the plurality of scene switching frames and generating a video summary of the video based on the target frames.

To achieve the above objective, an embodiment of the present application further provides a device for generating a video summary, where the video has text description information. The device includes: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set a scene label for each scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extraction unit, configured to extract a topic label corresponding to the video from the text description information; and a video summary generation unit, configured to select target frames from the plurality of scene switching frames based on the association between the scene labels of the scene switching frames and the topic label, and to generate a video summary of the video based on the target frames.

As can be seen from the above, the present application first extracts, from the video, scene switching frames whose similarity satisfies a specified condition and sets a corresponding scene label for each scene switching frame. The topic label of the video is then determined from the text description information of the video; this topic label accurately characterizes the theme of the video. Next, by determining the association between the scene labels and the topic label, the target frames most closely related to the theme are retained from the scene switching frames. In this way, the video summary generated from the target frames can accurately characterize the subject matter of the video.

100‧‧‧Scene switching frame extraction unit

200‧‧‧Topic label extraction unit

300‧‧‧Video summary generation unit

S1, S3, S5‧‧‧Steps

In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of the method for generating a video summary in an embodiment of the present application; FIG. 2 is a schematic diagram of target frames and scene switching frames in an embodiment of the present application; FIG. 3 is a schematic diagram of extracting scene switching frames in an embodiment of the present application; FIG. 4 is a schematic diagram of extracting scene labels in an embodiment of the present application; FIG. 5 is a functional block diagram of the device for generating a video summary in an embodiment of the present application.

In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

The present application provides a method for generating a video summary. The method can be applied to an electronic device with a data processing function, for example a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, or a television with network access. The method can also be applied to software running on the above electronic devices, such as software with a video production function or a video playback function. In addition, the method can be applied to the server of a video playback website, for example iQiyi, Sohu Video, or Acfun. The number of servers is not specifically limited in this embodiment: there may be one server, several servers, or a server cluster formed by several servers.

In this embodiment, the video summary is generated based on a video. The video may be a local video of the user or a video uploaded by the user to a video playback website. The video usually has text description information, which may be the title of the video or its synopsis. The title and the synopsis may be edited in advance by the video producer or the uploader, or added by staff who review the video; the present application does not limit this. Of course, in practical applications, besides the title and synopsis of the video, the text description information may also include text tags of the video or descriptive phrases extracted from the video's bullet-screen (danmaku) comments.

Referring to FIG. 1 and FIG. 2, the method for generating a video summary provided by the present application may include the following steps.

S1: Extract a plurality of scene switching frames from the video and set a scene label for each scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition.

In this embodiment, the video may be stored locally or on another device. Accordingly, the video may be obtained by loading it locally from a specified path, or by downloading it according to a Uniform Resource Locator (URL) provided by another device.

In this embodiment, after the video is obtained, each frame of the video can be analyzed to extract the plurality of scene switching frames. In order to obtain a scene switching frame for every scene of the video, the extraction can be performed by frame-by-frame comparison. Specifically, a reference frame is first determined in the video, and the similarity between each frame after the reference frame and the reference frame is then computed in turn.

In this embodiment, the reference frame may be a frame randomly designated within a certain range. For example, it may be a frame selected at random within the first 2 minutes of the video. Of course, in order not to miss any scene in the video, the first frame of the video can be used as the reference frame.

In this embodiment, after the reference frame is determined, the frames following it can be compared with it one by one, starting from the reference frame, to compute the similarity between each subsequent frame and the reference frame. Specifically, when computing the similarity between a frame and the reference frame, a first feature vector of the reference frame and a second feature vector of the current frame can be extracted respectively.

In this embodiment, the first feature vector and the second feature vector may take various forms. For example, the feature vector of a frame can be built from the pixel values of its pixels. A frame is usually composed of a number of pixels arranged in a certain order, each with its own pixel value, which together form a colorful picture. A pixel value is a number within a specified interval, for example any value from 0 to 255, where the magnitude of the value indicates the depth of the color. In this embodiment, the pixel value of each pixel of a frame can be obtained, and the obtained values can be assembled into the feature vector of that frame. For example, for a current frame with 9*9=81 pixels, the pixel values can be read in order from left to right and top to bottom and arranged into an 81-dimensional vector, which then serves as the feature vector of the current frame.
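
As a minimal sketch of this flattening step (assuming the frame is held in a NumPy array; the function name is illustrative and not part of the original description), the 81-dimensional example above could be built as follows:

```python
import numpy as np

def pixel_feature_vector(frame: np.ndarray) -> np.ndarray:
    """Flatten a frame's pixel values, row by row from left to right,
    into a 1-D feature vector. For a 9x9 grayscale frame with values
    in [0, 255] this yields the 81-dimensional vector of the example."""
    return frame.astype(np.float32).reshape(-1)
```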

In this embodiment, the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame. Specifically, the reference frame and the frames following it can be fed into a convolutional neural network, which then outputs the feature vectors corresponding to the reference frame and each of the other frames.

In this embodiment, in order to accurately characterize the content shown in the reference frame and the current frame, the first feature vector and the second feature vector may also represent scale-invariant features of the reference frame and the current frame respectively. In this way, even if the rotation angle, brightness, or shooting viewpoint of the image changes, the extracted first and second feature vectors can still reflect the content of the reference frame and the current frame well. Specifically, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded Up Robust Features) features, color histogram features, or the like.

In this embodiment, after the first feature vector and the second feature vector are determined, the similarity between them can be computed. Specifically, the similarity can be expressed through the distance between the two vectors in the vector space: the closer the distance, the more similar the two vectors and the higher the similarity; the farther the distance, the greater the difference between them and the lower the similarity. Therefore, when computing the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector can be computed, and the reciprocal of that distance can be taken as the similarity between the reference frame and the current frame. A smaller spatial distance then corresponds to a larger similarity, indicating that the reference frame and the current frame are more alike; conversely, a larger spatial distance corresponds to a smaller similarity, indicating that they are less alike.
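
A small sketch of the reciprocal-of-distance similarity described here, assuming Euclidean distance between the two feature vectors and adding a small epsilon (an assumption of this sketch, not part of the original description) so that identical frames do not divide by zero:

```python
import numpy as np

def frame_similarity(feat_a: np.ndarray, feat_b: np.ndarray, eps: float = 1e-6) -> float:
    """Similarity between two frames: the reciprocal of the Euclidean distance
    between their feature vectors. The closer the vectors, the larger the result."""
    distance = float(np.linalg.norm(feat_a - feat_b))
    return 1.0 / (distance + eps)
```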

In this embodiment, the similarity between each frame after the reference frame and the reference frame can be computed in turn in the above way. Two frames with high similarity usually show similar content, while the purpose of a video summary is to present the content of different scenes of the video to the user. Therefore, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame can be determined as a scene switching frame. The specified threshold is a preset value that can be adjusted flexibly according to the actual situation: for example, if too many scene switching frames are selected under the current threshold, the threshold can be lowered appropriately; if too few are selected, it can be raised appropriately. A similarity less than or equal to the specified threshold indicates that the content of the two frames is already clearly different, so the scene shown in the current frame can be considered to have changed relative to the scene shown in the reference frame. At this point, the current frame can be retained as a frame at which the scene switches.

In this embodiment, after the current frame is determined as a scene switching frame, subsequent scene switching frames can be determined in the same way. Specifically, from the reference frame to the current frame the scene can be regarded as having changed once, so the current scene is now the content shown in the current frame. On this basis, the current frame can be taken as a new reference frame, and the similarity between each frame following the new reference frame and the new reference frame can be computed in turn, so that the next scene switching frame is determined from the computed similarity. Likewise, when determining the next scene switching frame, the similarity between two frames can still be obtained by extracting feature vectors and computing the spatial distance, and the computed similarity is again compared with the specified threshold, so as to find the next frame after the new reference frame at which the scene changes again.

Referring to FIG. 3, in this embodiment, after the next scene switching frame is determined, it in turn becomes the new reference frame, and the extraction of subsequent scene switching frames continues. In this way, by changing the reference frame in sequence, every frame at which the scene in the video changes can be extracted, so that no scene shown in the video is missed and the completeness of the video summary is guaranteed. In FIG. 3, the rectangular bars filled with diagonal lines represent scene switching frames, and the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold.
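
Putting the previous paragraphs together, a hedged sketch of the frame-by-frame extraction loop might look like the following; `feature_fn` and `similarity_fn` stand in for the feature-extraction and similarity steps above, and all names are illustrative rather than part of the original method:

```python
def extract_scene_switch_frames(frames, threshold, feature_fn, similarity_fn):
    """Walk through the video frame by frame: whenever the similarity between the
    current frame and the running reference frame drops to or below `threshold`,
    record the current frame as a scene switching frame and make it the new
    reference frame."""
    switch_frames = []
    frames = iter(frames)
    reference = next(frames, None)            # use the first frame as the reference
    if reference is None:
        return switch_frames
    ref_feat = feature_fn(reference)
    for current in frames:
        cur_feat = feature_fn(current)
        if similarity_fn(ref_feat, cur_feat) <= threshold:
            switch_frames.append(current)     # the scene has changed
            reference, ref_feat = current, cur_feat  # current frame becomes the new reference
    return switch_frames
```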

In this embodiment, among the scene switching frames extracted in the above way, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold. Therefore, "the similarity between two adjacent scene switching frames satisfies a specified condition" can mean that the similarity between two adjacent scene switching frames is less than or equal to the specified threshold.

In this embodiment, after the plurality of scene switching frames are extracted, a scene label can be set for each scene switching frame. The scene label may be a text label that characterizes the content shown in the scene switching frame. For example, if a scene switching frame shows two people fighting, its scene label may be "martial arts", "fighting", "kung fu", or the like.

In this embodiment, the content of a scene switching frame can be recognized to determine its corresponding scene label. Specifically, features of the scene switching frame can be extracted, where the features include at least one of a color feature, a texture feature, and a shape feature. The color feature may be extracted based on different color spaces, for example the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space, or the HSI (Hue, Saturation, Intensity) space. Each color space has multiple color components; for example, the RGB space has an R component, a G component, and a B component. Different pictures have different color components, so the color components can be used to characterize the scene switching frame.

In addition, the texture feature may be used to describe the material appearing in the scene switching frame. Texture is usually reflected by the distribution of gray levels and corresponds to the low-frequency and high-frequency components of the image spectrum. Thus the low-frequency and high-frequency components of the image contained in the scene switching frame can also serve as features of the scene switching frame.

In this embodiment, the shape features may include edge-based shape features and region-based shape features. Specifically, a Fourier-transform description of the boundary can be used as the edge-based shape feature, and invariant moment descriptors can be used as the region-based shape feature.

Referring to FIG. 4, in this embodiment, after the features of each scene switching frame are extracted, the extracted features can be compared with the feature samples in a feature sample library. The feature sample library may be a sample set summarized from historical image recognition data and contains feature samples characterizing different kinds of content; a feature sample may likewise be at least one of the above color, texture, and shape features. For example, the library may contain feature samples characterizing playing football, feature samples characterizing dancing, feature samples characterizing fighting, and so on. Each feature sample in the library can be associated with a text label that describes the content corresponding to that feature sample. For example, the text label associated with the feature sample characterizing playing football may be "playing football", and the text label of the feature sample characterizing dancing may be "square dance".

In this embodiment, both the extracted features and the feature samples in the feature sample library can be represented as vectors. Comparing the extracted features with the feature samples in the library then means computing the distance between the extracted features and each feature sample: the closer the distance, the more similar the extracted features are to that sample. In this way, the target feature sample in the library that is most similar to the extracted features can be determined, namely the sample with the smallest computed distance. Since the extracted features are most similar to the target feature sample, the content the two represent is also the most similar, so the text label associated with the target feature sample can be used as the scene label of the scene switching frame. A corresponding scene label can thus be set for each scene switching frame.

As shown in FIG. 4, the distances between the features extracted from a scene switching frame and the feature samples in the library may be 0.8, 0.5, 0.95, and 0.6 respectively; in that case, the text label of the feature sample at distance 0.5 is used as the scene label of the scene switching frame.
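
A minimal sketch of this nearest-sample lookup, assuming the feature sample library is a simple mapping from text label to feature-sample vector (the dictionary layout is an assumption of the sketch, not part of the original description):

```python
import numpy as np

def scene_label(frame_feature: np.ndarray, sample_library: dict) -> str:
    """Pick the text label whose feature sample lies closest to the frame's feature,
    e.g. sample_library = {"playing football": v1, "square dance": v2, ...}."""
    best_label, best_distance = None, float("inf")
    for label, sample in sample_library.items():
        distance = float(np.linalg.norm(frame_feature - sample))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```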

S3: Extract the topic label corresponding to the video from the text description information.

In this embodiment, the text description information can indicate the theme of the video fairly precisely, so the topic label corresponding to the video can be extracted from it. Specifically, the video playback website can summarize the text description information of a large number of videos, select the text labels that are likely to serve as video topics, and build a text label library from them; the content of this library can be updated continuously. Then, when extracting a topic label from the text description information, the text description information can be matched against the text labels in the library, and a matched text label is taken as the topic label of the video. For example, if the text description information of the video is "A foreign guy dances square dance with Chinese aunts, stunning everyone!", matching it against the text label library yields the result "square dance", so "square dance" can be used as the topic label of the video.

It should be noted that, since the text description information of a video is usually fairly long, matching it against the text label library may yield at least two results. For example, if the text description information is "A foreign guy dances square dance with Chinese aunts, stunning everyone!", matching it against the library may yield the three results "foreign guy", "Chinese aunt", and "square dance". On the one hand, all three matched results can be used as topic labels of the video at the same time. On the other hand, when the number of topic labels of the video is limited, suitable topic labels can be selected from the matched results. Specifically, each text label in the library can be associated with a statistics count that represents the total number of times the text label has been used as a topic label. The larger the count, the more often the corresponding text label has served as the topic label of a video, and the more credible it is as a topic label. Therefore, when at least two text labels are matched, they can be sorted in descending order of their statistics counts, and the top specified number of text labels in the ranking are used as the topic labels of the video, where the specified number is a predefined limit on the number of topic labels. For example, if the number of topic labels is limited to at most 2, the three matched results "foreign guy", "Chinese aunt", and "square dance" can be sorted by their statistics counts, and the top two, "Chinese aunt" and "square dance", finally serve as the topic labels of the video.
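
The matching and ranking described above could be sketched as follows, assuming the text label library is represented as a mapping from label to its statistics count and that matching is plain substring matching (both are assumptions of this sketch):

```python
def extract_topic_labels(description: str, label_counts: dict, max_labels: int = 2) -> list:
    """Match the text description against the text label library and keep the
    labels most often used as topic labels, up to the predefined limit."""
    matched = [label for label in label_counts if label in description]
    matched.sort(key=lambda label: label_counts[label], reverse=True)
    return matched[:max_labels]
```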

S5: Based on the association between the scene labels of the scene switching frames and the topic label, select target frames from the plurality of scene switching frames and generate a video summary of the video based on the target frames.

In this embodiment, considering that a video may contain many scenes but not every corresponding scene switching frame is closely related to the theme of the video, target frames can be selected from the plurality of scene switching frames based on the association between each scene switching frame's scene label and the topic label, so that the generated video summary accurately reflects the theme of the video.

In this embodiment, the association between a scene label and the topic label may refer to the degree of similarity between them: the more similar the scene label and the topic label, the more relevant the content shown in the scene switching frame is to the theme of the video. Specifically, determining this association may include computing the similarity between the scene label of each scene switching frame and the topic label. In practice, both the scene label and the topic label consist of words, so when computing the similarity between them, the scene label and the topic label can each be represented by a word vector. The similarity between the scene label and the topic label can then be expressed through the spatial distance between the two word vectors: the closer the distance, the higher the similarity; the farther the distance, the lower the similarity. In practical application scenarios, the reciprocal of the spatial distance between the two word vectors can therefore be used as the similarity between the scene label and the topic label.

In this embodiment, after the similarity between the scene label and the topic label is computed, a scene switching frame whose computed similarity is greater than a specified similarity threshold can be determined as a target frame. The specified similarity threshold serves as the bar for deciding whether a scene switching frame is sufficiently related to the theme: when the similarity exceeds it, the current scene switching frame is sufficiently related to the theme of the video and its content accurately reflects that theme, so the scene switching frame can be determined as a target frame.
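
A sketch of this selection step, assuming word vectors are looked up from a mapping and similarity is again the reciprocal of the Euclidean distance (the names and the epsilon guard are assumptions of this sketch):

```python
import numpy as np

def select_target_frames(switch_frames, scene_labels, topic_vec, word_vec, sim_threshold):
    """Keep the scene switching frames whose scene label is close enough to the
    topic label; scene_labels[i] is the label of switch_frames[i], and word_vec
    maps a label string to its word vector."""
    targets = []
    for frame, label in zip(switch_frames, scene_labels):
        distance = float(np.linalg.norm(word_vec[label] - topic_vec))
        similarity = 1.0 / (distance + 1e-6)
        if similarity > sim_threshold:
            targets.append(frame)
    return targets
```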

In this embodiment, the target frames selected from the scene switching frames are all closely related to the theme of the video, so a video summary of the video can be generated based on the target frames. Specifically, the video summary may be generated by arranging the target frames in the order in which they appear in the video. Alternatively, considering that the content shown in a video summary does not need to preserve the normal logic between consecutive frames, the target frames can also be arranged in random order, and the arranged sequence of target frames serves as the video summary of the video.

In one embodiment of the present application, considering that the scene label of a scene switching frame is usually set for the overall content of the frame, it cannot accurately reflect local details within the frame. In order to further improve the relevance between the target frames and the theme of the video, the target objects contained in the scene switching frames can be recognized, and the target frames can be selected on the basis of the recognized target objects. Specifically, after the similarity between the scene label of each scene switching frame and the topic label is computed, a weight coefficient can be set for the corresponding scene switching frame according to the computed similarity: the higher the similarity between the scene label and the topic label, the larger the weight coefficient set for the corresponding scene switching frame. The weight coefficient may be a value between 0 and 1. For example, if the topic label of the current video is "square dance", then for two scene switching frames whose scene labels are "dance" and "kung fu", the weight coefficient set for the frame labeled "dance" may be 0.8, while that for the frame labeled "kung fu" may be 0.4.

In this embodiment, after the weight coefficient is set for each scene switching frame, the target objects contained in the scene switching frame can be recognized. Specifically, an AdaBoost algorithm, an R-CNN (Region-based Convolutional Neural Network) algorithm, or an SSD (Single Shot Detector) algorithm can be used to detect the target objects contained in the scene switching frame. For example, for a scene switching frame whose scene label is "dance", the R-CNN algorithm may recognize two kinds of target objects in the frame: "woman" and "loudspeaker". After the target objects contained in each scene switching frame are recognized, an association value can be set for the scene switching frame according to the association between the recognized target objects and the topic label. Specifically, the topic label can be associated with at least one object, namely objects closely related to the topic label; this set of associated objects can be obtained by analyzing historical data. For example, when the topic label is "beach", its associated objects may include "sea water", "beach", "seagull", "swimsuit", "parasol", and so on. The target objects recognized in the scene switching frame can then be compared with the associated objects, and the number of target objects that appear among the associated objects is counted. For the topic label "beach", for instance, if the target objects recognized in a scene switching frame are "parasol", "car", "beach", "trees", and "sea water", the comparison shows that "parasol", "beach", and "sea water" appear among the associated objects, i.e. the number of such target objects is 3. The product of this count and a specified value can then be used as the association value of the scene switching frame. The specified value may be preset, for example 10, in which case the association value of the scene switching frame in the above example is 30. The more of the recognized target objects appear among the associated objects, the more closely the local details of the scene switching frame are related to the theme of the video, and the higher the corresponding association value.
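
A minimal sketch of the association value computed here, assuming the detector output and the topic-associated objects are given as plain lists of object names (the function name and argument layout are assumptions of this sketch):

```python
def association_value(detected_objects, topic_objects, unit_value: float = 10.0) -> float:
    """Count how many detected target objects appear among the objects associated
    with the topic label, then multiply by a preset unit value (10 in the example)."""
    hits = len(set(detected_objects) & set(topic_objects))
    return hits * unit_value
```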

In this embodiment, when determining the target frames, the decision can be made based on both the overall feature and the local features of a scene switching frame. Specifically, the product of each scene switching frame's weight coefficient and association value can be computed, and the scene switching frames whose product is greater than a specified product threshold are determined as the target frames. Using the product as the decision criterion combines the overall feature and the local features of the scene switching frame. The specified product threshold is the bar for deciding whether a scene switching frame is a target frame and can be adjusted flexibly in actual application scenarios.
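
The product-based selection could then be sketched as follows (all names are illustrative):

```python
def select_by_product(switch_frames, weights, assoc_values, product_threshold):
    """Keep the scene switching frames whose weight-coefficient x association-value
    product exceeds the specified product threshold."""
    return [
        frame
        for frame, w, a in zip(switch_frames, weights, assoc_values)
        if w * a > product_threshold
    ]
```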

In one embodiment of the present application, in some scenarios the total number of frames (or the total duration) of the video summary may be limited in advance. In this case, the pre-limited total number of frames also needs to be taken into account when determining the target frames. Specifically, when the total number of scene switching frames is greater than or equal to the specified total number of frames, enough frames can be drawn from the scene switching frames to form the video summary. In this case, based on the product of the weight coefficient and the association value computed for each scene switching frame in the above embodiment, the scene switching frames can be sorted in descending order of the product, and the top "specified total number of frames" scene switching frames in the ranking are determined as the target frames. For example, suppose the total number of frames in the video summary is limited to 1440 and 2000 scene switching frames are extracted from the video. Then the product of the weight coefficient and the association value can be computed for each scene switching frame, the frames are sorted in descending order of the product, and the top 1440 scene switching frames are taken as the target frames, so that these 1440 target frames form a video summary that meets the requirement.
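
A sketch of this top-N selection under a frame budget, assuming each scene switching frame already has its weight x association-value score (names are illustrative):

```python
def select_top_frames(switch_frames, scores, max_frames: int):
    """When enough scene switching frames are available, sort them by their
    weight x association-value score and keep the top `max_frames` of them."""
    ranked = sorted(zip(switch_frames, scores), key=lambda pair: pair[1], reverse=True)
    return [frame for frame, _ in ranked[:max_frames]]
```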

In this embodiment, when the total number of scene switching frames is less than the specified total number of frames, all the currently extracted scene switching frames together are not enough to form a video summary that meets the requirement. In this case, a certain number of frames from the original video need to be inserted between the extracted scene switching frames so that the total number of frames required for the video summary is reached. Specifically, frames from the original video can be inserted between two scene switching frames where the scene jump is large, which helps keep the content coherent. At least one video frame of the video can thus be inserted between two adjacent scene switching frames whose similarity is less than a judgment threshold; two such adjacent scene switching frames can be regarded as having weak content relevance to each other. Frames from the original video can be inserted between these weakly related scene switching frames one by one, until the total number of frames after insertion equals the specified total number of frames. In this way, the original scene switching frames together with the inserted frames all serve as the target frames, forming the video summary of the video.
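
A sketch of this padding step, working on frame indices rather than decoded frames; the data layout (switching-frame indices plus the pairwise similarities between adjacent switching frames) is an assumption of the sketch:

```python
def pad_with_original_frames(switch_indices, similarities, judgment_threshold, required_total):
    """Pad the scene switching frames up to the required total number of frames by
    inserting original-video frame indices between adjacent switching frames whose
    similarity is below the judgment threshold (i.e. where the scene jump is large).

    switch_indices are the positions of the scene switching frames in the video and
    similarities[i] is the similarity between switching frames i and i+1.
    Returns the sorted frame indices that make up the summary."""
    selected = list(switch_indices)
    needed = required_total - len(selected)
    for i, sim in enumerate(similarities):
        if needed <= 0:
            break
        if sim < judgment_threshold:
            # walk frame by frame through the gap between the two switching frames
            for idx in range(switch_indices[i] + 1, switch_indices[i + 1]):
                if needed <= 0:
                    break
                selected.append(idx)
                needed -= 1
    return sorted(selected)
```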

In one embodiment of the present application, the number of topic labels extracted from the text description information of the video may be at least two. In this case, for each scene switching frame, the similarity between the frame's scene label and each of the topic labels can be computed. For example, if the current topic labels are label 1 and label 2, the similarities between the current scene switching frame and label 1 and label 2 can be computed respectively, giving a first similarity and a second similarity for the current scene switching frame. After the similarities for a scene switching frame are computed, they can be accumulated to obtain the accumulated similarity of that frame; for example, the sum of the above first and second similarities serves as the accumulated similarity of the current scene switching frame. After the accumulated similarity of each scene switching frame is computed, it can likewise be compared with the specified similarity threshold, and the scene switching frames whose accumulated similarity is greater than the specified similarity threshold are determined as the target frames.
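
A small sketch of the accumulated similarity over several topic labels, reusing the reciprocal-of-distance similarity from above (the epsilon guard is an assumption of this sketch):

```python
import numpy as np

def accumulated_similarity(scene_vec: np.ndarray, topic_vecs) -> float:
    """Sum the reciprocal-of-distance similarities between one scene label's word
    vector and the word vectors of all topic labels."""
    return sum(1.0 / (float(np.linalg.norm(scene_vec - t)) + 1e-6) for t in topic_vecs)
```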

Referring to FIG. 5, the present application further provides a device for generating a video summary, where the video has text description information. The device includes: a scene switching frame extraction unit 100, configured to extract a plurality of scene switching frames from the video and set a scene label for each scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extraction unit 200, configured to extract a topic label corresponding to the video from the text description information; and a video summary generation unit 300, configured to select target frames from the plurality of scene switching frames based on the association between the scene labels of the scene switching frames and the topic label, and to generate a video summary of the video based on the target frames.

In this embodiment, the scene switching frame extraction unit 100 includes: a similarity calculation module, configured to determine a reference frame in the video and compute, in turn, the similarity between each frame after the reference frame and the reference frame; a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and a loop execution module, configured to take the current frame as a new reference frame and compute, in turn, the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame from the computed similarity.

In this embodiment, the scene switching frame extraction unit 100 includes: a feature extraction module, configured to extract features of the scene switching frame, where the features include at least one of a color feature, a texture feature, and a shape feature; a comparison module, configured to compare the extracted features with the feature samples in a feature sample library, where each feature sample in the library is associated with a text label; and a target feature sample determination module, configured to determine the target feature sample in the library that is most similar to the extracted features and use the text label associated with the target feature sample as the scene label of the scene switching frame.

In this embodiment, the video summary generation unit 300 includes: a similarity calculation module, configured to compute the similarity between the scene label of the scene switching frame and the topic label; a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the computed similarity; an association value setting module, configured to recognize the target objects contained in the scene switching frame and set an association value for the scene switching frame according to the association between the recognized target objects and the topic label; and a target frame determination module, configured to compute the product of the weight coefficient and the association value of the scene switching frame and determine the scene switching frames whose product is greater than a specified product threshold as the target frames.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Those skilled in the art also know that, in addition to implementing a device purely as computer-readable program code, it is entirely possible to logically program the method steps so that the device realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a device can therefore be regarded as a hardware component, and the means included in it for realizing various functions can also be regarded as structures within the hardware component. Or, the means for realizing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.

It can be seen from the above that this application can first extract, from the video, scene switching frames whose similarity meets a specified condition, and set corresponding scene labels for the scene switching frames. The topic tag of the video can then be determined from the text description information of the video; this topic tag can accurately represent the subject of the video. Next, by determining the relevance between the scene labels and the topic tag, the target frames most closely related to the subject can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject matter of the video.
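
Putting the steps together, the overall flow reads as a short pipeline. The sketch below only wires the stages; each stage is passed in as a callable so that any concrete feature extractor, tagger, or selector (for example, the sketches earlier in this description) can be plugged in, and the final encoding of the target frames into a summary video is left out of scope. All function and parameter names here are illustrative assumptions.

```python
from typing import Callable

def build_video_summary(frames: list,
                        description_text: str,
                        extract_scene_switch_frames: Callable[[list], list],
                        assign_scene_label: Callable[[object], str],
                        extract_topic_tags: Callable[[str], list[str]],
                        select_target_frames: Callable[[list, list[str]], list]) -> list:
    """Orchestrates the whole flow; each step is injected as a callable so the
    sketch stays independent of any particular feature or tagging implementation."""
    # 1. Scene switching frames whose mutual similarity meets the specified condition.
    switches = extract_scene_switch_frames(frames)
    # 2. A scene label for every scene switching frame.
    labelled = [(frame, assign_scene_label(frame)) for frame in switches]
    # 3. Topic tags derived from the video's text description (title / synopsis).
    topic_tags = extract_topic_tags(description_text)
    # 4. Keep only the frames whose labels relate closely to the topic tags.
    targets = select_target_frames(labelled, topic_tags)
    # 5. The ordered target frames form the video summary (encoding is out of scope here).
    return targets
```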

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, the improvement of many of today's method flows can be regarded as a direct improvement of a hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs, on his or her own, to "integrate" a digital system on a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this kind of programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of this application, or in certain parts of the embodiments.

The various embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiments can be explained with reference to the introduction of the foregoing method embodiments.

Although this application has been described through embodiments, those of ordinary skill in the art will appreciate that this application has many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this application.

S1, S3, S5‧‧‧Steps

Claims (14)

1. A method for generating a video summary, the video having text description information, the method comprising: extracting a plurality of scene switching frames from the video, and setting scene labels for the scene switching frames, wherein the similarity between two adjacent scene switching frames meets a specified condition; extracting a topic tag corresponding to the video from the text description information; selecting target frames from the plurality of scene switching frames according to the relevance between the scene labels of the scene switching frames and the topic tag, comprising: calculating the similarity between the scene label of a scene switching frame and the topic tag; setting a weight coefficient for the corresponding scene switching frame according to the calculated similarity; identifying a target object contained in the scene switching frame, and setting an associated value for the scene switching frame according to the relevance between the identified target object and the topic tag; and calculating the product of the weight coefficient and the associated value of the scene switching frame, and determining a scene switching frame whose product is greater than a specified product threshold as a target frame; and generating a video summary of the video based on the target frames.

2. The method according to claim 1, wherein extracting a plurality of scene switching frames from the video comprises: determining a reference frame in the video, and calculating, in sequence, the similarity between each frame after the reference frame and the reference frame; when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, determining the current frame as a scene switching frame; and taking the current frame as a new reference frame, and calculating, in sequence, the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.

3. The method according to claim 2, wherein the similarity between two adjacent scene switching frames meeting a specified condition comprises: the similarity between two adjacent scene switching frames being less than or equal to the specified threshold.
4. The method according to claim 2, wherein calculating the similarity between a frame after the reference frame and the reference frame comprises: extracting a first feature vector of the reference frame and a second feature vector of the current frame, respectively, wherein the first feature vector and the second feature vector respectively represent scale-invariant features of the reference frame and the current frame; and calculating the spatial distance between the first feature vector and the second feature vector, and taking the reciprocal of the spatial distance as the similarity between the reference frame and the current frame.

5. The method according to claim 1, wherein setting a scene label for a scene switching frame comprises: extracting features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature; comparing the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are associated with text labels; and determining the target feature sample in the feature sample library that is most similar to the extracted features, and taking the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.

6. The method according to claim 1, wherein the text description information includes a title and/or a brief introduction of the video; and correspondingly, extracting the topic tag corresponding to the video from the text description information comprises: matching the text description information with text labels in a text label library, and taking the matched text label as the topic tag of the video.

7. The method according to claim 6, wherein the text labels in the text label library are associated with statistical counts, the statistical count being used to represent the total number of times the text label has served as a topic tag; and correspondingly, when the number of matched text labels is at least two, the method further comprises: sorting the matched text labels in descending order of their statistical counts, and taking a specified number of top-ranked text labels in the sorting result as the topic tags of the video.

8. The method according to claim 1, wherein the topic tag is associated with at least one object; and correspondingly, setting an associated value for the scene switching frame comprises: comparing the target object identified in the scene switching frame with the at least one object, and counting the number of target objects that appear among the at least one object; and taking the product of the counted number and a specified value as the associated value of the scene switching frame.
9. The method according to claim 1, wherein the video summary of the video has a specified total number of frames; and correspondingly, after calculating the product of the weight coefficient and the associated value of the scene switching frame, the method further comprises: when the total number of scene switching frames is greater than or equal to the specified total number of frames, sorting the scene switching frames in descending order of the product, and determining the specified total number of top-ranked scene switching frames in the sorting result as the target frames.

10. The method according to claim 9, wherein the method further comprises: when the total number of scene switching frames is less than the specified total number of frames, inserting at least one video frame of the video between two adjacent scene switching frames whose similarity is less than a determination threshold, so that the total number of scene switching frames after the at least one video frame is inserted equals the specified total number of frames.

11. The method according to claim 1, wherein, when the number of topic tags is at least two, selecting target frames from the plurality of scene switching frames comprises: for each scene switching frame, calculating the similarity between the scene label of the scene switching frame and each topic tag; accumulating the similarities calculated for the scene switching frame to obtain a cumulative similarity corresponding to the scene switching frame; and determining a scene switching frame whose cumulative similarity is greater than a specified similarity threshold as a target frame.
12. A device for generating a video summary, the video having text description information, the device comprising: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and to set scene labels for the scene switching frames, wherein the similarity between two adjacent scene switching frames meets a specified condition; a topic tag extraction unit, configured to extract a topic tag corresponding to the video from the text description information; and a video summary generation unit, configured to select target frames from the plurality of scene switching frames according to the relevance between the scene labels of the scene switching frames and the topic tag, and to generate a video summary of the video based on the target frames; wherein the video summary generation unit comprises: a similarity calculation module, configured to calculate the similarity between the scene label of a scene switching frame and the topic tag; a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity; an associated value setting module, configured to identify a target object contained in the scene switching frame and to set an associated value for the scene switching frame according to the relevance between the identified target object and the topic tag; and a target frame determination module, configured to calculate the product of the weight coefficient and the associated value of the scene switching frame, and to determine a scene switching frame whose product is greater than a specified product threshold as the target frame.

13. The device according to claim 12, wherein the scene switching frame extraction unit comprises: a similarity calculation module, configured to determine a reference frame in the video and to calculate, in sequence, the similarity between each frame after the reference frame and the reference frame; a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and a loop execution module, configured to take the current frame as a new reference frame and to calculate, in sequence, the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
14. The device according to claim 12, wherein the scene switching frame extraction unit comprises: a feature extraction module, configured to extract features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature; a comparison module, configured to compare the extracted features with feature samples in a feature sample library, wherein each feature sample in the feature sample library is associated with a text label; and a target feature sample determination module, configured to determine the target feature sample in the feature sample library that is most similar to the extracted features, and to take the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
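
Two quantitative points in the claims above lend themselves to a concrete illustration: the claim-4 similarity measure (the reciprocal of the spatial distance between scale-invariant feature vectors) and the claim-9/claim-10 control of the summary length (keep the top-ranked frames when there are too many candidates, pad with ordinary video frames between dissimilar neighbours when there are too few). The sketch below is illustrative only: the Euclidean distance, the midpoint padding strategy, and every function name are assumptions rather than text taken from the claims.

```python
import numpy as np

def reciprocal_distance_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Claim-4 style similarity: the reciprocal of the spatial (here Euclidean)
    distance between two scale-invariant feature vectors."""
    distance = float(np.linalg.norm(feat_a - feat_b))
    return float("inf") if distance == 0.0 else 1.0 / distance

def fit_summary_length(scored_frames, target_count, all_frames,
                       pad_threshold=0.3, similarity=None):
    """scored_frames: list of (index, product_score); all_frames: the full frame list.
    Returns the frame indices of the summary, trimmed or padded toward target_count."""
    ranked = sorted(scored_frames, key=lambda item: item[1], reverse=True)
    if len(ranked) >= target_count:
        # Claim 9: keep only the top-ranked scene switching frames.
        chosen = sorted(idx for idx, _ in ranked[:target_count])
    else:
        # Claim 10: pad by inserting ordinary video frames between adjacent,
        # sufficiently dissimilar scene switching frames until the count fits.
        chosen = sorted(idx for idx, _ in ranked)
        pairs = list(zip(chosen, chosen[1:]))
        for left, right in pairs:
            if len(chosen) >= target_count:
                break
            if similarity and similarity(all_frames[left], all_frames[right]) < pad_threshold:
                chosen.append((left + right) // 2)   # a frame from the middle of the gap
        chosen = sorted(set(chosen))[:target_count]
    return chosen
```
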
TW107103624A 2017-07-05 2018-02-01 Method and device for generating video summary TWI712316B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
??201710541793.1 2017-07-05
CN201710541793.1 2017-07-05
CN201710541793.1A CN109213895A (en) 2017-07-05 2017-07-05 A kind of generation method and device of video frequency abstract

Publications (2)

Publication Number Publication Date
TW201907736A TW201907736A (en) 2019-02-16
TWI712316B true TWI712316B (en) 2020-12-01

Family

ID=64949707

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107103624A TWI712316B (en) 2017-07-05 2018-02-01 Method and device for generating video summary

Country Status (3)

Country Link
CN (1) CN109213895A (en)
TW (1) TWI712316B (en)
WO (1) WO2019007020A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms
CN110263650B (en) * 2019-05-22 2022-02-22 北京奇艺世纪科技有限公司 Behavior class detection method and device, electronic equipment and computer readable medium
CN110298270B (en) * 2019-06-14 2021-12-31 天津大学 Multi-video abstraction method based on cross-modal importance perception
CN110149531A (en) * 2019-06-17 2019-08-20 北京影谱科技股份有限公司 The method and apparatus of video scene in a kind of identification video data
CN112153462B (en) * 2019-06-26 2023-02-14 腾讯科技(深圳)有限公司 Video processing method, device, terminal and storage medium
CN110297943B (en) * 2019-07-05 2022-07-26 联想(北京)有限公司 Label adding method and device, electronic equipment and storage medium
CN111275097B (en) * 2020-01-17 2021-06-18 北京世纪好未来教育科技有限公司 Video processing method and system, picture processing method and system, equipment and medium
TWI741550B (en) * 2020-03-31 2021-10-01 國立雲林科技大學 Method for bookmark frame generation, and video player device with automatic generation of bookmark and user interface thereof
CN111641868A (en) * 2020-05-27 2020-09-08 维沃移动通信有限公司 Preview video generation method and device and electronic equipment
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200424877A (en) * 2002-12-11 2004-11-16 Koninkl Philips Electronics Nv Method and system for utilizing video content to obtain text keywords or phrases for providing content related links to network-based resources
TW201251443A (en) * 2011-05-18 2012-12-16 Eastman Kodak Co Video summary including a feature of interest
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100559376C (en) * 2008-06-30 2009-11-11 腾讯科技(深圳)有限公司 Generate method, system and the equipment of video frequency abstract
US8705933B2 (en) * 2009-09-25 2014-04-22 Sony Corporation Video bookmarking
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN103810711A (en) * 2014-03-03 2014-05-21 郑州日兴电子科技有限公司 Keyframe extracting method and system for monitoring system videos
CN106921891B (en) * 2015-12-24 2020-02-11 北京奇虎科技有限公司 Method and device for displaying video characteristic information
CN105868292A (en) * 2016-03-23 2016-08-17 中山大学 Video visualization processing method and system
CN106713964A (en) * 2016-12-05 2017-05-24 乐视控股(北京)有限公司 Method of generating video abstract viewpoint graph and apparatus thereof

Also Published As

Publication number Publication date
TW201907736A (en) 2019-02-16
WO2019007020A1 (en) 2019-01-10
CN109213895A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
TWI712316B (en) Method and device for generating video summary
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
CN104994426B (en) Program video identification method and system
JP2012523641A (en) Keyframe extraction for video content analysis
Thomas et al. Perceptual video summarization—A new framework for video summarization
CN102156686B (en) Method for detecting specific contained semantics of video based on grouped multi-instance learning model
CN113779303B (en) Video set indexing method and device, storage medium and electronic equipment
CN111209897A (en) Video processing method, device and storage medium
Bhalla et al. A multimodal approach for automatic cricket video summarization
Husa et al. HOST-ATS: automatic thumbnail selection with dashboard-controlled ML pipeline and dynamic user survey
CN110765314A (en) Video semantic structural extraction and labeling method
JP5116017B2 (en) Video search method and system
EP2345978B1 (en) Detection of flash illuminated scenes in video clips and related ranking of video clips
Premaratne et al. Structural approach for event resolution in cricket videos
CN105120335A (en) A method and apparatus for processing television program pictures
Khan et al. RICAPS: residual inception and cascaded capsule network for broadcast sports video classification
Himeur et al. A fast and robust key-frames based video copy detection using BSIF-RMI
Kamde et al. Entropy supported video indexing for content based video retrieval
Sasithradevi et al. Content based video retrieval via object based approach
Farhat et al. New approach for automatic view detection system in tennis video
Liu Classification of videos based on deep learning
Fan et al. Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction
Saha et al. Cricket Highlight Generation: Automatic Generation Framework Comprising Score Extraction and Action Recognition
Shih et al. Detection of the highlights in baseball video program
Premaratne et al. A Novel Hybrid Adaptive Filter to Improve Video Keyframe Clustering to Support Event Resolution in Cricket Videos

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees