TWI704805B - Video editing method and device - Google Patents

Video editing method and device

Info

Publication number
TWI704805B
TWI704805B
Authority
TW
Taiwan
Prior art keywords
video
sub
target sub
segment
segments
Prior art date
Application number
TW108117520A
Other languages
Chinese (zh)
Other versions
TW202041037A (en)
Inventor
楊正大
張京
巫宗翰
張哲豪
Original Assignee
麥奇數位股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 麥奇數位股份有限公司
Application granted
Publication of TWI704805B
Publication of TW202041037A

Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract

A video editing method is implemented by a video editing device that stores a video featuring a speaker. The method comprises the following steps: (A) obtaining a plurality of target sub-video segments based on the audio of the video; (B) ranking the target sub-video segments according to their content; (C) selecting, based on the ranking of the target sub-video segments, a plurality of sub-video segments to be synthesized from the target sub-video segments; and (D) synthesizing the selected sub-video segments to produce a composite video. The present invention also provides a video editing device.

Description

Video editing method and device

The present invention relates to a video editing method, and more particularly to a video editing method for editing lecture or instructional videos.

With the advent of the digital age, videos can be stored, transmitted, and circulated more conveniently. Many audio-visual platforms therefore offer lecture and instructional videos for the public to watch and learn from.

However, the content of a complete lecture or instructional video naturally rises and falls, and an overly long video tends to dull the viewer's interest. To locate the speaker's highlights and key speaking moments and extract the essence of the video, an editor usually has to spend a long time screening for highlight clips and then post-producing them into a highlight video, which is very time-consuming. Moreover, because the selected highlights reflect the editor's one-sided subjective judgment, other key highlights may be missed, so the highlight video cannot be presented objectively.

Therefore, an object of the present invention is to provide a video editing method that improves video editing efficiency and presents highlights objectively.

Accordingly, the video editing method of the present invention is implemented by a video editing device that stores a video featuring a speaker, and comprises a step (A), a step (B), a step (C), and a step (D).

In step (A), the video editing device obtains a plurality of target sub-video segments based on the audio of the video.

In step (B), the video editing device ranks the target sub-video segments according to their content.

In step (C), the video editing device selects, based on the ranking of the target sub-video segments, a plurality of sub-video segments to be synthesized from the target sub-video segments.

In step (D), the video editing device synthesizes the selected sub-video segments to produce a composite video.

Another object of the present invention is to provide a video editing device that improves video editing efficiency and presents highlights objectively.

Accordingly, the video editing device comprises a storage unit and a processing unit.

The storage unit stores a video featuring a speaker.

The processing unit is electrically connected to the storage unit. It obtains a plurality of target sub-video segments based on the audio of the video, ranks the target sub-video segments according to their content, then selects, based on that ranking, a plurality of sub-video segments to be synthesized from the target sub-video segments, and finally synthesizes the selected sub-video segments to produce a composite video.

The effect of the present invention is that the video editing device ranks the target sub-video segments according to their content and, based on that ranking, selects the sub-video segments to be synthesized from the target sub-video segments, thereby improving video editing efficiency and presenting the highlight video objectively.

Before the present invention is described in detail, it should be noted that in the following description, similar elements are denoted by the same reference numerals.

Referring to FIG. 1, an embodiment of the video editing device 100 of the present invention comprises a storage unit 11 and a processing unit 12 electrically connected to the storage unit 11. The storage unit 11 stores a video featuring a speaker.

Referring to FIGS. 1 and 2, the following describes how the video editing device 100 of the present invention executes an embodiment of the video editing method of the present invention.

In step 201, the processing unit 12 filters out sound outside a predetermined frequency interval in the video. Note that in this embodiment the predetermined frequency interval is, for example, 500 Hz to 2000 Hz; sound outside this interval is treated as non-human-voice frequency and removed to eliminate background sound and noise, but the invention is not limited thereto.
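For illustration only, a minimal band-pass sketch of this filtering step might look like the following, assuming the audio has been decoded to a mono float array `samples` at sample rate `sr` (the names are hypothetical; the patent specifies only the 500 Hz to 2000 Hz example interval, not an implementation):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_voice(samples: np.ndarray, sr: int,
                   low_hz: float = 500.0, high_hz: float = 2000.0) -> np.ndarray:
    """Keep only energy inside the predetermined frequency interval."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    # Zero-phase filtering, so the speech timings used by later steps are preserved.
    return sosfiltfilt(sos, samples)
```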

In step 202, the processing unit 12 divides the video into a plurality of video segments based on its audio. In this embodiment, the processing unit 12 performs Voice Activity Detection on the audio and splits the video without cutting through a complete speech passage; the main rule is that when the time interval between sound waves is less than a predetermined time (for example, 3 seconds), they represent the same utterance and are treated as the same video segment.
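A minimal sketch of this grouping rule, assuming an off-the-shelf VAD has already produced (start, end) speech intervals in seconds (the interval format is an assumption; the 3-second gap mirrors the example above):

```python
def merge_speech_intervals(intervals, gap=3.0):
    """Merge speech intervals separated by less than `gap` into one video segment."""
    segments = []
    for start, end in sorted(intervals):
        if segments and start - segments[-1][1] < gap:
            segments[-1][1] = max(segments[-1][1], end)  # same utterance
        else:
            segments.append([start, end])                # new video segment
    return [tuple(s) for s in segments]
```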

In step 203, the processing unit 12 screens video segments of interest out of the video segments based on their audio, where each screened video segment of interest has a length greater than a first predetermined period (for example, 6 seconds).

In step 204, for each video segment of interest, the processing unit 12 performs speech recognition on the segment to obtain a text file.

In step 205, for each video segment of interest, the processing unit 12 divides the segment, based on its corresponding text file, into a plurality of candidate sub-video segments that each contain complete sentences and are shorter than a second predetermined period (for example, 30 seconds), each candidate sub-video segment corresponding to a sub-text file. Note that in this embodiment the processing unit 12 uses Natural Language Processing to segment the text file into words in order to obtain the candidate sub-video segments; in other embodiments, a Bi-LSTM-CRF model or a deep learning model may be used to segment the text file instead, and the invention is not limited thereto.
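As a sketch of how the text-guided split might work, assume the recognizer returned sentences as (text, start, end) tuples; sentences are then packed into candidate sub-segments that never cut a sentence and stay under the 30-second example limit (the tuple format is an assumption):

```python
def split_into_candidates(sentences, max_len=30.0):
    candidates, current = [], []
    for text, start, end in sentences:
        # Close the current sub-segment before it would exceed max_len seconds.
        if current and end - current[0][1] > max_len:
            candidates.append(current)
            current = []
        current.append((text, start, end))
    if current:
        candidates.append(current)
    return candidates  # each item is one candidate sub-video segment
```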

Referring also to FIG. 3, the waveform shows that the processing unit 12 divides the video into a plurality of video segments A, B, and C based on its audio: the time intervals between the sound waves of segments A, B, and C are greater than or equal to the predetermined time, and the lengths of segments A, B, and C all exceed the first predetermined period, so all three are video segments of interest. The processing unit 12 then divides the video segment of interest A into candidate sub-video segments A1, A2, and A3 based on its corresponding text file; divides the video segment of interest B into candidate sub-video segments B1, B2, and B3 based on its corresponding text file; and divides the video segment of interest C into candidate sub-video segments C1 and C2 based on its corresponding text file.

In step 206, the target sub-video segments are screened out of the candidate sub-video segments corresponding to the video segments of interest. Referring also to FIG. 4, step 206 includes sub-steps 61 and 62, described below.

In step 61, for each candidate sub-video segment, the processing unit 12 examines the corresponding sub-text file and deletes any candidate sub-video segment whose sub-text file contains a word repeated consecutively a predetermined number of times (for example, 3 times).
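A minimal sketch of this rule, assuming each candidate is a dict with a "text" transcript field (the field name is hypothetical):

```python
def has_consecutive_repeat(transcript: str, limit: int = 3) -> bool:
    """True if any word repeats `limit` or more times in a row."""
    words, run = transcript.split(), 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= limit:
            return True
    return False

def drop_stuttered(candidates, limit=3):
    return [c for c in candidates if not has_consecutive_repeat(c["text"], limit)]
```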

In step 62, for each remaining candidate sub-video segment, the processing unit 12 deletes any candidate sub-video segment whose loudness exceeds a predetermined decibel level (for example, 90 dB).
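A sketch of the loudness screen, with the caveat that true sound-pressure dB cannot be recovered from samples alone; this assumes each candidate's audio array is calibrated so that `ref` corresponds to 0 dB (field names are hypothetical):

```python
import numpy as np

def loudness_db(samples: np.ndarray, ref: float = 1.0) -> float:
    rms = np.sqrt(np.mean(np.square(samples)))
    return 20.0 * np.log10(max(rms / ref, 1e-12))  # guard against silence

def drop_too_loud(candidates, max_db=90.0):
    return [c for c in candidates if loudness_db(c["audio"]) <= max_db]
```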

Note in particular that in this embodiment step 61 precedes step 62; in other embodiments step 62 may precede step 61, and the invention is not limited thereto.

In step 207, the processing unit 12 ranks the target sub-video segments according to their content. Referring also to FIG. 5, step 207 includes sub-steps 71 to 80, described below.

In step 71, for each image of the target sub-video segments, the processing unit 12 obtains a plurality of first facial feature points of the speaker in the image (for example, the eyes, nose, mouth, left sideburn, and right sideburn) to determine the position range of the speaker's face in the image. Note that in this embodiment the processing unit 12 uses the open-source OpenCV library as the tool for extracting the first facial feature points, computes the angle and range of the face from those points, and is trained on a large amount of data beforehand to improve its accuracy, but the invention is not limited thereto.
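The patent names OpenCV but not a specific detector. As a simplified sketch, OpenCV's stock Haar-cascade face detector can recover the face's position range as a bounding box (extracting individual feature points would require a separate landmark model, omitted here):

```python
import cv2

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_box(frame):
    """Return (x, y, w, h) of the largest detected face in a BGR frame, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    return max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
```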

In step 72, for each target sub-video segment, the processing unit 12 determines, from all of the segment's first facial feature points, a face position state of the speaker's face in that segment, where the face position state indicates either a centered state or a non-centered state. Note that in this embodiment, for each image, the processing unit 12 regards the image as face-centered when the length and width of the speaker's face each occupy a proportion of the image within a predetermined range (for example, 40% to 70%), and the distance from the face's position range to each edge of the image, as a proportion of the image, is greater than or equal to a preset value (for example, (100% − the average of the face's length and width proportions) × k, where 0 < k < 1). For each target sub-video segment, when the number of frames regarded as face-centered exceeds the number of frames regarded as non-centered, the segment's face position state indicates the centered state, but the invention is not limited thereto.

Referring also to FIG. 6, for example, consider an image of length X and width Y in which the speaker's face lies at distance x1 from the left edge, x3 from the right edge, y1 from the top edge, and y3 from the bottom edge of the image, with the face occupying length x2 and width y2 in the image. When x2/X and y2/Y fall within the predetermined range, and x1/X, x3/X, y1/Y, and y3/Y are all greater than or equal to the preset value, the processing unit 12 regards the image as face-centered.
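A hedged reading of this test as code, assuming the face box (x, y, w, h) from the previous sketch; the exact scaling of the k factor in the preset-value formula is ambiguous in the text, so `k` here is simply a tunable fraction:

```python
def is_centered(box, X, Y, ratio_range=(0.40, 0.70), k=0.5):
    x, y, w, h = box
    lo, hi = ratio_range
    # Face length/width must each occupy 40%-70% of the image (the example range).
    if not (lo <= w / X <= hi and lo <= h / Y <= hi):
        return False
    # Preset value: (100% - average of the two size proportions) * k, 0 < k < 1.
    margin = (1.0 - (w / X + h / Y) / 2.0) * k
    edges = (x / X, (X - x - w) / X, y / Y, (Y - y - h) / Y)
    return all(e >= margin for e in edges)
```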

In step 73, the processing unit 12 sorts the target sub-video segments by their corresponding face position states. Note that in this embodiment the processing unit 12 divides the target sub-video segments into 2 groups: segments whose face position state indicates the centered state form one group and are ranked first, while segments whose face position state indicates the non-centered state form the other group and are ranked after them, as shown in Table 1 below.

Table 1 (order: first → last)
| Group | Face position state |
| 1 | Centered |
| 2 | Non-centered |

In step 74, for each image of the target sub-video segments, the processing unit 12 obtains a plurality of second facial feature points of the speaker in the target sub-video segment. Note that in this embodiment the processing unit 12 applies the concept of the Facial Action Coding System (FACS) and uses OpenCV to extract the second facial feature points, but the invention is not limited thereto.

In step 75, for each target sub-video segment, the processing unit 12 determines, from all of the segment's second facial feature points, an expression emotional state of the speaker, where the expression emotional state indicates one of a positive state, a general state, and a negative state. Note that in this embodiment, if the processing unit 12 recognizes from the second facial feature points that the speaker's eyes are both open and the corners of the mouth are raised, it determines that the expression emotional state indicates the positive state; if it recognizes that the eyes are both open and the corners of the mouth are level, it determines that the expression emotional state indicates the general state; and if it recognizes that the eyes are closed and the corners of the mouth turn downward, it determines that the expression emotional state indicates the negative state, but the invention is not limited thereto.
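A sketch of these rules as a pure decision function, assuming upstream landmark code has already reduced the second facial feature points to two per-segment cues: whether the eyes are open, and the vertical offset of the mouth corners relative to a level mouth (positive = raised); both cues and the threshold are illustrative assumptions:

```python
def expression_state(eyes_open: bool, mouth_corner_offset: float,
                     level_band: float = 0.02) -> str:
    if eyes_open and mouth_corner_offset > level_band:
        return "positive"   # eyes open, mouth corners raised
    if eyes_open and abs(mouth_corner_offset) <= level_band:
        return "general"    # eyes open, mouth corners level
    if not eyes_open and mouth_corner_offset < -level_band:
        return "negative"   # eyes closed, mouth corners down
    return "general"        # fallback for mixed cues
```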

In step 76, the processing unit 12 sorts the target sub-video segments by their corresponding expression emotional states. Note that in this embodiment the processing unit 12 divides the target sub-video segments into 5 groups, ranked in the order shown in Table 2 below: centered and positive; centered and general; non-centered and positive; non-centered and general; and finally any segment whose expression emotional state indicates the negative state.

Table 2 (order: first → last)
| Group | Face position state | Expression emotional state |
| 1 | Centered | Positive |
| 2 | Centered | General |
| 3 | Non-centered | Positive |
| 4 | Non-centered | General |
| 5 | Centered/Non-centered | Negative |

In step 77, for each image of the target sub-video segments, the processing unit 12 obtains a plurality of limb feature points of the speaker in the image.

In step 78, for each target sub-video segment, the processing unit 12 determines, from all of the segment's limb feature points, a limb emotional state of the speaker, where the limb emotional state indicates one of a positive state, a general state, and a negative state. Note that in this embodiment the processing unit 12 first determines the speaker's limb positions in each frame from all of the segment's limb feature points and then determines the limb emotional state from those positions. If the speaker raises both hands, raises one hand, or moves at normal speed, the processing unit 12 determines that the limb emotional state indicates the positive state; if the speaker's torso is tilted, shakes abnormally, or moves too fast, the processing unit 12 determines that the limb emotional state indicates the negative state; in all other cases the processing unit 12 determines that the limb emotional state indicates the general state. Raised hands, a raised hand, a tilted torso, and the like can be judged from the positions of major limb feature points (for example, the shoulders and elbows), while normal movement speed, excessive movement speed, abnormal shaking, and the like can be judged from the movement speed of specific limb feature points (for example, the torso), but the invention is not limited thereto.
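A partial sketch of these rules, covering only the raised-hand and torso cues; it assumes a pose estimator yields per-frame keypoints such as "wrist_l" or "torso" as (x, y) pixel pairs, and all keypoint names and thresholds are illustrative:

```python
def limb_state(frames, tilt_max=0.3, speed_max=50.0) -> str:
    # Image y grows downward, so a wrist above the shoulder has a smaller y.
    hands_up = any(f["wrist_l"][1] < f["shoulder_l"][1] or
                   f["wrist_r"][1] < f["shoulder_r"][1] for f in frames)
    tilts = [abs(f["shoulder_l"][1] - f["shoulder_r"][1]) /
             max(abs(f["shoulder_l"][0] - f["shoulder_r"][0]), 1.0)
             for f in frames]
    speeds = [abs(b["torso"][0] - a["torso"][0]) + abs(b["torso"][1] - a["torso"][1])
              for a, b in zip(frames, frames[1:])]
    if max(tilts, default=0.0) > tilt_max or max(speeds, default=0.0) > speed_max:
        return "negative"   # tilted torso, abnormal shaking, or moving too fast
    if hands_up:
        return "positive"   # raised hand(s)
    return "general"
```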

In step 79, the processing unit 12 sorts the target sub-video segments by their corresponding limb emotional states. Note that in this embodiment the processing unit 12 divides the target sub-video segments into 9 groups, ranked in the order shown in Table 3 below: the four centered groups come first, ordered by expression emotional state and then limb emotional state (positive before general); the four non-centered groups follow in the same internal order; and any segment whose expression emotional state or limb emotional state indicates the negative state is ranked last.

Table 3 (order: first → last)
| Group | Face position state | Expression emotional state | Limb emotional state |
| 1 | Centered | Positive | Positive |
| 2 | Centered | Positive | General |
| 3 | Centered | General | Positive |
| 4 | Centered | General | General |
| 5 | Non-centered | Positive | Positive |
| 6 | Non-centered | Positive | General |
| 7 | Non-centered | General | Positive |
| 8 | Non-centered | General | General |
| 9 | Centered/Non-centered | Negative in expression or limb state | — |

In step 80, the processing unit 12 sorts the target sub-video segments by the number of occurrences of at least one predetermined word (for example, keywords and synonyms) in each segment's corresponding sub-text file. Note that in this embodiment the processing unit 12 sorts each of the 9 groups separately: within a group, a target sub-video segment whose sub-text file contains more occurrences of the at least one predetermined word is ranked earlier, and ties are broken by video length, longer first, as shown in Table 4 below.

Table 4 (order: first → last)
Within each of the nine groups of Table 3, target sub-video segments are ordered from most to fewest occurrences of the predetermined word(s), with ties broken by video length (longer first).
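Because steps 73, 76, 79, and 80 all refine one ordering, the whole ranking can be expressed as a single composite sort key. A sketch under the assumption that each segment is a dict carrying the states determined above plus its keyword count and length (field names are hypothetical):

```python
_FACE = {"centered": 0, "non-centered": 1}
_EMO = {"positive": 0, "general": 1, "negative": 2}

def rank_segments(segments):
    def key(s):
        negative = s["expr"] == "negative" or s["limb"] == "negative"
        return (
            1 if negative else 0,   # Table 3, group 9: negative segments last
            _FACE[s["face"]],       # Table 1: centered before non-centered
            _EMO[s["expr"]],        # Table 2: expression state
            _EMO[s["limb"]],        # Table 3: limb state
            -s["keyword_count"],    # Table 4: more predetermined words first
            -s["length"],           # tie-break: longer video first
        )
    return sorted(segments, key=key)
```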

Note in particular that in other embodiments, steps 71 to 73 may follow steps 74 to 76 or steps 77 to 79, and steps 74 to 76 may follow steps 77 to 79; the invention is not limited thereto, and a different execution order yields a different ranking.

In step 208, the processing unit 12 selects, based on the ranking of the target sub-video segments, a plurality of sub-video segments to be synthesized from the target sub-video segments, where the total length of the selected segments is below a third predetermined period (for example, 60 seconds). Note that in this embodiment the selected sub-video segments each belong to a different video segment of interest, but the invention is not limited thereto.
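A greedy sketch of this selection, assuming each ranked segment records its parent video segment of interest and its length in seconds (field names are hypothetical; the 60-second budget mirrors the example above):

```python
def select_for_synthesis(ranked_segments, budget=60.0):
    chosen, used_parents, total = [], set(), 0.0
    for seg in ranked_segments:
        if seg["parent"] in used_parents:
            continue                        # one clip per segment of interest
        if total + seg["length"] >= budget:
            continue                        # keep the total below the budget
        chosen.append(seg)
        used_parents.add(seg["parent"])
        total += seg["length"]
    return chosen
```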

In step 209, the processing unit 12 synthesizes the selected sub-video segments to produce a composite video.
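The patent does not name a synthesis tool; as one plausible sketch, the moviepy library (1.x API assumed) can cut the chosen clips from the source video and concatenate them in chronological order:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def synthesize(source_path, chosen, out_path="highlights.mp4"):
    video = VideoFileClip(source_path)
    clips = [video.subclip(s["start"], s["end"])
             for s in sorted(chosen, key=lambda s: s["start"])]
    concatenate_videoclips(clips).write_videofile(out_path)
```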

In summary, in the video editing method and device of the present invention, the processing unit 12 of the video editing device 100 ranks the target sub-video segments according to their content and, based on that ranking, selects the sub-video segments to be synthesized from the target sub-video segments, thereby improving video editing efficiency and presenting the highlight video objectively; the invention therefore indeed achieves its objects.

The above, however, is merely an embodiment of the present invention and shall not limit the scope of its implementation; all simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by this patent.

100: Video editing device

11: Storage unit

12: Processing unit

201~209: Steps

61, 62: Steps

71~80: Steps

Other features and effects of the present invention will become clear in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram illustrating an embodiment of the video editing device of the present invention; FIG. 2 is a flowchart illustrating an embodiment of the video editing method of the present invention; FIG. 3 is a schematic diagram illustrating a video divided into a plurality of video segments; FIG. 4 is a flowchart assisting the description of sub-steps 61 and 62 of step 206 in FIG. 2; FIG. 5 is a flowchart assisting the description of sub-steps 71 to 80 of step 207 in FIG. 2; and FIG. 6 is a schematic diagram illustrating the determination of a face position state of an image of a target sub-video segment.

201~209: Steps

Claims (13)

1. A video editing method, implemented by a video editing device storing a video featuring a speaker, the method comprising the following steps:
(A) obtaining a plurality of target sub-video segments based on the audio of the video;
(B) ranking the target sub-video segments according to their content, wherein step (B) includes the following sub-steps:
(B-1) for each image of the target sub-video segments, obtaining a plurality of facial feature points of the speaker in the image,
(B-2) for each target sub-video segment, determining from all of the segment's facial feature points a face position state of the speaker's face in that segment, the face position state indicating a centered state or a non-centered state, and
(B-3) sorting the target sub-video segments by their corresponding face position states;
(C) selecting, based on the ranking of the target sub-video segments, a plurality of sub-video segments to be synthesized from the target sub-video segments; and
(D) synthesizing the selected sub-video segments to produce a composite video.

2. The video editing method of claim 1, wherein step (A) includes the following sub-steps:
(A-1) dividing the video into a plurality of video segments based on its audio;
(A-2) screening video segments of interest out of the video segments based on their audio;
(A-3) for each video segment of interest, dividing the segment into a plurality of candidate sub-video segments based on its audio; and
(A-4) screening the target sub-video segments out of the candidate sub-video segments corresponding to the video segments of interest.

3. The video editing method of claim 2, wherein in step (A-2), each screened video segment of interest has a length greater than a first predetermined period.

4. The video editing method of claim 2, wherein step (A-3) includes the following sub-steps:
(A-3-1) for each video segment of interest, performing speech recognition on the segment to obtain a text file; and
(A-3-2) for each video segment of interest, dividing the segment, based on its corresponding text file, into candidate sub-video segments that each contain complete sentences and are shorter than a second predetermined period, each candidate sub-video segment corresponding to a sub-text file.

5. The video editing method of claim 4, wherein in step (A-4), for each candidate sub-video segment, any candidate whose corresponding sub-text file contains a word repeated consecutively a predetermined number of times is deleted, so as to screen out the target sub-video segments.

6. The video editing method of claim 4, wherein in step (B), the target sub-video segments are sorted according to their corresponding sub-text files.

7. The video editing method of claim 6, wherein in step (B), the video editing device sorts the target sub-video segments by the number of occurrences of at least one predetermined word in each segment's corresponding sub-text file.

8. The video editing method of claim 4, wherein in step (A-4), any candidate sub-video segment whose loudness exceeds a predetermined decibel level is deleted, so as to screen out the target sub-video segments.

9. The video editing method of claim 1, further comprising, before step (A), the following step:
(G) filtering out sound outside a predetermined frequency interval in the video.

10. The video editing method of claim 1, wherein in step (C), the total length of the sub-video segments to be synthesized is below a third predetermined period.

11. The video editing method of claim 1, wherein step (B) further includes the following sub-steps:
(B-4) for each image of the target sub-video segments, obtaining a plurality of limb feature points of the speaker in the image;
(B-5) for each target sub-video segment, determining from all of the segment's limb feature points a limb emotional state of the speaker, the limb emotional state indicating one of a positive state, a general state, and a negative state; and
(B-6) sorting the target sub-video segments by their corresponding limb emotional states.

12. The video editing method of claim 1, wherein step (B) further includes the following sub-steps:
(B-4) for each image of the target sub-video segments, obtaining a plurality of facial feature points of the speaker in the target sub-video segment;
(B-5) for each target sub-video segment, determining from all of the segment's facial feature points an expression emotional state of the speaker, the expression emotional state indicating one of a positive state, a general state, and a negative state; and
(B-6) sorting the target sub-video segments by their corresponding expression emotional states.

13. A video editing device, comprising: a storage unit storing a video featuring a speaker; and a processing unit electrically connected to the storage unit, the processing unit obtaining a plurality of target sub-video segments based on the audio of the video and ranking the target sub-video segments according to their content, wherein for each image of the target sub-video segments the processing unit obtains a plurality of facial feature points of the speaker in the image, for each target sub-video segment determines from all of the segment's facial feature points a face position state of the speaker's face in that segment, the face position state indicating a centered state or a non-centered state, and sorts the target sub-video segments by their corresponding face position states, the processing unit then selecting, based on the ranking of the target sub-video segments, a plurality of sub-video segments to be synthesized from the target sub-video segments, and finally synthesizing the selected sub-video segments to produce a composite video.
TW108117520A 2019-04-16 2019-05-21 Video editing method and device TWI704805B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910305049.0A CN109889920B (en) 2019-04-16 2019-04-16 Network course video editing method, system, equipment and storage medium
CN201910305049.0 2019-04-16

Publications (2)

Publication Number Publication Date
TWI704805B true TWI704805B (en) 2020-09-11
TW202041037A TW202041037A (en) 2020-11-01

Family

ID=66937500

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108117520A TWI704805B (en) 2019-04-16 2019-05-21 Video editing method and device

Country Status (2)

Country Link
CN (1) CN109889920B (en)
TW (1) TWI704805B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110401869A (en) * 2019-07-26 2019-11-01 歌尔股份有限公司 A kind of net cast method, system and electronic equipment and storage medium
CN110351579B (en) * 2019-08-16 2021-05-28 深圳特蓝图科技有限公司 Intelligent video editing method
CN110650374B (en) * 2019-08-16 2022-03-25 咪咕文化科技有限公司 Clipping method, electronic device, and computer-readable storage medium
CN110545408B (en) * 2019-09-06 2021-01-26 苏州凌犀物联网技术有限公司 Intelligent manufacturing display system and method based on intelligent service platform
CN110650369B (en) * 2019-09-29 2021-09-17 北京谦仁科技有限公司 Video processing method and device, storage medium and electronic equipment
CN111199210B (en) * 2019-12-31 2023-05-30 武汉星巡智能科技有限公司 Expression-based video generation method, device, equipment and storage medium
CN111405197B (en) * 2020-03-19 2022-11-08 京东科技信息技术有限公司 Video clipping method, image processing method and device
CN111901627B (en) * 2020-05-28 2022-12-30 北京大米科技有限公司 Video processing method and device, storage medium and electronic equipment
CN111918122A (en) * 2020-07-28 2020-11-10 北京大米科技有限公司 Video processing method and device, electronic equipment and readable storage medium
CN112532897B (en) * 2020-11-25 2022-07-01 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN114697700B (en) * 2020-12-28 2024-07-16 北京小米移动软件有限公司 Video editing method, video editing device and storage medium
CN112911332B (en) * 2020-12-29 2023-07-25 百度在线网络技术(北京)有限公司 Method, apparatus, device and storage medium for editing video from live video stream
CN112866808B (en) * 2020-12-31 2022-09-06 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN113920534A (en) * 2021-10-08 2022-01-11 北京领格卓越科技有限公司 Method, system and storage medium for extracting video highlight
CN116074574A (en) * 2021-11-03 2023-05-05 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN115567660B (en) * 2022-02-28 2023-05-26 荣耀终端有限公司 Video processing method and electronic equipment
CN115734007B (en) * 2022-09-22 2023-09-01 北京国际云转播科技有限公司 Video editing method, device, medium and video processing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102612707A (en) * 2009-08-03 2012-07-25 印度孟买技术研究院 System for creating a capsule representation of an instructional video
CN106375695A (en) * 2016-08-30 2017-02-01 百味迹忆(厦门)网络科技有限公司 Audio/video scoring and storing method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856636B1 (en) * 2009-09-22 2014-10-07 Adobe Systems Incorporated Methods and systems for trimming video footage
CN103716661A (en) * 2013-12-16 2014-04-09 乐视致新电子科技(天津)有限公司 Video scoring reporting method and device
CN105933181B (en) * 2016-04-29 2019-01-25 腾讯科技(深圳)有限公司 A kind of call time delay appraisal procedure and device
CN106210902B (en) * 2016-07-06 2019-06-11 华东师范大学 A kind of cameo shot clipping method based on barrage comment data
CN108924648B (en) * 2018-07-17 2021-07-23 北京新唐思创教育科技有限公司 Method, apparatus, device and medium for playing video data to a user
CN109121021A (en) * 2018-09-28 2019-01-01 北京周同科技有限公司 A kind of generation method of Video Roundup, device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102612707A (en) * 2009-08-03 2012-07-25 印度孟买技术研究院 System for creating a capsule representation of an instructional video
CN106375695A (en) * 2016-08-30 2017-02-01 百味迹忆(厦门)网络科技有限公司 Audio/video scoring and storing method and device

Also Published As

Publication number Publication date
CN109889920B (en) 2022-08-05
TW202041037A (en) 2020-11-01
CN109889920A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
TWI704805B (en) Video editing method and device
US11894014B2 (en) Audio-visual speech separation
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
WO2022110354A1 (en) Video translation method, system and device, and storage medium
WO2020237855A1 (en) Sound separation method and apparatus, and computer readable storage medium
US11682401B2 (en) Matching speakers to meeting audio
US10037313B2 (en) Automatic smoothed captioning of non-speech sounds from audio
Mittal et al. Animating face using disentangled audio representations
US20170287481A1 (en) System and method to insert visual subtitles in videos
US8873861B2 (en) Video processing apparatus and method
KR101492816B1 (en) Apparatus and method for providing auto lip-synch in animation
WO2020147407A1 (en) Conference record generation method and apparatus, storage medium and computer device
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
EP2324475A1 (en) Robust media fingerprints
KR20070020252A (en) Method of and system for modifying messages
CN1639738A (en) Method and system for generating caricaturized talking heads
WO2022100691A1 (en) Audio recognition method and device
JP2010011409A (en) Video digest apparatus and video editing program
WO2022100692A1 (en) Human voice audio recording method and apparatus
Lu et al. Self-supervised audio spatialization with correspondence classifier
CN112330579B (en) Video background replacement method, device, computer equipment and computer readable medium
Hegde et al. Visual speech enhancement without a real visual stream
CN111970579A (en) Video music adaptation method and system based on AI video understanding
CN117609548A (en) Video multi-mode target element extraction and video abstract synthesis method and system based on pre-training model
CN117768597B (en) Guide broadcasting method, device, equipment and storage medium