TWI830604B - Video topic analysis system, method and computer readable medium thereof - Google Patents
- Publication number: TWI830604B
- Application number: TW112106317A
- Authority: TW (Taiwan)
- Prior art keywords: sequence, interleaved, vector, topic, text
Description
The present invention relates to techniques for analyzing video topics, and in particular to a video topic analysis system and method based on a multi-modal interleaved attention encoding mechanism, and a computer-readable medium thereof.
The Transformer model is a neural network that uses attention or self-attention to track the relationships within sequence data and thereby learn the context and meaning connecting its elements; in short, the Transformer is a deep learning model built on a self-attention mechanism. Today, Transformer models have gradually displaced the previously common convolutional neural networks (CNN) and recurrent neural networks (RNN) to become the mainstay of deep learning models.
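The relationship-tracking step of self-attention can be sketched minimally. The following is an illustrative sketch only, not the patented mechanism: queries, keys, and values are all the input sequence itself, and the learned projection matrices of a real Transformer are omitted (the function name `self_attention` is ours).

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x (a list of
    equal-length vectors). Each position is rewritten as a weighted
    mixture of every position, so each output vector carries context
    from the whole sequence."""
    d = len(x[0])
    out = []
    for q in x:
        # similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax: one weight per position
        out.append([sum(w * row[j] for w, row in zip(weights, x))
                    for j in range(d)])
    return out
```

Because the weights are a softmax over all positions, each output vector is a convex combination of the input vectors, with the largest weight on the most similar positions.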
In addition, traditional deep learning techniques usually require large amounts of annotated (labeled) data to train a model, yet annotating a large-scale dataset is both time-consuming and costly, leaving clear room for improvement. Because the Transformer model learns the relationships between elements and has strong feature extraction capability, it can, with an appropriate design, be pre-trained on unlabeled data; without requiring a large labeled dataset, its results are significantly better than those of other modeling approaches.
Existing techniques for text topic analysis are not rare; they are commonly applied to text such as articles, news, or books, where analysis yields topic clusters. Such analysis, however, is limited to text. A video may contain images, text, or speech, so text analysis alone may be insufficient. Some techniques can analyze images or speech, but only within a single modality: even when both image analysis and speech analysis are performed, their results cannot be integrated, because the extracted features are not of the same type and are difficult to combine. Current techniques for video topic analysis are therefore clearly inadequate.
In summary, providing a video topic analysis technique that can parse the different modalities a video may contain, such as speech, images, and text, and that can integrate the feature data of those modalities so as to analyze the video's topics comprehensively, has become a goal that those skilled in the art are eager to pursue.
To solve the above problems of the prior art, the present invention discloses a video topic analysis system comprising: a text-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on text sequence data obtained from video data to obtain representative features of the text sequence; an image-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence; a speech-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence; a cross-modal interleaved attention computation and encoding module, for performing nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data; a clustering module, for clustering the multi-modal vector representation of the video data to produce a clustering result; and a topic generation module, for performing topic creation on the clustering result to produce at least one topic group and then, through similarity computation among the topic groups, merging topic groups whose similarity exceeds a threshold to obtain the video analysis result.
In one embodiment, the text-feature interleaved attention computation and encoding module further includes: a word segmentation unit, for segmenting the text sequence data into a number of words; a text sequence vector initialization unit, for vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and a text interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
In one embodiment, the image-feature interleaved attention computation and encoding module further includes: an image frame-cutting unit, for cutting the image frame sequence into a number of image frames; an image sequence vector initialization unit, for vector-encoding each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and an image interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
In one embodiment, the speech-feature interleaved attention computation and encoding module further includes: a speech frame-cutting unit, for cutting the speech sequence into a number of speech frames; a speech sequence vector initialization unit, for vector-encoding each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and a speech interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
In one embodiment, the topic generation module further includes: a topic creation unit, which, when creating a topic for each topic group, produces a representative topic hashtag for that group; and a topic reduction unit, which, through text similarity computation and similarity computation of representative features between topics, obtains the similarity between the topics' feature vectors and the topic hashtags, so as to reduce (merge) topic groups whose average similarity exceeds the threshold and thereby obtain the video analysis result.
The present invention further discloses a video topic analysis method executed by computer equipment, the method comprising the steps of: causing a text-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on text sequence data obtained from video data to obtain representative features of the text sequence; causing an image-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence; causing a speech-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence; causing a cross-modal interleaved attention computation and encoding module to perform nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data; causing a clustering module to cluster the multi-modal vector representation of the video data to produce a clustering result; and causing a topic generation module to perform topic creation on the clustering result to produce at least one topic group and then, through similarity computation among the topic groups, merge topic groups whose similarity exceeds a threshold to obtain the video analysis result.
In the above method, the step of causing the text-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the text sequence data obtained from the video data further includes: segmenting the text sequence data into a number of words; vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
In the above method, the step of causing the image-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the image frame sequence obtained from the video data further includes: cutting the image frame sequence into a number of image frames; vector-encoding each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
In the above method, the step of causing the speech-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the speech sequence obtained from the video data further includes: cutting the speech sequence into a number of speech frames; vector-encoding each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
In the above method, the step of causing the topic generation module to perform topic creation on the clustering result to produce at least one topic group and then compute similarity among the topic groups further includes: vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
The present invention further discloses a computer-readable medium, for use in a computing device or computer, which stores instructions for executing the aforementioned video topic analysis method.
In summary, the video topic analysis system, method, and computer-readable medium of the present invention process multi-modal data such as text, images, and speech through two stages of interleaved attention, thereby obtaining feature information that is more representative and fuses rich multi-modal information. In the first stage, the per-modality interleaved attention computation and encoding modules use an interleaved attention mechanism to perform interleaved attention computation and encoding on the text, image, and speech data separately. In the second stage, the cross-modal interleaved attention computation and encoding module uses a cross-modal interleaved attention mechanism to perform nonlinear latent-vector alignment on the encoded vector results of the text, image, and speech sequence data produced by the first stage; after this computation and encoding, a representative multi-modal vector representation is obtained. Then, through feature clustering, topic creation, and topic merging, the final video topic analysis result is produced. In short, the present invention considers the text, image, and speech data in a video and performs cross-modal integration, providing better analytic results for video topic analysis.
1: Video topic analysis system
11: Text-feature interleaved attention computation and encoding module
111: Word segmentation unit
112: Text sequence vector initialization unit
113: Text interleaved attention computation and encoding processing unit
12: Image-feature interleaved attention computation and encoding module
121: Image frame-cutting unit
122: Image sequence vector initialization unit
123: Image interleaved attention computation and encoding processing unit
13: Speech-feature interleaved attention computation and encoding module
131: Speech frame-cutting unit
132: Speech sequence vector initialization unit
133: Speech interleaved attention computation and encoding processing unit
14: Cross-modal interleaved attention computation and encoding module
15: Clustering module
16: Topic generation module
161: Topic creation unit
162: Topic reduction unit
301-306: Processes
501-504: Processes
S401-S406: Steps
Figure 1 is a system architecture diagram of the video topic analysis system of the present invention.
Figure 2 is an internal architecture diagram of the modules of the video topic analysis system of the present invention.
Figure 3 is an operation flow chart of the video topic analysis system of the present invention in a specific example.
Figure 4 is a step diagram of the video topic analysis method of the present invention.
Figure 5 is a schematic diagram of an application example of the video topic analysis method of the present invention.
The technical content of the present invention is described below by way of specific embodiments; those skilled in the art can readily understand the advantages and effects of the present invention from the disclosure of this specification. The present invention may also be implemented or applied through other, different embodiments.
Figure 1 is a system architecture diagram of the video topic analysis system of the present invention. As shown in the figure, the video topic analysis system 1 of the present invention includes a text-feature interleaved attention computation and encoding module 11, an image-feature interleaved attention computation and encoding module 12, a speech-feature interleaved attention computation and encoding module 13, a cross-modal interleaved attention computation and encoding module 14, a clustering module 15, and a topic generation module 16.
The text-feature interleaved attention computation and encoding module 11 performs interleaved attention computation and encoding on the text sequence data obtained from the video data to obtain representative features of the text sequence. In short, module 11 applies interleaved attention computation and encoding to the text data in the video data, yielding the representative features of the video's text.
The image-feature interleaved attention computation and encoding module 12 performs interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence. In other words, module 12 applies interleaved attention computation and encoding to the image data in the video data, yielding the representative features of the video's image sequence.
The speech-feature interleaved attention computation and encoding module 13 performs interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence. In short, module 13 applies interleaved attention computation and encoding to the speech (audio) data in the video data, yielding the representative features of the video's speech sequence.
The cross-modal interleaved attention computation and encoding module 14 performs nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data. Specifically, module 14 integrates the representative features of the text sequence from module 11, of the image sequence from module 12, and of the speech sequence from module 13. Because modules 11, 12, and 13 encode the representative features of their respective sequences into sequence vectors of the same dimensionality, module 14 can, through nonlinear latent-vector alignment, merge the sequence vectors obtained from the video's text, image, and speech data, thereby obtaining a multi-modal vector representation of the video data.
The clustering module 15 clusters the multi-modal vector representation of the video data to produce a clustering result. Once module 14 has obtained the multi-modal vector representation of the video data, a clustering technique groups the videos' representative features so that topic creation or merging can subsequently be performed on each group.
In one embodiment, the clustering technique may use the density-based HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to cluster the features.
The topic generation module 16 performs topic creation on the clustering result to produce at least one topic group and then, through similarity computation and comparison among the topic groups, merges topic groups whose similarity exceeds a threshold to obtain the video analysis result. In short, module 16 creates a topic for each cluster and assigns a corresponding topic hashtag. After multiple topic groups have been created, groups with high similarity are merged through similarity comparison; similarity can be judged from the similarity between each topic's feature vectors and the topic hashtags. When the similarity value exceeds a predetermined threshold, two topic groups are judged highly similar and can be further reduced into one.
In addition, the topic creation technique may be a class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) algorithm. The c-TF-IDF algorithm weights words, as is done in text mining or natural language processing, so as to reflect each word's importance to a document; since this topic creation technique is well known, it is not described further here.
As can be seen from the above, modules 11, 12, and 13 first perform interleaved attention computation and encoding on each modality of the video data separately; cross-modal integration then produces the multi-modal vector representation of the video data. The clustering module 15 next performs clustering, and finally the topic generation module 16 creates topics for the clustering result, produces representative topic hashtags, and reduces groups according to the similarity between them, ultimately yielding the topic analysis result for the video data.
Figure 2 is an internal architecture diagram of the modules of the video topic analysis system of the present invention. As shown in the figure, this embodiment further describes the internal architecture of the text-feature interleaved attention computation and encoding module 11, the image-feature interleaved attention computation and encoding module 12, the speech-feature interleaved attention computation and encoding module 13, and the topic generation module 16.
The text-feature interleaved attention computation and encoding module 11 includes a word segmentation unit 111, a text sequence vector initialization unit 112, and a text interleaved attention computation and encoding processing unit 113. The word segmentation unit 111 segments the text sequence data into a number of words; the text sequence vector initialization unit 112 vector-encodes each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and the text interleaved attention computation and encoding processing unit 113 performs interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
The image-feature interleaved attention computation and encoding module 12 includes an image frame-cutting unit 121, an image sequence vector initialization unit 122, and an image interleaved attention computation and encoding processing unit 123. The image frame-cutting unit 121 cuts the image frame sequence into a number of image frames; the image sequence vector initialization unit 122 vector-encodes each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and the image interleaved attention computation and encoding processing unit 123 performs interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
The speech-feature interleaved attention computation and encoding module 13 includes a speech frame-cutting unit 131, a speech sequence vector initialization unit 132, and a speech interleaved attention computation and encoding processing unit 133. The speech frame-cutting unit 131 cuts the speech sequence into a number of speech frames; the speech sequence vector initialization unit 132 vector-encodes each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and the speech interleaved attention computation and encoding processing unit 133 performs interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
The topic generation module 16 includes a topic creation unit 161 and a topic reduction unit 162. When creating a topic for each topic group, the topic creation unit 161 produces a representative topic hashtag for that group. The topic reduction unit 162 can, through text similarity computation and similarity computation of representative features between topics, obtain the similarity between the topics' feature vectors and the topic hashtags, so as to reduce topic groups whose average similarity exceeds the threshold and thereby obtain the video analysis result. A specific example is described below.
Figure 3 is an operation flow chart of the video topic analysis system of the present invention in a specific example; please also refer to Figure 2. As shown in the figure, the video topic analysis system processes multi-modal data such as text, images, and speech through two stages of interleaved attention, thereby obtaining feature information that is more representative and fuses rich multi-modal information.
In process 301, large-scale data is provided. The large-scale data in this process is video data, which contains a text portion, an image portion, and a speech portion.
In process 302, the first stage performs interleaved attention computation and encoding for each modality's features. Specifically, text, images, and speech are handled by interleaved attention computation and encoding modules of different modalities: the first-stage text-feature module handles the text portion, the first-stage image-feature module handles the image portion, and the first-stage speech-feature module handles the speech portion, in the manner described above with respect to Figure 2. That is, modules 11, 12, and 13 process the text data, image data, and speech data respectively; after interleaved attention computation and encoding, they produce the representative features of the text sequence, the image sequence, and the speech sequence.
The text-feature interleaved attention computation and encoding module 11 provides interleaved attention computation and encoding for text sequence data. Specifically, the word segmentation unit 111 segments the text sequence data and assigns each word a vector encoding to build a word vector. The text sequence vector initialization unit 112 can then obtain the word vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec, a high-performance self-supervised algorithm applicable to multiple modalities). This yields the input sequence vector of the text data, X_KV(text) ∈ R^(M(text)×C(text)), where M(text) is the length of the word-segmented text sequence data and C(text) is a hyperparameter, the dimensionality of the text input sequence vector. The text interleaved attention computation and encoding processing unit 113 holds a trainable model parameter X_Q(text) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 11 as described, the representative feature of the text sequence, X_MLP(text), is obtained.
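The core of the computation above, a trainable query matrix X_Q attending over an input sequence X_KV, can be sketched minimally. This is an illustrative sketch, not the patented implementation: the projection matrices and the MLP that produces X_MLP are omitted, and the function names are ours.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(x_q, x_kv):
    """Learned queries x_q (N rows) attend over the input sequence x_kv
    (M rows of the same dimensionality): each query row becomes a
    weighted mixture of the input rows, so a length-M sequence is
    compressed to a fixed stack of N vectors regardless of M."""
    d = len(x_q[0])
    out = []
    for q in x_q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in x_kv]
        weights = softmax(scores)  # one weight per input position
        out.append([sum(w * row[j] for w, row in zip(weights, x_kv))
                    for j in range(d)])
    return out
```

Because N is fixed, the output size no longer depends on the input length M, which is what allows text, image, and speech sequences of different lengths to be encoded to representations of the same shape for the second stage.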
The image-feature interleaved attention computation and encoding module 12 provides interleaved attention computation and encoding for image sequence data. Specifically, for the image sequence data, the image frame sequence captured from the video data is used as the input sequence vector X_KV(image) ∈ R^(M(image)×C(image)), where M(image) is the length of the image frame sequence data and C(image) is a hyperparameter, the dimensionality of the input sequence vector. The image frame-cutting unit 121 provides the frame-cutting processing, and the image sequence vector initialization unit 122 can obtain the image vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec). The image interleaved attention computation and encoding processing unit 123 holds a trainable model parameter X_Q(image) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 12 as described, the representative feature of the image sequence, X_MLP(image), is obtained.
The speech-feature interleaved attention computation and encoding module 13 provides interleaved attention computation and encoding for speech sequence data. Specifically, for the speech sequence data, the speech (audio) sequence captured from the video data is used as the input sequence vector X_KV(audio) ∈ R^(M(audio)×C(audio)), where M(audio) is the length of the speech frame sequence data and C(audio) is a hyperparameter, the dimensionality of the input sequence vector. The speech frame-cutting unit 131 provides the frame-cutting processing, and the speech sequence vector initialization unit 132 can obtain the speech vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec). The speech interleaved attention computation and encoding processing unit 133 holds a trainable model parameter X_Q(audio) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 13 as described, the representative feature of the speech sequence, X_MLP(audio), is obtained.
In process 303, the second stage performs cross-modal interleaved attention computation and encoding. In this process, the encoded vector results corresponding to the representative features of the text, image, and speech sequences produced in process 302 are subjected to nonlinear latent-vector alignment through cross-modal interleaved attention computation and encoding. In other words, the encoding computation of the previous process has already encoded the text, image, and speech data into sequence vectors of the same dimensionality, so information fusion and latent-vector alignment of the three modal features can be performed, that is, another round of interleaved attention computation and encoding. Here, a cross-modal interleaved attention mechanism performs nonlinear latent-vector alignment on the text, image, and speech encoded vector results produced by modules 11, 12, and 13.
Specifically, the interleaved attention computation and encoding results for text, image, and speech, namely X_MLP(text), X_MLP(image), and X_MLP(audio), are computed separately in process 302. This process then concatenates these tensors to obtain the input sequence vector of the multi-modal data, X_KV(multimodal) ∈ R^(3N×D), after which a trainable model parameter X_Q(multimodal) ∈ R^(N×D) is established for the cross-modal interleaved attention computation and encoding and the multi-modal feature extraction.
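The concatenation step can be sketched as follows, under the assumption that each first-stage output is an N×D list of vectors (the function name is ours):

```python
def concat_modalities(x_text, x_image, x_audio):
    """Stack the three first-stage outputs along the sequence axis:
    three N x D blocks become one 3N x D multi-modal input sequence,
    over which the second-stage latent queries (N x D) then attend."""
    return x_text + x_image + x_audio
```

Because all three first-stage outputs share the dimensionality D, the concatenation is well defined; the second cross-attention pass then compresses the 3N rows back to N.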
After the interleaved attention computation and encoding described above, a representative feature of each video is obtained that contains multi-modal information. Because the interleaved attention computation and encoding of process 303 automatically encodes the representative feature vectors into a smaller dimensionality, no additional dimensionality-reduction processing is needed before the subsequent clustering; this also increases computation speed and avoids the information loss of dimensionality-reduction computation.
In process 304, clustering is performed. This process uses a clustering technique (for example, but not limited to, HDBSCAN) to cluster the representative features of the videos.
In process 305, topic creation and topic reduction are performed. This process can use c-TF-IDF (the class-based TF-IDF algorithm) to perform topic creation on the clustering result of process 304, thereby obtaining a representative topic hashtag for each topic. In addition, text similarity computation (for example, but not limited to, Word Mover's Distance) and similarity computation of representative features between topics (for example, but not limited to, cosine similarity) can be used to compute the similarity between each topic's feature vectors and the topic hashtags. Finally, topic reduction is performed on topic groups whose average similarity (the text similarity result and the topic representative-feature similarity result are each normalized to a value between 0 and 1, and the two are averaged) exceeds a threshold (for example, but not limited to, 0.75).
In process 306, the video topic analysis result is produced. That is, after the topic creation and topic reduction of process 305, the video topic analysis result is obtained.
Figure 4 is a step diagram of the video topic analysis method of the present invention, illustrating how video topic analysis is performed automatically.
In step S401, the text-feature interleaved attention computation and encoding module performs interleaved attention computation and encoding on the text sequence data obtained from the video data to obtain representative features of the text sequence. This step performs feature extraction on the text sequence. Specifically, the text sequence data is first segmented into a number of words; each word is then vector-encoded to produce a word vector so that, after the word vectors are initialized, the input sequence vector of the text data is obtained; finally, the input sequence vector of the text data undergoes interleaved attention computation and encoding to obtain the representative features of the text sequence.
In step S402, the image-feature interleaved attention computation and encoding module performs interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence. This step performs feature extraction on the image sequence. Specifically, the image frame sequence is first cut into a number of image frames; each image frame is then vector-encoded to produce an image frame vector so that, after the image frame vectors are initialized, the input sequence vector of the image data is obtained; finally, the input sequence vector of the image data undergoes interleaved attention computation and encoding to obtain the representative features of the image sequence.
In step S403, the speech-feature interleaved attention calculation and encoding module performs interleaved attention calculation and encoding on the speech sequence obtained from the video data, to obtain the representative features of the speech sequence. This step performs feature extraction on the speech sequence. Specifically, the speech sequence is first cut into frames to produce several speech frames; vector encoding is then applied to each speech frame to produce a speech frame vector, and after each speech frame vector is initialized, the input sequence vector of the speech data is obtained. Finally, the input sequence vector of the speech data undergoes interleaved attention calculation and encoding to yield the representative features of the speech sequence.
In step S404, the cross-modal interleaved attention calculation and encoding module performs nonlinear latent-vector alignment of the representative features of the text sequence, the image sequence, and the speech sequence to obtain a multi-modal vector representation of the video data. This step describes the multi-modal interleaved attention calculation and encoding, which can be carried out through procedures such as multi-modal tensor sequence concatenation, cross-modal interleaved attention calculation and encoding, and multi-modal sequence feature mapping, encoding, and extraction; the aim is to extract the features of the multi-modal video sequences for the subsequent multi-modal video topic analysis.
In step S405, the clustering module groups the multi-modal vector representations of the video data to produce a clustering result. This step performs the clustering of the multi-modal vector representations of the video data.
In step S406, the topic generation module performs topic creation on the clustering result to produce at least one topic group and, through similarity calculation and comparison among the topic groups, merges the topic groups whose similarity exceeds a threshold to obtain the video analysis result. This step carries out topic creation and topic reduction (merging), producing the automatic video clusters along with each cluster's representative hashtag and representative video content information.
FIG. 5 is a schematic diagram of an application example of the video topic analysis method of the present invention. As shown in the figure, in process 501 a large amount of video data serves as input; in process 502, data pre-processing techniques extract text, image, and speech sequence data from the video data; in process 503, topic analysis is performed by the video topic analysis model based on the multi-modal interleaved attention encoding mechanism; finally, the video topic analysis results shown in process 504 are produced. A concrete embodiment is described below.
First, the video data is cut into frames; for example, 1 second of video can be cut into 30 frames, and by extension 10 seconds yields 300 frames. The image content of each frame is then extracted to obtain the input data of the image sequence, and the audio information in the video is extracted to obtain the input data of the speech sequence. The text sequence data can be obtained in (but not limited to) the following three ways: first, if the video data has a complete subtitle file (such as an .srt file), the text in the subtitle file is used as the input data of the text sequence; second, if there is no subtitle file but the video images contain clearly visible subtitles, optical character recognition (OCR) or similar techniques can extract the corresponding subtitle content as the input data of the text sequence; third, speech recognition can transcribe the speech content into text, yielding the input data of the text sequence.
After the text, image, and speech sequence data are obtained as above, interleaved attention calculation and encoding is first applied to each modality separately. For the text sequence data, the text is segmented into words and each word is assigned a vector encoding to build a word vector, giving the input sequence vector X_KV(text) ∈ R^(M(text)×C(text)), where M(text) is the length of the segmented text sequence and the hyperparameter C(text) is the dimension of the text input sequence vector. Next, a trainable model parameter X_Q(text) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for text data is computed as X_QKV(text) = f_o(text)(X_QK(text) V(text)) = Attention(X_Q(text), X_KV(text)), where X_QK(text) = softmax(Q(text) K(text)^T / √F), Q(text) = f_Q(text)(X_Q(text)), K(text) = f_K(text)(X_KV(text)), and V(text) = f_V(text)(X_KV(text)). Here f_Q(text), f_K(text), and f_V(text) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(text) is a feedforward neural network that produces X_QKV(text) ∈ R^(N×D). Finally, X_QKV(text) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the text sequence data, X_MLP(text) ∈ R^(N×D).
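The interleaved attention above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the learnable projections f_Q, f_K, f_V are replaced by identity maps, and the softmax scaling is assumed to be √F, where F is the feature dimension.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def intercross_attention(x_q, x_kv):
    """Latent-query attention: x_q is the fixed-length trainable latent
    sequence (N rows), x_kv the input sequence (M rows). The score matrix
    is N x M, so the cost grows as O(M*N) rather than O(M^2)."""
    f = len(x_q[0])                                  # feature dimension F
    scores = matmul(x_q, transpose(x_kv))            # N x M
    weights = [softmax([s / math.sqrt(f) for s in row]) for row in scores]
    return matmul(weights, x_kv)                     # N x F, regardless of M
```

As a sanity check, if every input row is identical, every output row is that same row, since each output row is a convex combination of the input rows.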
For the image sequence data, the sequence of image frames captured from the video data is used as the input sequence vector X_KV(image) ∈ R^(M(image)×C(image)), where M(image) is the length of the image frame sequence and the hyperparameter C(image) is the dimension of the input sequence vector. Next, a trainable model parameter X_Q(image) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for image data is computed as X_QKV(image) = f_o(image)(X_QK(image) V(image)) = Attention(X_Q(image), X_KV(image)), where X_QK(image) = softmax(Q(image) K(image)^T / √F), Q(image) = f_Q(image)(X_Q(image)), K(image) = f_K(image)(X_KV(image)), and V(image) = f_V(image)(X_KV(image)). Here f_Q(image), f_K(image), and f_V(image) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(image) is a feedforward neural network that produces X_QKV(image) ∈ R^(N×D). Finally, X_QKV(image) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the image sequence data, X_MLP(image) ∈ R^(N×D).
For the speech sequence data, the speech (audio) sequence captured from the video data is used as the input sequence vector X_KV(audio) ∈ R^(M(audio)×C(audio)), where M(audio) is the length of the speech frame sequence and the hyperparameter C(audio) is the dimension of the input sequence vector. Next, a trainable model parameter X_Q(audio) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for speech data is computed as X_QKV(audio) = f_o(audio)(X_QK(audio) V(audio)) = Attention(X_Q(audio), X_KV(audio)), where X_QK(audio) = softmax(Q(audio) K(audio)^T / √F), Q(audio) = f_Q(audio)(X_Q(audio)), K(audio) = f_K(audio)(X_KV(audio)), and V(audio) = f_V(audio)(X_KV(audio)). Here f_Q(audio), f_K(audio), and f_V(audio) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(audio) is a feedforward neural network that produces X_QKV(audio) ∈ R^(N×D). Finally, X_QKV(audio) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the speech sequence data, X_MLP(audio) ∈ R^(N×D).
Next, the interleaved attention calculation and encoding results of text, image, and speech computed in the previous stage, X_MLP(text), X_MLP(image), and X_MLP(audio), are concatenated into one tensor, giving the input sequence vector of the multi-modal data X_KV(multimodal) ∈ R^(3N×D). A trainable model parameter X_Q(multimodal) ∈ R^(N×D) is then created for the cross-modal interleaved attention calculation and encoding, computed as X_QKV(multimodal) = f_o(multimodal)(X_QK(multimodal) V(multimodal)) = Attention(X_Q(multimodal), X_KV(multimodal)), where X_QK(multimodal) = softmax(Q(multimodal) K(multimodal)^T / √F), Q(multimodal) = f_Q(multimodal)(X_Q(multimodal)), K(multimodal) = f_K(multimodal)(X_KV(multimodal)), and V(multimodal) = f_V(multimodal)(X_KV(multimodal)). Here f_Q(multimodal), f_K(multimodal), and f_V(multimodal) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(multimodal) is a feedforward neural network that produces X_QKV(multimodal) ∈ R^(N×D). Finally, X_QKV(multimodal) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the multi-modal sequence data, X_MLP(multimodal) ∈ R^(N×D).
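The tensor concatenation in this stage is along the sequence axis. A tiny shape check (with placeholder zero tensors and arbitrarily chosen toy sizes) illustrates how three N×D modality encodings become one 3N×D multi-modal input, which the N×D latent query then compresses back to N rows:

```python
N, D = 4, 3  # toy sizes; the real N and D are model hyperparameters

x_text  = [[0.0] * D for _ in range(N)]
x_image = [[0.0] * D for _ in range(N)]
x_audio = [[0.0] * D for _ in range(N)]

# Concatenate along the sequence axis: (N + N + N) x D = 3N x D
x_kv_multimodal = x_text + x_image + x_audio
assert len(x_kv_multimodal) == 3 * N and len(x_kv_multimodal[0]) == D

# The latent query sequence stays N x D, so the cross-modal attention
# output is again N x D, independent of the concatenated length 3N.
x_q_multimodal = [[0.0] * D for _ in range(N)]
```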
Through the above two stages of interleaved attention calculation and encoding, the encoding result of the multi-modal sequence data aligned in the latent vector space, X_MLP(multimodal) ∈ R^(N×D), is obtained. Next, three vectors are computed from the N vectors of X_MLP(multimodal): their average, avg(X_MLP(multimodal)) ∈ R^(1×D); their element-wise product, element_wise_product(X_MLP(multimodal)) ∈ R^(1×D); and the front vector, X_MLP(multimodal)[0] ∈ R^(1×D). These three vectors are then concatenated as concatenate(avg(X_MLP(multimodal)), element_wise_product(X_MLP(multimodal)), X_MLP(multimodal)[0]) ∈ R^(3×D), hereinafter abbreviated X_concatenate(multimodal), and fed into a multi-layer perceptron (MLP) and a residual block to obtain the final encoding result of the multi-modal sequence data, X_final(multimodal). The formula is X_final(multimodal) = MLP(X_concatenate(multimodal)) ⊕ X_concatenate(multimodal), where ⊕ denotes element-wise addition.
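The three pooling operations and the residual combination can be sketched as follows. This is an illustrative sketch only: the MLP is stubbed out as an identity function, since its layer sizes are not specified here, and the 3×D tensor is handled as a flat list of length 3D.

```python
def pool_and_concatenate(x):
    """x: the N x D encoding as a list of row vectors.
    Returns avg, element-wise product, and front vector, concatenated."""
    n, d = len(x), len(x[0])
    avg = [sum(row[j] for row in x) / n for j in range(d)]
    prod = [1.0] * d
    for row in x:                      # element-wise product over the N vectors
        prod = [p * v for p, v in zip(prod, row)]
    first = list(x[0])                 # the front vector x[0]
    return avg + prod + first          # concatenated: length 3*D

def mlp(v):                            # placeholder for the trained MLP
    return v

def final_encoding(x):
    c = pool_and_concatenate(x)
    return [m + r for m, r in zip(mlp(c), c)]   # residual: MLP(c) ⊕ c
```

With the identity stub, the residual step simply doubles each component; in the trained model, the MLP output is a learned transformation of the concatenated vector.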
The multi-modal attention encoding mechanism model of the present invention is intended for unsupervised video topic analysis. Therefore, before the unsupervised video topic analysis task is performed, the model weights need to be pre-trained on a separate supervised task so that more representative multi-modal features are learned, improving the effectiveness of the unsupervised video topic analysis. The present invention pre-trains the multi-modal interleaved attention encoding mechanism model on the supervised task of text-to-video retrieval.
For the text-to-video retrieval task, X_MLP(text) is fed into a multi-layer perceptron (MLP), and the result X_MLP(MLP(text)) ∈ R^(3×D) serves as the vector representation of the text, while X_final(multimodal) serves as the vector representation of the entire video; similarity matching is then trained on these. That is, for text and video data labeled as positive samples (highly related), the supervised training target of the model is a cosine similarity of cosine-sim(X_MLP(MLP(text)), X_final(multimodal)) = 1; for text and video data labeled as negative samples (unrelated), the target is cosine-sim(X_MLP(MLP(text)), X_final(multimodal)) = -1.
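The cosine-similarity training target can be illustrated directly; this is the generic cosine similarity, not tied to the patent's trained vectors:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v: +1 when aligned,
    -1 when opposite, 0 when orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# During pre-training, a positive (related) text-video pair is pushed
# toward cosine similarity +1, a negative (unrelated) pair toward -1.
```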
Processed by the above technical means, a representative feature of each video is obtained that contains multi-modal information. Because the cross-modal interleaved attention calculation and encoding module automatically encodes this representative feature vector into a smaller dimension, the subsequent clustering can be performed without any additional dimensionality reduction, which speeds up the computation and avoids the information loss that extra dimensionality-reduction calculations would cause.
Next, a clustering technique groups the representative features of the videos, and c-TF-IDF (class-based TF-IDF) is applied to the clustering result for topic creation, yielding a representative hashtag for each topic. Text similarity measures (such as, but not limited to, Word Mover's Distance) and similarity measures between the representative features of topics (such as, but not limited to, cosine similarity) are then used to compute the similarity of feature vectors and hashtags across topics. Finally, topic groups whose average similarity exceeds a threshold undergo topic reduction, producing the final video topic analysis result.
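The topic reduction step — merging topic groups whose similarity exceeds a threshold — can be sketched as a small union-find over topic centroids. The centroid vectors and the 0.9 threshold below are illustrative values, not values from the patent, and the real system compares hashtag text similarity as well.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reduce_topics(centroids, threshold):
    """Merge topic groups whose pairwise centroid similarity exceeds the
    threshold; returns a merged-group id for each original topic."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine_sim(centroids[i], centroids[j]) > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(centroids))]
```

For example, with centroids [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]] and threshold 0.9, the first two topics merge into one group while the third stays separate.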
In one embodiment, each of the above systems, modules, servers, and devices may implement the foregoing in software, hardware, or firmware. If hardware, it may be a processing unit, processor, computer, or server with data processing and computing capability; if software or firmware, it may comprise instructions executable by a processing unit, processor, computer, or server, and may be installed on the same hardware device or distributed across a plurality of different hardware devices.
In addition, the present invention also discloses a computer-readable medium, applied in a computing device or computer having a processor (e.g., CPU, GPU) and/or memory, which stores instructions such that the computing device or computer executes the computer-readable medium through the processor and/or memory, thereby performing the above method and its steps.
In summary, the present invention discloses a video topic analysis system, method, and computer-readable medium based on a multi-modal interleaved attention encoding mechanism, in which the interleaved attention calculation and encoding module of each modality extracts features from the input text, image, and speech data, and the cross-modal interleaved attention calculation and encoding module projects those features into the same latent vector space for alignment, further capturing and fusing the characteristics of the multi-modal information. Finally, clustering, topic creation, and topic reduction procedures produce the video topic analysis result, quickly and accurately yielding automatic video clusters together with each cluster's representative hashtag and representative video content information. The present invention further has the following features and effects:
First, the present invention can be used to perform topic analysis on video data, letting users grasp the important information in the video data, the potential topic categories and number of potential topic groups, and the representative hashtags.
Second, the present invention proposes an interleaved attention calculation and encoding module applicable to text, image, and speech data. It first vector-encodes sequence data of arbitrary length (sequence length M) for each of these modalities and computes the corresponding Key and Value vector sequences, while the corresponding Query vectors are encoded from a fixed-length latent sequence vector (sequence length N). The Key, Value, and Query vectors then enter the attention computation, whose time complexity is O(MN); since M is generally much larger than N and N is fixed, this simplifies to O(M), clearly lower than the O(M²) time complexity of the self-attention mechanism in a standard Transformer model. The method of the present invention is therefore able to process longer multi-modal sequence data.
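The complexity claim can be made concrete by counting score-matrix entries; the sequence lengths below are only an example, not values from the patent:

```python
M = 9000   # e.g., a 5-minute video cut at 30 frames per second
N = 64     # fixed latent sequence length (a hyperparameter)

latent_query_cost = M * N    # interleaved attention score matrix: O(M*N)
self_attention_cost = M * M  # standard Transformer self-attention: O(M^2)

# The advantage over self-attention grows linearly with M.
assert self_attention_cost // latent_query_cost == M // N
```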
Third, the present invention proposes a cross-modal interleaved (intercross) attention calculation and encoding module that computes multi-modal information with the interleaved attention mechanism and, through the encoding module, directly performs dimensional projection and compression on the multi-modal latent sequence vectors aligned in the latent vector space. The method automatically fuses and aligns information from different modalities and compresses dimensions directly, without the additional dimensionality-reduction step common in topic analysis (typically the UMAP, Uniform Manifold Approximation and Projection, algorithm). Because that extra dimensionality reduction requires additional computation and also causes a loss of information features, avoiding it lets the method of the present invention improve both the multi-modal information integration capability and the computational efficiency.
The detailed description above is a specific description of one feasible embodiment of the present invention. The embodiment is not intended to limit the patent scope of the present invention, and any equivalent implementation or modification that does not depart from the technical spirit of the present invention shall be included within the patent scope of the present invention.
1: video topic analysis system
11: text-feature interleaved attention calculation and encoding module
12: image-feature interleaved attention calculation and encoding module
13: speech-feature interleaved attention calculation and encoding module
14: cross-modal interleaved attention calculation and encoding module
15: clustering module
16: topic generation module
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW112106317A TWI830604B (en) | 2023-02-21 | 2023-02-21 | Video topic analysis system, method and computer readable medium thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
TWI830604B true TWI830604B (en) | 2024-01-21 |
Family
ID=90459303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW112106317A TWI830604B (en) | 2023-02-21 | 2023-02-21 | Video topic analysis system, method and computer readable medium thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI830604B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783691A (en) * | 2018-12-29 | 2019-05-21 | 四川远鉴科技有限公司 | A kind of video retrieval method of deep learning and Hash coding |
CN110363164A (en) * | 2019-07-18 | 2019-10-22 | 南京工业大学 | A kind of unified approach based on LSTM time consistency video analysis |
CN114040126A (en) * | 2021-09-22 | 2022-02-11 | 西安深信科创信息技术有限公司 | Character-driven character broadcasting video generation method and device |
US20220406038A1 (en) * | 2021-06-21 | 2022-12-22 | Tubi, Inc. | Training data generation for advanced frequency management |
- 2023-02-21: TW application TW112106317A, patent TWI830604B, active