TWI830604B - Video topic analysis system, method and computer readable medium thereof - Google Patents
- Publication number: TWI830604B
- Application number: TW112106317A
- Authority: TW (Taiwan)
- Prior art keywords: sequence, interleaved, vector, topic, text
Description
The present invention relates to techniques for analyzing video topics, and in particular to a video topic analysis system and method based on a multi-modal interleaved attention encoding mechanism, and a computer-readable medium thereof.
The Transformer model is a neural network that uses attention or self-attention to track the relationships within sequence data and thereby learn the context and meaning connecting its elements; in short, the Transformer is a deep learning model built on a self-attention mechanism. Today, Transformer models have gradually displaced the previously common convolutional neural networks (CNN) and recurrent neural networks (RNN) to become the mainstay of deep learning models.
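The relationship-tracking step of self-attention can be sketched minimally. The following is an illustrative sketch only, not the patented mechanism: queries, keys, and values are all the input sequence itself, and the learned projection matrices of a real Transformer are omitted (the function name `self_attention` is ours).

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x (a list of
    equal-length vectors). Each position is rewritten as a weighted
    mixture of every position, so each output vector carries context
    from the whole sequence."""
    d = len(x[0])
    out = []
    for q in x:
        # similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax: one weight per position
        out.append([sum(w * row[j] for w, row in zip(weights, x))
                    for j in range(d)])
    return out
```

Because the weights are a softmax over all positions, each output vector is a convex combination of the input vectors, with the largest weight on the most similar positions.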
In addition, traditional deep learning techniques usually require large amounts of annotated (labeled) data to train a model, yet annotating a large-scale dataset is both time-consuming and costly, leaving clear room for improvement. Because the Transformer model learns the relationships between elements and has strong feature extraction capability, it can, with an appropriate design, be pre-trained on unlabeled data; without requiring a large labeled dataset, its results are significantly better than those of other modeling approaches.
Existing techniques for text topic analysis are not rare; they are commonly applied to text such as articles, news, or books, where analysis yields topic clusters. Such analysis, however, is limited to text. A video may contain images, text, or speech, so text analysis alone may be insufficient. Some techniques can analyze images or speech, but only within a single modality: even when both image analysis and speech analysis are performed, their results cannot be integrated, because the extracted features are not of the same type and are difficult to combine. Current techniques for video topic analysis are therefore clearly inadequate.
In summary, providing a video topic analysis technique that can parse the different modalities a video may contain, such as speech, images, and text, and that can integrate the feature data of those modalities so as to analyze the video's topics comprehensively, has become a goal that those skilled in the art are eager to pursue.
To solve the above problems of the prior art, the present invention discloses a video topic analysis system comprising: a text-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on text sequence data obtained from video data to obtain representative features of the text sequence; an image-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence; a speech-feature interleaved attention computation and encoding module, for performing interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence; a cross-modal interleaved attention computation and encoding module, for performing nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data; a clustering module, for clustering the multi-modal vector representation of the video data to produce a clustering result; and a topic generation module, for performing topic creation on the clustering result to produce at least one topic group and then, through similarity computation among the topic groups, merging topic groups whose similarity exceeds a threshold to obtain the video analysis result.
In one embodiment, the text-feature interleaved attention computation and encoding module further includes: a word segmentation unit, for segmenting the text sequence data into a number of words; a text sequence vector initialization unit, for vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and a text interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
In one embodiment, the image-feature interleaved attention computation and encoding module further includes: an image frame-cutting unit, for cutting the image frame sequence into a number of image frames; an image sequence vector initialization unit, for vector-encoding each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and an image interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
In one embodiment, the speech-feature interleaved attention computation and encoding module further includes: a speech frame-cutting unit, for cutting the speech sequence into a number of speech frames; a speech sequence vector initialization unit, for vector-encoding each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and a speech interleaved attention computation and encoding processing unit, for performing interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
In one embodiment, the topic generation module further includes: a topic creation unit, which, when creating a topic for each topic group, produces a representative topic hashtag for that group; and a topic reduction unit, which, through text similarity computation and similarity computation of representative features between topics, obtains the similarity between the topics' feature vectors and the topic hashtags, so as to reduce (merge) topic groups whose average similarity exceeds the threshold and thereby obtain the video analysis result.
The present invention further discloses a video topic analysis method executed by computer equipment, the method comprising the steps of: causing a text-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on text sequence data obtained from video data to obtain representative features of the text sequence; causing an image-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence; causing a speech-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence; causing a cross-modal interleaved attention computation and encoding module to perform nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data; causing a clustering module to cluster the multi-modal vector representation of the video data to produce a clustering result; and causing a topic generation module to perform topic creation on the clustering result to produce at least one topic group and then, through similarity computation among the topic groups, merge topic groups whose similarity exceeds a threshold to obtain the video analysis result.
In the above method, the step of causing the text-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the text sequence data obtained from the video data further includes: segmenting the text sequence data into a number of words; vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
In the above method, the step of causing the image-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the image frame sequence obtained from the video data further includes: cutting the image frame sequence into a number of image frames; vector-encoding each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
In the above method, the step of causing the speech-feature interleaved attention computation and encoding module to perform interleaved attention computation and encoding on the speech sequence obtained from the video data further includes: cutting the speech sequence into a number of speech frames; vector-encoding each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
In the above method, the step of causing the topic generation module to perform topic creation on the clustering result to produce at least one topic group and then compute similarity among the topic groups further includes: vector-encoding each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and performing interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
The present invention further discloses a computer-readable medium, for use in a computing device or computer, which stores instructions for executing the aforementioned video topic analysis method.
In summary, the video topic analysis system, method, and computer-readable medium of the present invention process multi-modal data such as text, images, and speech through two stages of interleaved attention, thereby obtaining feature information that is more representative and fuses rich multi-modal information. In the first stage, the per-modality interleaved attention computation and encoding modules use an interleaved attention mechanism to perform interleaved attention computation and encoding on the text, image, and speech data separately. In the second stage, the cross-modal interleaved attention computation and encoding module uses a cross-modal interleaved attention mechanism to perform nonlinear latent-vector alignment on the encoded vector results of the text, image, and speech sequence data produced by the first stage; after this computation and encoding, a representative multi-modal vector representation is obtained. Then, through feature clustering, topic creation, and topic merging, the final video topic analysis result is produced. In short, the present invention considers the text, image, and speech data in a video and performs cross-modal integration, providing better analytic results for video topic analysis.
1: Video topic analysis system
11: Text-feature interleaved attention computation and encoding module
111: Word segmentation unit
112: Text sequence vector initialization unit
113: Text interleaved attention computation and encoding processing unit
12: Image-feature interleaved attention computation and encoding module
121: Image frame-cutting unit
122: Image sequence vector initialization unit
123: Image interleaved attention computation and encoding processing unit
13: Speech-feature interleaved attention computation and encoding module
131: Speech frame-cutting unit
132: Speech sequence vector initialization unit
133: Speech interleaved attention computation and encoding processing unit
14: Cross-modal interleaved attention computation and encoding module
15: Clustering module
16: Topic generation module
161: Topic creation unit
162: Topic reduction unit
301-306: Processes
501-504: Processes
S401-S406: Steps
Figure 1 is a system architecture diagram of the video topic analysis system of the present invention.
Figure 2 is an internal architecture diagram of the modules of the video topic analysis system of the present invention.
Figure 3 is an operation flow chart of the video topic analysis system of the present invention in a specific example.
Figure 4 is a step diagram of the video topic analysis method of the present invention.
Figure 5 is a schematic diagram of an application example of the video topic analysis method of the present invention.
The technical content of the present invention is described below by way of specific embodiments; those skilled in the art can readily understand the advantages and effects of the present invention from the disclosure of this specification. The present invention may also be implemented or applied through other, different embodiments.
Figure 1 is a system architecture diagram of the video topic analysis system of the present invention. As shown in the figure, the video topic analysis system 1 of the present invention includes a text-feature interleaved attention computation and encoding module 11, an image-feature interleaved attention computation and encoding module 12, a speech-feature interleaved attention computation and encoding module 13, a cross-modal interleaved attention computation and encoding module 14, a clustering module 15, and a topic generation module 16.
The text-feature interleaved attention computation and encoding module 11 performs interleaved attention computation and encoding on the text sequence data obtained from the video data to obtain representative features of the text sequence. In short, module 11 applies interleaved attention computation and encoding to the text data in the video data, yielding the representative features of the video's text.
The image-feature interleaved attention computation and encoding module 12 performs interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence. In other words, module 12 applies interleaved attention computation and encoding to the image data in the video data, yielding the representative features of the video's image sequence.
The speech-feature interleaved attention computation and encoding module 13 performs interleaved attention computation and encoding on the speech sequence obtained from the video data to obtain representative features of the speech sequence. In short, module 13 applies interleaved attention computation and encoding to the speech (audio) data in the video data, yielding the representative features of the video's speech sequence.
The cross-modal interleaved attention computation and encoding module 14 performs nonlinear latent-vector alignment of the representative features of the text, image, and speech sequences to obtain a multi-modal vector representation of the video data. Specifically, module 14 integrates the representative features of the text sequence from module 11, of the image sequence from module 12, and of the speech sequence from module 13. Because modules 11, 12, and 13 encode the representative features of their respective sequences into sequence vectors of the same dimensionality, module 14 can, through nonlinear latent-vector alignment, merge the sequence vectors obtained from the video's text, image, and speech data, thereby obtaining a multi-modal vector representation of the video data.
The clustering module 15 clusters the multi-modal vector representation of the video data to produce a clustering result. Once module 14 has obtained the multi-modal vector representation of the video data, a clustering technique groups the videos' representative features so that topic creation or merging can subsequently be performed on each group.
In one embodiment, the clustering technique may use the density-based HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to cluster the features.
The topic generation module 16 performs topic creation on the clustering result to produce at least one topic group and then, through similarity computation and comparison among the topic groups, merges topic groups whose similarity exceeds a threshold to obtain the video analysis result. In short, module 16 creates a topic for each cluster and assigns a corresponding topic hashtag. After multiple topic groups have been created, groups with high similarity are merged through similarity comparison; similarity can be judged from the similarity between each topic's feature vectors and the topic hashtags. When the similarity value exceeds a predetermined threshold, two topic groups are judged highly similar and can be further reduced into one.
In addition, the topic creation technique may be a class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) algorithm. The c-TF-IDF algorithm weights words, as is done in text mining or natural language processing, so as to reflect each word's importance to a document; since this topic creation technique is well known, it is not described further here.
As can be seen from the above, modules 11, 12, and 13 first perform interleaved attention computation and encoding on each modality of the video data separately; cross-modal integration then produces the multi-modal vector representation of the video data. The clustering module 15 next performs clustering, and finally the topic generation module 16 creates topics for the clustering result, produces representative topic hashtags, and reduces groups according to the similarity between them, ultimately yielding the topic analysis result for the video data.
Figure 2 is an internal architecture diagram of the modules of the video topic analysis system of the present invention. As shown in the figure, this embodiment further describes the internal architecture of the text-feature interleaved attention computation and encoding module 11, the image-feature interleaved attention computation and encoding module 12, the speech-feature interleaved attention computation and encoding module 13, and the topic generation module 16.
The text-feature interleaved attention computation and encoding module 11 includes a word segmentation unit 111, a text sequence vector initialization unit 112, and a text interleaved attention computation and encoding processing unit 113. The word segmentation unit 111 segments the text sequence data into a number of words; the text sequence vector initialization unit 112 vector-encodes each word to produce a word vector so that, after the word vectors are initialized, an input sequence vector of the text data is obtained; and the text interleaved attention computation and encoding processing unit 113 performs interleaved attention computation and encoding on the input sequence vector of the text data to obtain the representative features of the text sequence.
The image-feature interleaved attention computation and encoding module 12 includes an image frame-cutting unit 121, an image sequence vector initialization unit 122, and an image interleaved attention computation and encoding processing unit 123. The image frame-cutting unit 121 cuts the image frame sequence into a number of image frames; the image sequence vector initialization unit 122 vector-encodes each image frame to produce an image frame vector so that, after the image frame vectors are initialized, an input sequence vector of the image data is obtained; and the image interleaved attention computation and encoding processing unit 123 performs interleaved attention computation and encoding on the input sequence vector of the image data to obtain the representative features of the image sequence.
The speech-feature interleaved attention computation and encoding module 13 includes a speech frame-cutting unit 131, a speech sequence vector initialization unit 132, and a speech interleaved attention computation and encoding processing unit 133. The speech frame-cutting unit 131 cuts the speech sequence into a number of speech frames; the speech sequence vector initialization unit 132 vector-encodes each speech frame to produce a speech frame vector so that, after the speech frame vectors are initialized, an input sequence vector of the speech data is obtained; and the speech interleaved attention computation and encoding processing unit 133 performs interleaved attention computation and encoding on the input sequence vector of the speech data to obtain the representative features of the speech sequence.
The topic generation module 16 includes a topic creation unit 161 and a topic reduction unit 162. When creating a topic for each topic group, the topic creation unit 161 produces a representative topic hashtag for that group. The topic reduction unit 162 can, through text similarity computation and similarity computation of representative features between topics, obtain the similarity between the topics' feature vectors and the topic hashtags, so as to reduce topic groups whose average similarity exceeds the threshold and thereby obtain the video analysis result. A specific example is described below.
Figure 3 is an operation flow chart of the video topic analysis system of the present invention in a specific example; please also refer to Figure 2. As shown in the figure, the video topic analysis system processes multi-modal data such as text, images, and speech through two stages of interleaved attention, thereby obtaining feature information that is more representative and fuses rich multi-modal information.
In process 301, large-scale data is provided. The large-scale data in this process is video data, which contains a text portion, an image portion, and a speech portion.
In process 302, the first stage performs interleaved attention computation and encoding for each modality's features. Specifically, text, images, and speech are handled by interleaved attention computation and encoding modules of different modalities: the first-stage text-feature module handles the text portion, the first-stage image-feature module handles the image portion, and the first-stage speech-feature module handles the speech portion, in the manner described above with respect to Figure 2. That is, modules 11, 12, and 13 process the text data, image data, and speech data respectively; after interleaved attention computation and encoding, they produce the representative features of the text sequence, the image sequence, and the speech sequence.
The text-feature interleaved attention computation and encoding module 11 provides interleaved attention computation and encoding for text sequence data. Specifically, the word segmentation unit 111 segments the text sequence data and assigns each word a vector encoding to build a word vector. The text sequence vector initialization unit 112 can then obtain the word vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec, a high-performance self-supervised algorithm applicable to multiple modalities). This yields the input sequence vector of the text data, X_KV(text) ∈ R^(M(text)×C(text)), where M(text) is the length of the word-segmented text sequence data and C(text) is a hyperparameter, the dimensionality of the text input sequence vector. The text interleaved attention computation and encoding processing unit 113 holds a trainable model parameter X_Q(text) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 11 as described, the representative feature of the text sequence, X_MLP(text), is obtained.
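The core of the computation above, a trainable query matrix X_Q attending over an input sequence X_KV, can be sketched minimally. This is an illustrative sketch, not the patented implementation: the projection matrices and the MLP that produces X_MLP are omitted, and the function names are ours.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(x_q, x_kv):
    """Learned queries x_q (N rows) attend over the input sequence x_kv
    (M rows of the same dimensionality): each query row becomes a
    weighted mixture of the input rows, so a length-M sequence is
    compressed to a fixed stack of N vectors regardless of M."""
    d = len(x_q[0])
    out = []
    for q in x_q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in x_kv]
        weights = softmax(scores)  # one weight per input position
        out.append([sum(w * row[j] for w, row in zip(weights, x_kv))
                    for j in range(d)])
    return out
```

Because N is fixed, the output size no longer depends on the input length M, which is what allows text, image, and speech sequences of different lengths to be encoded to representations of the same shape for the second stage.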
The image-feature interleaved attention computation and encoding module 12 provides interleaved attention computation and encoding for image sequence data. Specifically, for the image sequence data, the image frame sequence captured from the video data is used as the input sequence vector X_KV(image) ∈ R^(M(image)×C(image)), where M(image) is the length of the image frame sequence data and C(image) is a hyperparameter, the dimensionality of the input sequence vector. The image frame-cutting unit 121 provides the frame-cutting processing, and the image sequence vector initialization unit 122 can obtain the image vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec). The image interleaved attention computation and encoding processing unit 123 holds a trainable model parameter X_Q(image) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 12 as described, the representative feature of the image sequence, X_MLP(image), is obtained.
The speech-feature interleaved attention computation and encoding module 13 provides interleaved attention computation and encoding for speech sequence data. Specifically, for the speech sequence data, the speech (audio) sequence captured from the video data is used as the input sequence vector X_KV(audio) ∈ R^(M(audio)×C(audio)), where M(audio) is the length of the speech frame sequence data and C(audio) is a hyperparameter, the dimensionality of the input sequence vector. The speech frame-cutting unit 131 provides the frame-cutting processing, and the speech sequence vector initialization unit 132 can obtain the speech vectors and vector weights through Xavier initialization or He initialization, or through a pre-trained model (for example Data2vec). The speech interleaved attention computation and encoding processing unit 133 holds a trainable model parameter X_Q(audio) ∈ R^(N×D) used for the interleaved attention computation and encoding. After processing by module 13 as described, the representative feature of the speech sequence, X_MLP(audio), is obtained.
In process 303, the second stage performs cross-modal interleaved attention computation and encoding. In this process, the encoded vector results corresponding to the representative features of the text, image, and speech sequences produced in process 302 are subjected to nonlinear latent-vector alignment through cross-modal interleaved attention computation and encoding. In other words, the encoding computation of the previous process has already encoded the text, image, and speech data into sequence vectors of the same dimensionality, so information fusion and latent-vector alignment of the three modal features can be performed, that is, another round of interleaved attention computation and encoding. Here, a cross-modal interleaved attention mechanism performs nonlinear latent-vector alignment on the text, image, and speech encoded vector results produced by modules 11, 12, and 13.
Specifically, the interleaved attention computation and encoding results for text, image, and speech, namely X_MLP(text), X_MLP(image), and X_MLP(audio), are computed separately in process 302. This process then concatenates these tensors to obtain the input sequence vector of the multi-modal data, X_KV(multimodal) ∈ R^(3N×D), after which a trainable model parameter X_Q(multimodal) ∈ R^(N×D) is established for the cross-modal interleaved attention computation and encoding and the multi-modal feature extraction.
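The concatenation step can be sketched as follows, under the assumption that each first-stage output is an N×D list of vectors (the function name is ours):

```python
def concat_modalities(x_text, x_image, x_audio):
    """Stack the three first-stage outputs along the sequence axis:
    three N x D blocks become one 3N x D multi-modal input sequence,
    over which the second-stage latent queries (N x D) then attend."""
    return x_text + x_image + x_audio
```

Because all three first-stage outputs share the dimensionality D, the concatenation is well defined; the second cross-attention pass then compresses the 3N rows back to N.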
After the interleaved attention computation and encoding described above, a representative feature of each video is obtained that contains multi-modal information. Because the interleaved attention computation and encoding of process 303 automatically encodes the representative feature vectors into a smaller dimensionality, no additional dimensionality-reduction processing is needed before the subsequent clustering; this also increases computation speed and avoids the information loss of dimensionality-reduction computation.
In process 304, clustering is performed. This process uses a clustering technique (for example, but not limited to, HDBSCAN) to cluster the representative features of the videos.
In process 305, topic creation and topic reduction are performed. This process can use c-TF-IDF (the class-based TF-IDF algorithm) to perform topic creation on the clustering result of process 304, thereby obtaining a representative topic hashtag for each topic. In addition, text similarity computation (for example, but not limited to, Word Mover's Distance) and similarity computation of representative features between topics (for example, but not limited to, cosine similarity) can be used to compute the similarity between each topic's feature vectors and the topic hashtags. Finally, topic reduction is performed on topic groups whose average similarity (the text similarity result and the topic representative-feature similarity result are each normalized to a value between 0 and 1, and the two are averaged) exceeds a threshold (for example, but not limited to, 0.75).
In process 306, the video topic analysis result is produced. That is, after the topic creation and topic reduction of process 305, the video topic analysis result is obtained.
Figure 4 is a step diagram of the video topic analysis method of the present invention, illustrating how video topic analysis is performed automatically.
In step S401, the text-feature interleaved attention computation and encoding module performs interleaved attention computation and encoding on the text sequence data obtained from the video data to obtain representative features of the text sequence. This step performs feature extraction on the text sequence. Specifically, the text sequence data is first segmented into a number of words; each word is then vector-encoded to produce a word vector so that, after the word vectors are initialized, the input sequence vector of the text data is obtained; finally, the input sequence vector of the text data undergoes interleaved attention computation and encoding to obtain the representative features of the text sequence.
In step S402, the image-feature interleaved attention computation and encoding module performs interleaved attention computation and encoding on the image frame sequence obtained from the video data to obtain representative features of the image sequence. This step performs feature extraction on the image sequence. Specifically, the image frame sequence is first cut into a number of image frames; each image frame is then vector-encoded to produce an image frame vector so that, after the image frame vectors are initialized, the input sequence vector of the image data is obtained; finally, the input sequence vector of the image data undergoes interleaved attention computation and encoding to obtain the representative features of the image sequence.
In step S403, the speech-feature interleaved attention calculation and encoding module performs interleaved attention calculation and encoding on the speech sequence obtained from the video data, to obtain the representative features of the speech sequence. This step performs feature extraction on the speech sequence. Specifically, the speech sequence is first cut into frames to produce several speech frames; vector encoding is then applied to each speech frame to produce a speech frame vector, and after each speech frame vector is initialized, the input sequence vector of the speech data is obtained. Finally, the input sequence vector of the speech data undergoes interleaved attention calculation and encoding to yield the representative features of the speech sequence.
In step S404, the cross-modal interleaved attention calculation and encoding module performs nonlinear latent-vector alignment of the representative features of the text sequence, the image sequence, and the speech sequence to obtain a multi-modal vector representation of the video data. This step describes the multi-modal interleaved attention calculation and encoding, which can be carried out through procedures such as multi-modal tensor sequence concatenation, cross-modal interleaved attention calculation and encoding, and multi-modal sequence feature mapping, encoding, and extraction; the aim is to extract the features of the multi-modal video sequences for the subsequent multi-modal video topic analysis.
In step S405, the clustering module groups the multi-modal vector representations of the video data to produce a clustering result. This step performs the clustering of the multi-modal vector representations of the video data.
In step S406, the topic generation module performs topic creation on the clustering result to produce at least one topic group and, through similarity calculation and comparison among the topic groups, merges the topic groups whose similarity exceeds a threshold to obtain the video analysis result. This step carries out topic creation and topic reduction (merging), producing the automatic video clusters along with each cluster's representative hashtag and representative video content information.
FIG. 5 is a schematic diagram of an application example of the video topic analysis method of the present invention. As shown in the figure, in process 501 a large amount of video data serves as input; in process 502, data pre-processing techniques extract text, image, and speech sequence data from the video data; in process 503, topic analysis is performed by the video topic analysis model based on the multi-modal interleaved attention encoding mechanism; finally, the video topic analysis results shown in process 504 are produced. A concrete embodiment is described below.
First, the video data is cut into frames; for example, 1 second of video can be cut into 30 frames, and by extension 10 seconds yields 300 frames. The image content of each frame is then extracted to obtain the input data of the image sequence, and the audio information in the video is extracted to obtain the input data of the speech sequence. The text sequence data can be obtained in (but not limited to) the following three ways: first, if the video data has a complete subtitle file (such as an .srt file), the text in the subtitle file is used as the input data of the text sequence; second, if there is no subtitle file but the video images contain clearly visible subtitles, optical character recognition (OCR) or similar techniques can extract the corresponding subtitle content as the input data of the text sequence; third, speech recognition can transcribe the speech content into text, yielding the input data of the text sequence.
After the text, image, and speech sequence data are obtained as above, interleaved attention calculation and encoding is first applied to each modality separately. For the text sequence data, the text is segmented into words and each word is assigned a vector encoding to build a word vector, giving the input sequence vector X_KV(text) ∈ R^(M(text)×C(text)), where M(text) is the length of the segmented text sequence and the hyperparameter C(text) is the dimension of the text input sequence vector. Next, a trainable model parameter X_Q(text) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for text data is computed as X_QKV(text) = f_o(text)(X_QK(text) V(text)) = Attention(X_Q(text), X_KV(text)), where X_QK(text) = softmax(Q(text) K(text)^T / √F), Q(text) = f_Q(text)(X_Q(text)), K(text) = f_K(text)(X_KV(text)), and V(text) = f_V(text)(X_KV(text)). Here f_Q(text), f_K(text), and f_V(text) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(text) is a feedforward neural network that produces X_QKV(text) ∈ R^(N×D). Finally, X_QKV(text) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the text sequence data, X_MLP(text) ∈ R^(N×D).
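The interleaved attention above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the learnable projections f_Q, f_K, f_V are replaced by identity maps, and the softmax scaling is assumed to be √F, where F is the feature dimension.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def intercross_attention(x_q, x_kv):
    """Latent-query attention: x_q is the fixed-length trainable latent
    sequence (N rows), x_kv the input sequence (M rows). The score matrix
    is N x M, so the cost grows as O(M*N) rather than O(M^2)."""
    f = len(x_q[0])                                  # feature dimension F
    scores = matmul(x_q, transpose(x_kv))            # N x M
    weights = [softmax([s / math.sqrt(f) for s in row]) for row in scores]
    return matmul(weights, x_kv)                     # N x F, regardless of M
```

As a sanity check, if every input row is identical, every output row is that same row, since each output row is a convex combination of the input rows.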
For the image sequence data, the sequence of image frames captured from the video data is used as the input sequence vector X_KV(image) ∈ R^(M(image)×C(image)), where M(image) is the length of the image frame sequence and the hyperparameter C(image) is the dimension of the input sequence vector. Next, a trainable model parameter X_Q(image) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for image data is computed as X_QKV(image) = f_o(image)(X_QK(image) V(image)) = Attention(X_Q(image), X_KV(image)), where X_QK(image) = softmax(Q(image) K(image)^T / √F), Q(image) = f_Q(image)(X_Q(image)), K(image) = f_K(image)(X_KV(image)), and V(image) = f_V(image)(X_KV(image)). Here f_Q(image), f_K(image), and f_V(image) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(image) is a feedforward neural network that produces X_QKV(image) ∈ R^(N×D). Finally, X_QKV(image) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the image sequence data, X_MLP(image) ∈ R^(N×D).
For the speech sequence data, the speech (audio) sequence captured from the video data is used as the input sequence vector X_KV(audio) ∈ R^(M(audio)×C(audio)), where M(audio) is the length of the speech frame sequence and the hyperparameter C(audio) is the dimension of the input sequence vector. Next, a trainable model parameter X_Q(audio) ∈ R^(N×D) is created for the interleaved attention calculation and encoding. The interleaved attention for speech data is computed as X_QKV(audio) = f_o(audio)(X_QK(audio) V(audio)) = Attention(X_Q(audio), X_KV(audio)), where X_QK(audio) = softmax(Q(audio) K(audio)^T / √F), Q(audio) = f_Q(audio)(X_Q(audio)), K(audio) = f_K(audio)(X_KV(audio)), and V(audio) = f_V(audio)(X_KV(audio)). Here f_Q(audio), f_K(audio), and f_V(audio) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(audio) is a feedforward neural network that produces X_QKV(audio) ∈ R^(N×D). Finally, X_QKV(audio) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the speech sequence data, X_MLP(audio) ∈ R^(N×D).
Next, the interleaved attention calculation and encoding results of text, image, and speech computed in the previous stage, X_MLP(text), X_MLP(image), and X_MLP(audio), are concatenated into one tensor, giving the input sequence vector of the multi-modal data X_KV(multimodal) ∈ R^(3N×D). A trainable model parameter X_Q(multimodal) ∈ R^(N×D) is then created for the cross-modal interleaved attention calculation and encoding, computed as X_QKV(multimodal) = f_o(multimodal)(X_QK(multimodal) V(multimodal)) = Attention(X_Q(multimodal), X_KV(multimodal)), where X_QK(multimodal) = softmax(Q(multimodal) K(multimodal)^T / √F), Q(multimodal) = f_Q(multimodal)(X_Q(multimodal)), K(multimodal) = f_K(multimodal)(X_KV(multimodal)), and V(multimodal) = f_V(multimodal)(X_KV(multimodal)). Here f_Q(multimodal), f_K(multimodal), and f_V(multimodal) are three feedforward neural networks that map their inputs into tensors of dimension F, and softmax is the normalized exponential function. f_o(multimodal) is a feedforward neural network that produces X_QKV(multimodal) ∈ R^(N×D). Finally, X_QKV(multimodal) ∈ R^(N×D) is fed into a multi-layer perceptron (MLP) to obtain the interleaved attention calculation and encoding result of the multi-modal sequence data, X_MLP(multimodal) ∈ R^(N×D).
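The tensor concatenation in this stage is along the sequence axis. A tiny shape check (with placeholder zero tensors and arbitrarily chosen toy sizes) illustrates how three N×D modality encodings become one 3N×D multi-modal input, which the N×D latent query then compresses back to N rows:

```python
N, D = 4, 3  # toy sizes; the real N and D are model hyperparameters

x_text  = [[0.0] * D for _ in range(N)]
x_image = [[0.0] * D for _ in range(N)]
x_audio = [[0.0] * D for _ in range(N)]

# Concatenate along the sequence axis: (N + N + N) x D = 3N x D
x_kv_multimodal = x_text + x_image + x_audio
assert len(x_kv_multimodal) == 3 * N and len(x_kv_multimodal[0]) == D

# The latent query sequence stays N x D, so the cross-modal attention
# output is again N x D, independent of the concatenated length 3N.
x_q_multimodal = [[0.0] * D for _ in range(N)]
```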
Through the above two stages of interleaved attention calculation and encoding, the encoding result of the multi-modal sequence data aligned in the latent vector space, X_MLP(multimodal) ∈ R^(N×D), is obtained. Next, three vectors are computed from the N vectors of X_MLP(multimodal): their average, avg(X_MLP(multimodal)) ∈ R^(1×D); their element-wise product, element_wise_product(X_MLP(multimodal)) ∈ R^(1×D); and the front vector, X_MLP(multimodal)[0] ∈ R^(1×D). These three vectors are then concatenated as concatenate(avg(X_MLP(multimodal)), element_wise_product(X_MLP(multimodal)), X_MLP(multimodal)[0]) ∈ R^(3×D), hereinafter abbreviated X_concatenate(multimodal), and fed into a multi-layer perceptron (MLP) and a residual block to obtain the final encoding result of the multi-modal sequence data, X_final(multimodal). The formula is X_final(multimodal) = MLP(X_concatenate(multimodal)) ⊕ X_concatenate(multimodal), where ⊕ denotes element-wise addition.
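The three pooling operations and the residual combination can be sketched as follows. This is an illustrative sketch only: the MLP is stubbed out as an identity function, since its layer sizes are not specified here, and the 3×D tensor is handled as a flat list of length 3D.

```python
def pool_and_concatenate(x):
    """x: the N x D encoding as a list of row vectors.
    Returns avg, element-wise product, and front vector, concatenated."""
    n, d = len(x), len(x[0])
    avg = [sum(row[j] for row in x) / n for j in range(d)]
    prod = [1.0] * d
    for row in x:                      # element-wise product over the N vectors
        prod = [p * v for p, v in zip(prod, row)]
    first = list(x[0])                 # the front vector x[0]
    return avg + prod + first          # concatenated: length 3*D

def mlp(v):                            # placeholder for the trained MLP
    return v

def final_encoding(x):
    c = pool_and_concatenate(x)
    return [m + r for m, r in zip(mlp(c), c)]   # residual: MLP(c) ⊕ c
```

With the identity stub, the residual step simply doubles each component; in the trained model, the MLP output is a learned transformation of the concatenated vector.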
The multi-modal attention encoding mechanism model of the present invention is intended for unsupervised video topic analysis. Therefore, before the unsupervised video topic analysis task is performed, the model weights need to be pre-trained on a separate supervised task so that more representative multi-modal features are learned, improving the effectiveness of the unsupervised video topic analysis. The present invention pre-trains the multi-modal interleaved attention encoding mechanism model on the supervised task of text-to-video retrieval.
For the text-to-video retrieval task, X_MLP(text) is fed into a multi-layer perceptron (MLP), and the result X_MLP(MLP(text)) ∈ R^(3×D) serves as the vector representation of the text, while X_final(multimodal) serves as the vector representation of the entire video; similarity matching is then trained on these. That is, for text and video data labeled as positive samples (highly related), the supervised training target of the model is a cosine similarity of cosine-sim(X_MLP(MLP(text)), X_final(multimodal)) = 1; for text and video data labeled as negative samples (unrelated), the target is cosine-sim(X_MLP(MLP(text)), X_final(multimodal)) = -1.
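The cosine-similarity training target can be illustrated directly; this is the generic cosine similarity, not tied to the patent's trained vectors:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v: +1 when aligned,
    -1 when opposite, 0 when orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# During pre-training, a positive (related) text-video pair is pushed
# toward cosine similarity +1, a negative (unrelated) pair toward -1.
```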
Processed by the above technical means, a representative feature of each video is obtained that contains multi-modal information. Because the cross-modal interleaved attention calculation and encoding module automatically encodes this representative feature vector into a smaller dimension, the subsequent clustering can be performed without any additional dimensionality reduction, which speeds up the computation and avoids the information loss that extra dimensionality-reduction calculations would cause.
Next, a clustering technique groups the representative features of the videos, and c-TF-IDF (class-based TF-IDF) is applied to the clustering result for topic creation, yielding a representative hashtag for each topic. Text similarity measures (such as, but not limited to, Word Mover's Distance) and similarity measures between the representative features of topics (such as, but not limited to, cosine similarity) are then used to compute the similarity of feature vectors and hashtags across topics. Finally, topic groups whose average similarity exceeds a threshold undergo topic reduction, producing the final video topic analysis result.
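The topic reduction step — merging topic groups whose similarity exceeds a threshold — can be sketched as a small union-find over topic centroids. The centroid vectors and the 0.9 threshold below are illustrative values, not values from the patent, and the real system compares hashtag text similarity as well.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reduce_topics(centroids, threshold):
    """Merge topic groups whose pairwise centroid similarity exceeds the
    threshold; returns a merged-group id for each original topic."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine_sim(centroids[i], centroids[j]) > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(centroids))]
```

For example, with centroids [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]] and threshold 0.9, the first two topics merge into one group while the third stays separate.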
In one embodiment, each of the above systems, modules, servers, and devices may implement the foregoing in software, hardware, or firmware. If hardware, it may be a processing unit, processor, computer, or server with data processing and computing capability; if software or firmware, it may comprise instructions executable by a processing unit, processor, computer, or server, and may be installed on the same hardware device or distributed across a plurality of different hardware devices.
In addition, the present invention also discloses a computer-readable medium, applied in a computing device or computer having a processor (e.g., CPU, GPU) and/or memory, which stores instructions such that the computing device or computer executes the computer-readable medium through the processor and/or memory, thereby performing the above method and its steps.
In summary, the present invention discloses a video topic analysis system, method, and computer-readable medium based on a multi-modal interleaved attention encoding mechanism, in which the interleaved attention calculation and encoding module of each modality extracts features from the input text, image, and speech data, and the cross-modal interleaved attention calculation and encoding module projects those features into the same latent vector space for alignment, further capturing and fusing the characteristics of the multi-modal information. Finally, clustering, topic creation, and topic reduction procedures produce the video topic analysis result, quickly and accurately yielding automatic video clusters together with each cluster's representative hashtag and representative video content information. The present invention further has the following features and effects:
First, the present invention can be used to perform topic analysis on video data, letting users grasp the important information in the video data, the potential topic categories and number of potential topic groups, and the representative hashtags.
Second, the present invention proposes an interleaved attention calculation and encoding module applicable to text, image, and speech data. It first vector-encodes sequence data of arbitrary length (sequence length M) for each of these modalities and computes the corresponding Key and Value vector sequences, while the corresponding Query vectors are encoded from a fixed-length latent sequence vector (sequence length N). The Key, Value, and Query vectors then enter the attention computation, whose time complexity is O(MN); since M is generally much larger than N and N is fixed, this simplifies to O(M), clearly lower than the O(M²) time complexity of the self-attention mechanism in a standard Transformer model. The method of the present invention is therefore able to process longer multi-modal sequence data.
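The complexity claim can be made concrete by counting score-matrix entries; the sequence lengths below are only an example, not values from the patent:

```python
M = 9000   # e.g., a 5-minute video cut at 30 frames per second
N = 64     # fixed latent sequence length (a hyperparameter)

latent_query_cost = M * N    # interleaved attention score matrix: O(M*N)
self_attention_cost = M * M  # standard Transformer self-attention: O(M^2)

# The advantage over self-attention grows linearly with M.
assert self_attention_cost // latent_query_cost == M // N
```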
Third, the present invention proposes a cross-modal interleaved (intercross) attention calculation and encoding module that computes multi-modal information with the interleaved attention mechanism and, through the encoding module, directly performs dimensional projection and compression on the multi-modal latent sequence vectors aligned in the latent vector space. The method automatically fuses and aligns information from different modalities and compresses dimensions directly, without the additional dimensionality-reduction step common in topic analysis (typically the UMAP, Uniform Manifold Approximation and Projection, algorithm). Because that extra dimensionality reduction requires additional computation and also causes a loss of information features, avoiding it lets the method of the present invention improve both the multi-modal information integration capability and the computational efficiency.
The detailed description above is a specific description of one feasible embodiment of the present invention. The embodiment is not intended to limit the patent scope of the present invention, and any equivalent implementation or modification that does not depart from the technical spirit of the present invention shall be included within the patent scope of the present invention.
1: video topic analysis system
11: text-feature interleaved attention calculation and encoding module
12: image-feature interleaved attention calculation and encoding module
13: speech-feature interleaved attention calculation and encoding module
14: cross-modal interleaved attention calculation and encoding module
15: clustering module
16: topic generation module
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW112106317A TWI830604B (en) | 2023-02-21 | 2023-02-21 | Video topic analysis system, method and computer readable medium thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
TWI830604B true TWI830604B (en) | 2024-01-21 |
Family
ID=90459303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW112106317A TWI830604B (en) | 2023-02-21 | 2023-02-21 | Video topic analysis system, method and computer readable medium thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI830604B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783691A (en) * | 2018-12-29 | 2019-05-21 | 四川远鉴科技有限公司 | A kind of video retrieval method of deep learning and Hash coding |
CN110363164A (en) * | 2019-07-18 | 2019-10-22 | 南京工业大学 | A kind of unified approach based on LSTM time consistency video analysis |
CN114040126A (en) * | 2021-09-22 | 2022-02-11 | 西安深信科创信息技术有限公司 | Character-driven character broadcasting video generation method and device |
US20220406038A1 (en) * | 2021-06-21 | 2022-12-22 | Tubi, Inc. | Training data generation for advanced frequency management |
- 2023-02-21: TW application TW112106317A, patent TWI830604B, active