WO2023137913A1 - Video text summarization method based on multi-modal model, device and storage medium - Google Patents

Video text summarization method based on multi-modal model, device and storage medium

Info

Publication number: WO2023137913A1
Authority: WO (WIPO, PCT)
Prior art keywords: text, video, information, features, feature vector
Application number: PCT/CN2022/090712
Other languages: French (fr), Chinese (zh)
Inventors: 舒畅 (Shu Chang), 陈又新 (Chen Youxin)
Original assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2022-01-18 (from Chinese application No. 202210056075.6; see the priority claim in the Description)
Publication of WO2023137913A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Definitions

  • The embodiments of the present application relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multimodal model.
  • The current processing method mainly scores each sentence for importance and extracts the highest-scoring sentences. Because several sentences are pulled out of the monologue in isolation, the resulting summary is incoherent from sentence to sentence.
  • An embodiment of the present application provides a video text summarization method based on a multimodal model, including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a video text summarization device based on a multimodal model, including:
  • a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • a vectorization module, configured to vectorize the video features to obtain video feature vectors;
  • a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
  • a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
  • a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
  • a training module, configured to input the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing a video text summarization method based on a multimodal model, the method including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • In the video text summarization method, device and storage medium based on the multimodal model proposed in the embodiments of the present application, a sub-layer for fusing image-class features and text-class features is newly provided in the encoder of the transformer model, so that the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate; that is, the accuracy of the text summary content extracted from the video can be effectively improved.
  • Fig. 1 is a schematic diagram of a system architecture platform for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 3 is a schematic diagram of the improved transformer model in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 4 is a flowchart of generating a text summary result in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of generating fusion encoding information in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • An embodiment of the present application provides a video text summarization method based on a multimodal model. The method includes the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features; the sub-layer includes Text-vision fusion and Add&Norm.
  • Because this sub-layer is newly provided in the encoder of the transformer model, the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
  • FIG. 1 is a schematic diagram of a system architecture platform 100 for implementing a video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • The system architecture platform 100 is provided with a processor 110 and a memory 120, where the processor 110 and the memory 120 may be connected via a bus or in other ways; in FIG. 1, connection via a bus is taken as an example.
  • the memory 120 can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory 120 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 120 may optionally include memories that are remotely located relative to the processor 110, and these remote memories may be connected to the system architecture platform through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The system architecture platform can be applied to 5G communication network systems and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
  • The system architecture platform shown in FIG. 1 does not limit the embodiments of the present application; it may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components.
  • The system architecture platform 100 may be an independent platform, or it may be a cloud platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
  • Figure 2 is a flowchart of a video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • The video text summarization method based on a multimodal model is applied to the above-mentioned architecture platform and includes, but is not limited to, steps S100, S200, S300, S400, S500 and S600.
  • Step S100: feature extraction is performed on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • Step S200: the video features are vectorized to obtain video feature vectors;
  • Step S300: speech extraction is performed on the video data to obtain monologue speech information;
  • Step S400: the monologue speech information is converted into text information through automatic speech recognition (ASR) technology;
  • Step S500: the text information is segmented to obtain multiple pieces of word information;
  • Step S600: the video feature vectors and the multiple pieces of word information are input into the transformer model for training to obtain a text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features; the sub-layer includes Text-vision fusion and Add&Norm.
  • In one embodiment, the video data from which a text summary needs to be extracted is obtained.
  • The video data includes video image information and speech information, where the speech information includes monologue speech information and background speech information. Feature extraction is performed on the video data to obtain video features, and the video features are vectorized to obtain video feature vectors, which are prepared for the subsequent training step. At the same time, speech extraction can be performed on the video data to obtain the monologue speech information, the monologue speech information is converted into text information through automatic speech recognition (ASR) technology, and the text information is then segmented to obtain multiple pieces of word information.
  • The video feature vectors and the multiple pieces of word information are then input into the transformer model for training, and the text summary result processed by the sub-layer that fuses image-class features and text-class features is obtained.
  • This sub-layer includes Text-vision fusion and Add&Norm, which makes the text summary result obtained by training the video feature vectors and the dictionary in the transformer model more accurate and can effectively improve the accuracy of the text summary content extracted from the video.
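As a hedged illustration of the ASR step (the patent does not name a particular ASR engine; the open-source SpeechRecognition package and the file name below are assumptions of this sketch, not part of the disclosure):

```python
# Hypothetical ASR step: the audio track is first pulled out of the video
# (e.g. with: ffmpeg -i video.mp4 monologue.wav), then transcribed. The
# SpeechRecognition package and Google's free web recognizer are stand-ins;
# any ASR engine fits the method.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("monologue.wav") as source:   # placeholder file name
    audio = recognizer.record(source)

# the language code assumes a Chinese-language monologue, as in this patent
text_information = recognizer.recognize_google(audio, language="zh-CN")
print(text_information)
```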
  • It should be noted that the 3D ResNet-101 model can be used to perform feature extraction on the video data to obtain the video feature vectors; other models can also be used to extract video features from the video data, which is not uniquely limited in this embodiment.
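A hedged sketch (not from the patent) of clip-level feature extraction with a 3D CNN in PyTorch follows. torchvision only ships an 18-layer 3D ResNet (r3d_18), used here as a stand-in for the 3D ResNet-101 named above; the input shape and the 512-dimensional output are properties of r3d_18, not of the patent's model.

```python
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")
model.fc = torch.nn.Identity()  # drop the classifier head, keep pooled features
model.eval()

# a clip of 16 RGB frames at 112x112: (batch, channels, time, height, width)
frames = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    video_feature_vector = model(frames)  # shape (1, 512)
```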
  • The step of segmenting the text information to obtain multiple pieces of word information may use the hanlp word segmentation tool to segment the text information directly into multiple pieces of word information; alternatively, the hanlp tool may first produce multiple pieces of word information arranged in the same line, and the multiple pieces of word information are then split line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure. This embodiment does not specifically limit this.
  • The dictionary in this embodiment includes multiple pieces of word information; each piece of word information occupies an independent line in the dictionary, and each piece of word information corresponds to a line position number.
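The following is a hedged sketch of this segmentation step; the HanLP 2.x tokenizer name (COARSE_ELECTRA_SMALL_ZH) and the sample sentence are illustrative choices, not taken from the patent.

```python
# Segment the ASR transcript with HanLP, then lay the words out one per line
# so each word's row index serves as its dictionary position number.
import hanlp

tokenizer = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
words = tokenizer("基于多模态模型的视频文本摘要方法")  # sample sentence

dictionary = {row: word for row, word in enumerate(words)}  # one word per row
for row, word in dictionary.items():
    print(row, word)
```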
  • It should be noted that, referring to FIG. 3, this embodiment adds a sub-layer to the Encoder Layer of the traditional transformer model; the sub-layer fuses image-class features and text-class features and includes Text-vision fusion and Add&Norm.
  • The improved transformer model includes an Encoder side (encoder) and a Decoder side (decoder).
  • The Encoder Layer on the Encoder side includes three sub-layers: the first sub-layer (multi-head self-attention and Add&Norm), the second sub-layer (FFN and Add&Norm) and the third sub-layer (Text-vision fusion and Add&Norm).
  • The Decoder Layer on the Decoder side includes three sub-layers: the fourth sub-layer (masked multi-head self-attention and Add&Norm), the fifth sub-layer (multi-head Enc-Dec attention and Add&Norm) and the sixth sub-layer (FFN and Add&Norm).
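To make the layer wiring concrete, here is a structural sketch of the modified Encoder Layer in PyTorch. It is an illustration under stated assumptions rather than the patent's implementation: d_model=512 follows the embedding size given below, the linear projection of the concatenated fusion output back to d_model (so that the residual Add&Norm type-checks) is an assumption the patent does not spell out, and TextVisionFusion is the module sketched after steps S610 to S650 later in this section.

```python
import torch
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """Encoder layer with the three sublayers described above:
    (1) multi-head self-attention + Add&Norm,
    (2) FFN + Add&Norm,
    (3) Text-vision fusion + Add&Norm (the newly added sublayer)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.fusion = TextVisionFusion(d_model)      # sketched after S610-S650 below
        self.proj = nn.Linear(2 * d_model, d_model)  # assumption: project concat back
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, z_t, z_v):
        attn_out, _ = self.self_attn(z_t, z_t, z_t)  # 1st sublayer
        z_t = self.norm1(z_t + attn_out)
        z_t = self.norm2(z_t + self.ffn(z_t))        # 2nd sublayer
        fused = self.proj(self.fusion(z_t, z_v))     # 3rd sublayer (new)
        return self.norm3(z_t + fused)
```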
  • It should be noted that Text Inputs is the input text that has already been segmented into words (the dictionary). In the traditional transformer model, these words correspond to three embeddings:
  • one is the token embedding (also called word embedding), whose function is to map human language into a geometric space; one is the segment embedding; and the other is the positional embedding. In this embodiment, only the token embedding and the segment embedding are used.
  • The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N*512 to obtain a token embedding with a vector length of 512, where the text inputs form a vector whose dimension is the number of dictionary lines N; each word in the text inputs corresponds to one position in the dictionary, and that position is marked with 1 while all other positions are marked with 0.
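As a worked illustration of this lookup (the dictionary size N and the row position below are arbitrary), multiplying a one-hot row vector by the N*512 weight matrix simply selects one row of that matrix:

```python
import numpy as np

N, d_model = 10000, 512
W = np.random.randn(N, d_model).astype(np.float32)  # learned weight matrix

def token_embedding(row_position: int) -> np.ndarray:
    one_hot = np.zeros(N, dtype=np.float32)
    one_hot[row_position] = 1.0        # 1 at the word's dictionary row, 0 elsewhere
    return one_hot @ W                 # length-512 embedding == W[row_position]

emb = token_embedding(42)
assert np.allclose(emb, W[42])         # the matmul is just a row lookup
```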
  • It can be understood that step S600 includes, but is not limited to, step S410 and step S420.
  • Step S410: inputting the video feature vectors and the dictionary into the encoder of the transformer model, which is provided with the sub-layer for fusing image features and text features, for fusion processing to obtain fusion encoding information;
  • Step S420: transferring the fusion encoding information to the decoder for decoding processing to generate a text summary result.
  • In one embodiment, the video feature vectors and the dictionary can be respectively input into the encoder of the transformer model provided with the sub-layer for fusing image features and text features for fusion processing to obtain the fusion encoding information, and the fusion encoding information is then passed to the decoder for decoding processing, thereby generating the text summary result.
  • Because the sub-layer for fusing image features and text features is newly provided in the encoder of the transformer model, rather than the model being trained solely on text feature vectors, the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
  • It can be understood that step S410 includes, but is not limited to, step S510 and step S520.
  • Step S510: the dictionary is input into the encoder of the transformer model, and feature extraction is performed on the dictionary sequentially through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm;
  • Step S520: the text feature vectors and the video feature vectors are input into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • In one embodiment, the dictionary can be input into the encoder of the transformer model for processing: the dictionary is processed through the first sub-layer (multi-head self-attention and Add&Norm) and the second sub-layer (FFN and Add&Norm) in the encoder to obtain the text feature vectors, and the text feature vectors and the video feature vectors are then input into the newly added sub-layer, where image features and text features are fused to obtain the fusion encoding information. Compared with training purely on text feature vectors, decoding this fusion encoding information through the decoder to generate a text summary is more accurate, which can effectively improve the accuracy of the text summary extracted from the video.
  • It can be understood that step S520 includes, but is not limited to, steps S610, S620, S630, S640 and S650.
  • Step S610: inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features;
  • Step S620: performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector;
  • Step S630: performing matrix transposition on Z′v to obtain (Z′v)ᵀ;
  • Step S640: multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight;
  • Step S650: multiplying A and Zv to obtain AZv, and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • In one embodiment, the text feature vectors and the video feature vectors are input into the sub-layer for fusing image features and text features.
  • Zv is first multiplied with the weight matrix to obtain Z′v; Z′v is then matrix-transposed to obtain (Z′v)ᵀ; (Z′v)ᵀ is multiplied with Zt and passed through the softmax function to obtain A; A is multiplied with Zv to obtain AZv; and AZv and Zt are vector-concatenated to obtain Z′t.
  • Here Zv is the video feature vector, A is the attention weight, and Z′t is the fusion encoding information.
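A hedged PyTorch sketch of steps S610 to S650 follows. The shapes (Zt as an (n_text, d) matrix of text features, Zv as an (n_vid, d_v) matrix of video features) and the multiplication order Zt x (Z′v)ᵀ are assumptions chosen so that the matrix dimensions are consistent; the patent's prose leaves them implicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVisionFusion(nn.Module):
    """Text-vision fusion sublayer (steps S610-S650)."""

    def __init__(self, d_model=512, d_video=512):
        super().__init__()
        self.W1 = nn.Linear(d_video, d_model, bias=False)  # the weight matrix W1

    def forward(self, z_t, z_v):
        z_v_prime = self.W1(z_v)                    # S620: Z'v = Zv x W1
        scores = z_t @ z_v_prime.transpose(-2, -1)  # S630/S640: Zt x (Z'v)^T
        A = F.softmax(scores, dim=-1)               # S640: attention weights A
        az_v = A @ z_v                              # S650: A x Zv
        return torch.cat([az_v, z_t], dim=-1)       # S650: concat -> Z't

# usage: 20 text tokens and 8 video clips, both 512-dimensional
fusion = TextVisionFusion()
z_t, z_v = torch.randn(20, 512), torch.randn(8, 512)
z_t_prime = fusion(z_t, z_v)  # shape (20, 1024) after concatenation
```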
  • For the structure of the Text-vision fusion in the sub-layer newly added to the encoder of the transformer model for fusing image-class features and text-class features, reference is made to FIG. 7, where W1 is the weight matrix, Zv is the video feature vector, A is the attention weight, and Z′t is the fusion encoding information.
  • It can be understood that, due to the introduction of image-class features, the method implemented in this embodiment can achieve better results in automatic text summarization, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
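For reference, a generated summary can be scored against a ground-truth summary with any ROUGE implementation; the sketch below uses the open-source rouge-score package (which the patent does not name) on made-up example sentences.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the model fuses video and text features to summarize the video"
generated = "video and text features are fused to summarize the video"

scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))  # F1 for ROUGE-1, ROUGE-2, ROUGE-L
```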
  • An embodiment of the present application also provides a video text summarization device based on a multimodal model, including:
  • a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • a vectorization module, configured to vectorize the video features to obtain video feature vectors;
  • a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
  • a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
  • a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
  • a training module, configured to input the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • The training module is further configured to input the video feature vectors and the dictionary into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information, and to transfer the fusion encoding information to the decoder for decoding processing to generate a text summary result.
  • The training module is further configured to input the dictionary into the encoder of the transformer model and to extract features from the dictionary through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm.
  • The training module is further configured to: input the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features; perform matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; perform matrix transposition on Z′v to obtain (Z′v)ᵀ; multiply (Z′v)ᵀ with Zt and apply the softmax function to obtain A, where A is the attention weight; multiply A and Zv to obtain AZv; and vector-concatenate AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • The first extraction module is further configured to extract features from the video data using the 3D ResNet-101 model to obtain the video feature vectors.
  • The word segmentation module is further configured to segment the text information using the hanlp word segmentation tool to obtain multiple pieces of word information.
  • The word segmentation module is further configured to segment the text information using the hanlp word segmentation tool to obtain multiple pieces of word information arranged in the same line, and to split the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • an embodiment of the present application provides a device, which includes: a memory, a processor, and a computer program stored on the memory and operable on the processor.
  • a video text summarization method based on a multimodal model is implemented.
  • the method includes: performing feature extraction on video data to obtain video features, and the video data is video data for which text summaries need to be extracted; vectorizing the video features to obtain video feature vectors; performing voice extraction on the video data to obtain monologue voice information;
  • the text information is subjected to word segmentation processing to obtain multiple word information; the video feature vector and multiple word information are input to the transformer model for training, and the text summary result is obtained.
  • the encoder of the transformer model is set with a sub-layer for fusing image-like features and text-like features.
  • the sub-layer includes Text-vision fusion and Add&Norm.
  • In the video text summarization method based on the multimodal model, inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result includes: inputting the video feature vectors and the multiple pieces of word information into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information; and transferring the fusion encoding information to the decoder for decoding processing to generate the text summary result.
  • Inputting the video feature vectors and the multiple pieces of word information into the encoder provided with the sub-layer for fusing image-class features and text-class features to perform fusion processing to obtain the fusion encoding information includes: inputting the multiple pieces of word information into the encoder of the transformer model, and sequentially extracting features from the multiple pieces of word information through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm; and inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • Inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image features and text features to perform fusion processing to obtain the fusion encoding information includes: inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image features and text features; performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; performing matrix transposition on Z′v to obtain (Z′v)ᵀ; multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight; multiplying A and Zv to obtain AZv; and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • Performing feature extraction on the video data to obtain the video feature vectors includes: using the 3D ResNet-101 model to perform feature extraction on the video data to obtain the video feature vectors.
  • Segmenting the text information to obtain the multiple pieces of word information includes: using the hanlp word segmentation tool to segment the text information to obtain multiple pieces of word information arranged in the same line, and splitting the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • The processor and the memory can be connected by a bus or other means.
  • The device in this embodiment may include the memory and the processor of the embodiment shown in FIG. 1 and can constitute part of the system architecture platform of the embodiment shown in FIG. 1.
  • The non-transitory software programs and instructions required to implement the device-side video text summarization method based on the multimodal model of the above embodiments are stored in the memory and, when executed by the processor, perform the video text summarization method based on the multimodal model of the above embodiments, for example the method steps S100 to S600 in FIG. 2, the method steps S410 to S420 in FIG. 4, the method steps S510 to S520 in FIG. 5, and the method steps S610 to S650 in FIG. 6.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed, perform the terminal-side video text summarization method based on the multimodal model. The method includes: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image features and text features; the sub-layer includes Text-vision fusion and Add&Norm.
  • The computer-executable instructions, when executing the terminal-side video text summarization method based on the multimodal model, implement inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result as follows: inputting the video feature vectors and the multiple pieces of word information into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information; and transferring the fusion encoding information to the decoder for decoding processing to generate the text summary result.
  • The fusion processing includes: inputting the multiple pieces of word information into the encoder of the transformer model, and sequentially extracting features from the multiple pieces of word information through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm; and inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • Inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features to perform fusion processing to obtain the fusion encoding information includes: inputting the text feature vectors and the video feature vectors into the sub-layer; performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; performing matrix transposition on Z′v to obtain (Z′v)ᵀ; multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight; multiplying A and Zv to obtain AZv; and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • Performing feature extraction on the video data to obtain the video feature vectors includes: using the 3D ResNet-101 model to perform feature extraction on the video data to obtain the video feature vectors.
  • Segmenting the text information to obtain the multiple pieces of word information includes: using the hanlp word segmentation tool to segment the text information to obtain multiple pieces of word information arranged in the same line, and splitting the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • For example, the method steps S100 to S600 in FIG. 2, the method steps S410 to S420 in FIG. 4, the method steps S510 to S520 in FIG. 5, and the method steps S610 to S650 in FIG. 6 described above are performed.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as is known to those of ordinary skill in the art. It should be noted that the computer-readable storage medium may be non-volatile or volatile.

Abstract

A video text summarization method based on a multi-modal model, a device and a storage medium, relating to artificial intelligence technology. The method comprises the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted (S100); vectorizing the video features to obtain a video feature vector (S200); performing speech extraction on the video data to obtain monologue speech information (S300); by means of automatic speech recognition (ASR), converting the monologue speech information into text information (S400); performing word segmentation processing on the text information to obtain a plurality of pieces of word information (S500); and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, a sub-layer used for fusing image-class features and text-class features being arranged in an encoder of the transformer model (S600). The accuracy of text summary content extracted from a video can be improved.

Description

Video text summarization method, device and storage medium based on a multimodal model
This application claims priority to the Chinese patent application No. 202210056075.6, filed with the China Patent Office on January 18, 2022 and entitled "Video text summarization method, device and storage medium based on multimodal model", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multimodal model.
Background
At present, most intelligent video summarization methods in the industry are text-extractive: videos contain spoken monologues, and the summary is usually generated by converting the speech into text with automatic speech recognition (ASR) technology and then applying natural language processing techniques.
Technical Problem
The inventors have recognized the following technical problem in the prior art: the current processing method mainly scores each sentence for importance and extracts the highest-scoring sentences. Because these sentences are pulled out of the monologue in isolation, the resulting summary is incoherent from sentence to sentence.
技术解决方案technical solution
第一方面,本申请实施例提供了一种基于多模态模型的视频文本摘要方法,包括:In the first aspect, the embodiment of the present application provides a video text summarization method based on a multimodal model, including:
对视频数据进行特征提取,得到视频特征,所述视频数据为需要提取文本摘要的视频数据;Carry out feature extraction to video data, obtain video feature, described video data is the video data that needs to extract text summarization;
将所述视频特征进行向量化处理,得到视频特征向量;The video features are vectorized to obtain video feature vectors;
对所述视频数据进行语音提取,得到独白语音信息;Carry out speech extraction to described video data, obtain monologue speech information;
通过自动语音识别技术ASR将所述独白语音信息转换为文本信息;Converting the monologue voice information into text information through automatic speech recognition technology ASR;
将所述文本信息进行分词处理,得到多个词信息;performing word segmentation processing on the text information to obtain multiple word information;
将所述视频特征向量和多个所述词信息输入至transformer模型进行训练,得到文本摘要结果,所述transformer模型的编码器中设置用于将图像类特征和文本类特征进行融合的子层,所述子层包括Text-vision fusion和Add&Norm。The video feature vector and a plurality of the word information are input to the transformer model for training, and the text summary result is obtained. In the encoder of the transformer model, a sub-layer for merging the image class feature and the text class feature is set, and the sub-layer includes Text-vision fusion and Add&Norm.
In a second aspect, an embodiment of the present application provides a video text summarization device based on a multimodal model, including:
a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
a vectorization module, configured to vectorize the video features to obtain video feature vectors;
a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
a training module, configured to input the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
In a third aspect, an embodiment of the present application provides a device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method including:
performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
vectorizing the video features to obtain video feature vectors;
performing speech extraction on the video data to obtain monologue speech information;
converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
segmenting the text information to obtain multiple pieces of word information;
inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing a video text summarization method based on a multimodal model, the method including:
performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
vectorizing the video features to obtain video feature vectors;
performing speech extraction on the video data to obtain monologue speech information;
converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
segmenting the text information to obtain multiple pieces of word information;
inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
Beneficial Effects
In the video text summarization method, device and storage medium based on a multimodal model proposed in the embodiments of the present application, a sub-layer for fusing image-class features and text-class features is newly provided in the encoder of the transformer model, so that the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate; that is, the accuracy of the text summary content extracted from a video can be effectively improved.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a system architecture platform for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the improved transformer model in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 4 is a flowchart of generating a text summary result in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of generating fusion encoding information in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application.
Embodiments of the Present Invention
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit it.
It should be noted that although functional modules are divided in the device schematic diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, the claims or the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
An embodiment of the present application provides a video text summarization method based on a multimodal model. The method includes the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm. In the technical solution of this embodiment, the sub-layer newly provided in the encoder of the transformer model makes the text summary result obtained by training the video feature vectors and the dictionary in the transformer model more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
The embodiments of the present application are further described below in conjunction with the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a schematic diagram of a system architecture platform 100 for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application.
In the example of Fig. 1, the system architecture platform 100 is provided with a processor 110 and a memory 120, where the processor 110 and the memory 120 may be connected via a bus or in other ways; in Fig. 1, connection via a bus is taken as an example.
As a non-transitory computer-readable storage medium, the memory 120 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 120 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 120 may optionally include memories located remotely from the processor 110, and these remote memories may be connected to the system architecture platform through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art can understand that the system architecture platform can be applied to 5G communication network systems and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
Those skilled in the art can understand that the system architecture platform shown in Fig. 1 does not limit the embodiments of the present application; it may include more or fewer components than shown, combine some components, or use a different arrangement of components.
The system architecture platform 100 may be an independent platform, or it may be a cloud platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Based on the above system architecture platform, various embodiments of the video text summarization method based on a multimodal model of the present application are proposed below.
As shown in Fig. 2, Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application. The method is applied to the above architecture platform and includes, but is not limited to, steps S100 to S600.
Step S100: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
Step S200: vectorizing the video features to obtain video feature vectors;
Step S300: performing speech extraction on the video data to obtain monologue speech information;
Step S400: converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
Step S500: segmenting the text information to obtain multiple pieces of word information;
Step S600: inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
在一实施例中,获取需要提取文本摘要的视频数据,视频数据包括视频图像信息和语音信息,其中语音信息包括独白语音信息和背景语音信息,然后对视频数据进行特征提取得到视频特征,并将视频特征进行向量化处理得到视频特征向量,其中视频特征向量为后续训练步骤做准备;同时可以对视频数据进行语音提取得到独白语音信息,并通过自动语音识别技术ASR将独白语音信息转换为文本信息,再将文本信息进行分词处理得到多个词信息;接着将已经处理好的视频特征向量和多个词信息输入至transformer模型中进行训练,得到经过将图像类特征和文本类特征进行融合的子层训练处理后的文本摘要结果。由于在transformer模型的编码器中新设置用于将图像类特征和文本类特征进行融合的子层,该子层包括Text-vision fusion和Add&Norm,使得视频特征向量和字典在transformer模型训练得出的文本摘要结果更加准确,能够有效改善在视频中提取的文本摘要内容的准确性。需要说明的是,可以利用3D ResNet-101模型对视频数据进行特征提取得到视频特征向量,也可以利用其它模型对视频数据中的视频特征进行提取,本实施例对其不作唯一限定。In one embodiment, the video data that needs to be extracted from the text summary is obtained. The video data includes video image information and voice information, wherein the voice information includes monologue voice information and background voice information, then feature extraction is performed on the video data to obtain video features, and the video features are vectorized to obtain video feature vectors, wherein the video feature vectors are prepared for subsequent training steps; at the same time, voice extraction can be performed on the video data to obtain monologue voice information, and the monologue voice information is converted into text information by automatic speech recognition technology ASR, and then the text information is subjected to word segmentation processing to obtain multiple word information; The video feature vector and multiple word information are input into the transformer model for training, and the text summarization result after the sub-layer training processing of the fusion of image-like features and text-like features is obtained. Since the new sub-layer for fusing image-like features and text-like features is set in the encoder of the transformer model, this sub-layer includes Text-vision fusion and Add&Norm, which makes the text summary results obtained by training the video feature vector and dictionary in the transformer model more accurate, and can effectively improve the accuracy of the text summary content extracted from the video. It should be noted that the 3D ResNet-101 model can be used to extract features from video data to obtain video feature vectors, and other models can also be used to extract video features from video data, which is not uniquely limited in this embodiment.
It can be understood that the step of segmenting the text information into a plurality of pieces of word information may use the hanlp word segmentation tool to segment the text information directly into a plurality of pieces of word information, or may use the hanlp tool to obtain a plurality of pieces of word information arranged on a single line and then split them line by line to obtain word information arranged in a line-per-word dictionary structure; this embodiment does not specifically limit the approach. It can be understood that the dictionary in this embodiment includes a plurality of pieces of word information; each piece of word information occupies an independent line in the dictionary and corresponds to a line position number.
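For illustration only, the following sketch builds the line-per-word dictionary just described; it assumes the pyhanlp binding (whose HanLP.segment call returns terms with a word attribute), and the sample sentence and file name words.dict are invented:

```python
from pyhanlp import HanLP  # assumes the pyhanlp binding is installed

text = "今天的视频介绍了基于多模态模型的文本摘要方法"
words = [term.word for term in HanLP.segment(text)]  # word segmentation

# Line-per-word dictionary: each word on its own line, its line number
# serving as the word's position index, as described above.
with open("words.dict", "w", encoding="utf-8") as f:
    for line_no, word in enumerate(words):
        f.write(word + "\n")
        print(line_no, word)
```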
It should be noted that, referring to FIG. 3, this embodiment adds a sublayer to the Encoder Layer of the conventional transformer model; this sublayer fuses image-class features and text-class features and comprises Text-vision fusion and Add&Norm. The improved transformer model therefore includes an Encoder side (encoder) and a Decoder side (decoder). The Encoder Layer on the Encoder side includes three sublayers: a first sublayer (multi-head self-attention and Add&Norm), a second sublayer (FFN and Add&Norm) and a third sublayer (Text-vision fusion and Add&Norm). The Decoder Layer on the Decoder side includes three sublayers: a fourth sublayer (masked multi-head self-attention and Add&Norm), a fifth sublayer (multi-head Enc-Dec attention and Add&Norm) and a sixth sublayer (FFN and Add&Norm).
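As a structural sketch (the hyperparameters and the residual placement of the third Add&Norm are assumptions, since the text does not fix them), the modified Encoder Layer could be composed in PyTorch as follows, with the fusion math matching the formulas given later in this document:

```python
import torch
import torch.nn as nn

class TextVisionFusion(nn.Module):
    """Sketch of the fusion sublayer: Z'_t = Concat(Z_t, A Z_v) W_2."""
    def __init__(self, d_model):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)      # W_1
        self.w2 = nn.Linear(2 * d_model, d_model, bias=False)  # W_2

    def forward(self, z_t, z_v):
        a = torch.softmax(z_t @ self.w1(z_v).transpose(-2, -1), dim=-1)
        return self.w2(torch.cat([z_t, a @ z_v], dim=-1))

class ModifiedEncoderLayer(nn.Module):
    """Encoder layer with the extra text-vision fusion sublayer (sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # First sublayer: multi-head self-attention + Add&Norm
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Second sublayer: FFN + Add&Norm
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # Third sublayer (new here): Text-vision fusion + Add&Norm
        self.fusion = TextVisionFusion(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, video):
        attn_out, _ = self.self_attn(text, text, text)
        z = self.norm1(text + attn_out)                # Add&Norm
        z = self.norm2(z + self.ffn(z))                # Add&Norm
        return self.norm3(z + self.fusion(z, video))   # assumed residual

layer = ModifiedEncoderLayer()
text = torch.randn(2, 20, 512)   # (batch, text tokens, d_model)
video = torch.randn(2, 8, 512)   # (batch, video clips, d_model)
print(layer(text, video).shape)  # torch.Size([2, 20, 512])
```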
It should be noted that Text Inputs is the segmented input text (the dictionary). In the conventional transformer model these words correspond to three embeddings: a token embedding (also called a word embedding), whose role is to map human language into a geometric space; a segment embedding; and a positional embedding. This embodiment uses only the token embedding and the segment embedding. The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N*512 to obtain a token embedding of vector length 512, where the text inputs form a vector whose dimension is the number of dictionary lines N, each word in the text inputs corresponds to one position in the dictionary, that position is marked with 1, and the remaining positions are marked with 0.
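Worked as a minimal numpy example (N, the random weights and the line number are invented for illustration), the token-embedding lookup described above is a one-hot vector of length N multiplied by the N*512 weight matrix, which simply selects one row of that matrix:

```python
import numpy as np

N, d = 10000, 512                 # dictionary lines, embedding width
rng = np.random.default_rng(0)
W = rng.normal(size=(N, d))       # the N*512 token-embedding weight matrix

word_line_no = 42                 # the word's line position in the dictionary
one_hot = np.zeros(N)
one_hot[word_line_no] = 1.0       # mark the word's position with 1, rest 0

token_embedding = one_hot @ W     # length-512 token embedding
assert np.allclose(token_embedding, W[word_line_no])  # selects row 42
```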
Referring to FIG. 4, in an embodiment, step S600 includes, but is not limited to, step S410 and step S420.
Step S410: input the video feature vector and the dictionary into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information;
Step S420: pass the fused encoding information to the decoder for decoding to generate the text summarization result.
Specifically, the video feature vector and the dictionary may each be input into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; the fused encoding information is then passed to the decoder for decoding, thereby generating the text summarization result. Because the sublayer that fuses image-class features and text-class features is newly added to the encoder, the result is not trained purely from text feature vectors, so the summary obtained from the video feature vector and the dictionary is more accurate, effectively improving the accuracy of the text summary extracted from the video.
Referring to FIG. 5, in an embodiment, step S410 includes, but is not limited to, step S510 and step S520.
Step S510: input the dictionary into the encoder of the transformer model and pass it in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm;
Step S520: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
Specifically, the dictionary may first be input into the encoder of the transformer model for processing: it passes in turn through the first sublayer (multi-head self-attention and Add&Norm) and the second sublayer (FFN and Add&Norm) to obtain the text feature vector, and the text feature vector and the video feature vector are then input into the newly added sublayer, where image-class features and text-class features are fused to obtain the fused encoding information. Because the fused encoding information is formed by fusing the text feature vector with the video feature vector rather than being trained purely from the text feature vector, the text summary generated by decoding it is more accurate, effectively improving the accuracy of the text summary content extracted from the video.
Referring to FIG. 6, in an embodiment, step S520 includes, but is not limited to, steps S610, S620, S630, S640 and S650.
Step S610: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
Step S620: matrix-multiply Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector;
Step S630: transpose Z′_v to obtain (Z′_v)^T;
Step S640: matrix-multiply (Z′_v)^T by Z_t and process the product through the softmax function to obtain A, where A is the attention weight;
Step S650: multiply A by Z_v to obtain AZ_v, and concatenate AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
Specifically, the text feature vector and the video feature vector are input into the sublayer for fusing image-class features and text-class features. In this sublayer, Z_v is first matrix-multiplied by the weight matrix to obtain Z′_v; Z′_v is then transposed to obtain (Z′_v)^T; (Z′_v)^T is matrix-multiplied by Z_t and processed through the softmax function to obtain A; A is multiplied by Z_v to obtain AZ_v; and AZ_v is concatenated with Z_t to obtain Z′_t. Here Z_v is the video feature vector, A is the attention weight and Z′_t is the fused encoding information.
In an embodiment, the structure of the Text-vision fusion within the sublayer newly added to the encoder of the transformer model for fusing image-class features and text-class features is shown in FIG. 7. Based on this structure, the fusion of the text feature vector and the video feature vector in Text-vision fusion can be expressed mathematically as follows:

Z′_v = Z_v W_1

A = softmax(Z_t (Z′_v)^T)

Z′_t = Concat(Z_t, AZ_v) W_2

where W_1 and W_2 are weight matrices, Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight and Z′_t is the fused encoding information.
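Translated directly into numpy (a minimal sketch; all shapes are invented for illustration, with the text and video features assumed to share width d), the three formulas above read:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_t, n_v, d = 20, 8, 512          # text tokens, video clips, feature width
rng = np.random.default_rng(0)
Z_t = rng.normal(size=(n_t, d))   # text feature vectors
Z_v = rng.normal(size=(n_v, d))   # video feature vectors
W_1 = rng.normal(size=(d, d))     # weight matrix W_1
W_2 = rng.normal(size=(2 * d, d)) # weight matrix W_2

Z_v_p = Z_v @ W_1                                      # Z'_v = Z_v W_1
A = softmax(Z_t @ Z_v_p.T)                             # softmax(Z_t (Z'_v)^T)
Z_t_p = np.concatenate([Z_t, A @ Z_v], axis=-1) @ W_2  # Z'_t
print(Z_t_p.shape)                                     # (20, 512)
```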
Because the method of this embodiment introduces image-class features, it achieves better automatic text summarization performance, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
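For reference, such scores can be computed with the open-source rouge-score package; the reference and generated summaries below are invented purely to show the call:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "the speaker introduces a multimodal summarization method"
generated = "the speaker presents a multimodal video summarization method"
scores = scorer.score(reference, generated)  # target first, then prediction
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))
```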
Based on the above video text summarization method based on a multimodal model, embodiments of the present application's video text summarization apparatus, controller and computer-readable storage medium are respectively presented below.
An embodiment of the present application further provides a video text summarization apparatus based on a multimodal model, including:
a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
a vectorization module, configured to vectorize the video features to obtain a video feature vector;
a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR);
a word segmentation module, configured to perform word segmentation on the text information to obtain a plurality of pieces of word information;
a training module, configured to input the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information, and to pass the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the transformer model and pass it in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm, and to input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, the training module is further configured to: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiply Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transpose Z′_v to obtain (Z′_v)^T; matrix-multiply (Z′_v)^T by Z_t and process the product through the softmax function to obtain A, where A is the attention weight; and multiply A by Z_v to obtain AZ_v and concatenate AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, the extraction module is further configured to perform feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line, and to split the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
It should be noted that the embodiments of the above video text summarization apparatus based on a multimodal model use the same technical means, solve the same technical problems and achieve the same technical effects as the embodiments of the video text summarization method based on a multimodal model; details are not repeated here and can be found in the method embodiments.
In addition, an embodiment of the present application provides a device, including a memory, a processor and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, it implements a video text summarization method based on a multimodal model, the method including: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of pieces of word information; and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result includes: inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and passing the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information includes: inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, when the processor executes the computer program, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information includes: inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiplying Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transposing Z′_v to obtain (Z′_v)^T; matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, where A is the attention weight; and multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, when the processor executes the computer program, performing feature extraction on the video data to obtain the video feature vector includes: performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, when the processor executes the computer program, performing word segmentation on the text information to obtain the plurality of pieces of word information includes: performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
The processor and the memory may be connected by a bus or in other ways.
It should be noted that the device in this embodiment may correspond to a device including the memory and the processor of the embodiment shown in FIG. 1 and can form part of the system architecture platform of that embodiment; the two belong to the same inventive concept and therefore share the same implementation principles and beneficial effects, which are not detailed here.
The non-transitory software programs and instructions required to implement the device-side video text summarization method based on a multimodal model of the above embodiments are stored in the memory and, when executed by the processor, carry out that method, for example performing method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5 and method steps S610 to S650 in FIG. 6 described above.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions for executing the above terminal-side video text summarization method based on a multimodal model, the method including: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of pieces of word information; and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result, which includes: inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and passing the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information, which includes: inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information, which includes: inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiplying Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transposing Z′_v to obtain (Z′_v)^T; matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, where A is the attention weight; and multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, performing feature extraction on the video data to obtain the video feature vector, which includes: performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, performing word segmentation on the text information to obtain the plurality of pieces of word information, which includes: performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
For example, method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5 and method steps S610 to S650 in FIG. 6 described above are performed.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. It should be noted that the computer-readable storage medium may be non-volatile or volatile.
The preferred implementations of the present application have been described above, but the present application is not limited to the above embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present application, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A video text summarization method based on a multimodal model, comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  2. The video text summarization method based on a multimodal model according to claim 1, wherein inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  3. The video text summarization method based on a multimodal model according to claim 2, wherein inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  4. The video text summarization method based on a multimodal model according to claim 3, wherein inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  5. The video text summarization method based on a multimodal model according to claim 1, wherein performing feature extraction on the video data to obtain the video feature vector comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  6. The video text summarization method based on a multimodal model according to claim 1, wherein performing word segmentation on the text information to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information.
  7. The video text summarization method based on a multimodal model according to claim 6, wherein performing word segmentation on the text information using the hanlp word segmentation tool to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
  8. A video text summarization apparatus based on a multimodal model, comprising:
    a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    a vectorization module, configured to vectorize the video features to obtain a video feature vector;
    a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
    a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR);
    a word segmentation module, configured to perform word segmentation on the text information to obtain a plurality of pieces of word information; and
    a training module, configured to input the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  9. A device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  10. The device according to claim 9, wherein, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  11. The device according to claim 10, wherein, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  12. The device according to claim 11, wherein, when the processor executes the computer program, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  13. The device according to claim 9, wherein, when the processor executes the computer program, performing feature extraction on the video data to obtain the video feature vector comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  14. The device according to claim 9, wherein, when the processor executes the computer program, performing word segmentation on the text information to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute a video text summarization method based on a multimodal model, the method comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  16. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result, which comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  17. The computer-readable storage medium according to claim 16, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information, which comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  18. The computer-readable storage medium according to claim 17, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information, which comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  19. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, performing feature extraction on the video data to obtain the video feature vector, which comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  20. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, performing word segmentation on the text information to obtain the plurality of pieces of word information, which comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
PCT/CN2022/090712 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium WO2023137913A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model
CN202210056075.6 2022-01-18

Publications (1)

Publication Number Publication Date
WO2023137913A1 true WO2023137913A1 (en) 2023-07-27

Family

ID=81231310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090712 WO2023137913A1 (en) 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium

Country Status (2)

Country Link
CN (1) CN114398889A (en)
WO (1) WO2023137913A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117521017B (en) * 2024-01-03 2024-04-05 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN111767461B (en) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 Data processing method and device
CN113159010B (en) * 2021-03-05 2022-07-22 北京百度网讯科技有限公司 Video classification method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417134A (en) * 2020-10-30 2021-02-26 同济大学 Automatic abstract generation system and method based on voice text deep fusion features
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium
CN113052149A (en) * 2021-05-20 2021-06-29 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113609285A (en) * 2021-08-09 2021-11-05 福州大学 Multi-mode text summarization system based on door control fusion mechanism
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIEZHENG YU; WENLIANG DAI; ZIHAN LIU; PASCALE FUNG: "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization", arXiv.org, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 1 October 2021 (2021-10-01), XP091069943 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474094A (en) * 2023-12-22 2024-01-30 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer
CN117474094B (en) * 2023-12-22 2024-04-09 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer

Also Published As

Publication number Publication date
CN114398889A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
WO2023137913A1 (en) Video text summarization method based on multi-modal model, device and storage medium
US20240054767A1 (en) Multi-modal Model Training Method, Apparatus and Device, and Storage Medium
CN109388793B (en) Entity marking method, intention identification method, corresponding device and computer storage medium
WO2019200923A1 (en) Pinyin-based semantic recognition method and device and human-machine conversation system
KR102124466B1 (en) Apparatus and method for generating conti for webtoon
CN111709243A (en) Knowledge extraction method and device based on deep learning
WO2023134088A1 (en) Video summary generation method and apparatus, electronic device, and storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN114092930B (en) Character recognition method and system
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
WO2022141864A1 (en) Conversation intent recognition model training method, apparatus, computer device, and medium
US20220310070A1 (en) Artificial Intelligence System for Capturing Context by Dilated Self-Attention
CN112016271A (en) Language style conversion model training method, text processing method and device
CN113157959A (en) Cross-modal retrieval method, device and system based on multi-modal theme supplement
CN116434752A (en) Speech recognition error correction method and device
CN117010500A (en) Visual knowledge reasoning question-answering method based on multi-source heterogeneous knowledge joint enhancement
CN111930894A (en) Long text matching method and device, storage medium and electronic equipment
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
CN114973229A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
WO2022095370A1 (en) Text matching method and apparatus, terminal device, and storage medium
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
WO2023178802A1 (en) Named entity recognition method and apparatus, device, and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921344

Country of ref document: EP

Kind code of ref document: A1