WO2023137913A1 - Video text summarization method based on multi-modal model, device and storage medium - Google Patents

Video text summarization method based on multi-modal model, device and storage medium

Info

Publication number: WO2023137913A1
Authority: WO (WIPO, PCT)
Prior art keywords: text, video, information, features, feature vector
Application number: PCT/CN2022/090712
Other languages: French (fr), Chinese (zh)
Inventors: 舒畅 (Shu Chang), 陈又新 (Chen Youxin)
Original assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2022-01-18 (from Chinese application No. 202210056075.6; see the priority claim in the Description)
Publication of WO2023137913A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Definitions

  • The embodiments of the present application relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multimodal model.
  • The current processing method mainly scores each sentence for importance and extracts the highest-scoring sentences. Because several sentences are pulled out of the monologue in isolation, the resulting summary is incoherent from sentence to sentence.
  • An embodiment of the present application provides a video text summarization method based on a multimodal model, including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a video text summarization device based on a multimodal model, including:
  • a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • a vectorization module, configured to vectorize the video features to obtain video feature vectors;
  • a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
  • a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
  • a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
  • a training module, configured to input the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing a video text summarization method based on a multimodal model, the method including:
  • vectorizing the video features to obtain video feature vectors;
  • inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result;
  • providing, in the encoder of the transformer model, a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • In the video text summarization method, device and storage medium based on the multimodal model proposed in the embodiments of the present application, a sub-layer for fusing image-class features and text-class features is newly provided in the encoder of the transformer model, so that the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate; that is, the accuracy of the text summary content extracted from the video can be effectively improved.
  • Fig. 1 is a schematic diagram of a system architecture platform for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 3 is a schematic diagram of the improved transformer model in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 4 is a flowchart of generating a text summary result in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of generating fusion encoding information in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • An embodiment of the present application provides a video text summarization method based on a multimodal model. The method includes the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features; the sub-layer includes Text-vision fusion and Add&Norm.
  • Because this sub-layer is newly provided in the encoder of the transformer model, the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
  • FIG. 1 is a schematic diagram of a system architecture platform 100 for implementing a video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • The system architecture platform 100 is provided with a processor 110 and a memory 120, where the processor 110 and the memory 120 may be connected via a bus or in other ways; in FIG. 1, connection via a bus is taken as an example.
  • the memory 120 can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory 120 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 120 may optionally include memories that are remotely located relative to the processor 110, and these remote memories may be connected to the system architecture platform through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The system architecture platform can be applied to 5G communication network systems and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
  • The system architecture platform shown in FIG. 1 does not limit the embodiments of the present application; it may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components.
  • The system architecture platform 100 may be an independent platform, or it may be a cloud platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
  • Figure 2 is a flowchart of a video text summarization method based on a multimodal model provided by an embodiment of the present application.
  • The video text summarization method based on a multimodal model is applied to the above-mentioned architecture platform and includes, but is not limited to, steps S100, S200, S300, S400, S500 and S600.
  • Step S100: feature extraction is performed on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • Step S200: the video features are vectorized to obtain video feature vectors;
  • Step S300: speech extraction is performed on the video data to obtain monologue speech information;
  • Step S400: the monologue speech information is converted into text information through automatic speech recognition (ASR) technology;
  • Step S500: the text information is segmented to obtain multiple pieces of word information;
  • Step S600: the video feature vectors and the multiple pieces of word information are input into the transformer model for training to obtain a text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features; the sub-layer includes Text-vision fusion and Add&Norm.
  • In one embodiment, the video data from which a text summary needs to be extracted is obtained.
  • The video data includes video image information and speech information, where the speech information includes monologue speech information and background speech information. Feature extraction is performed on the video data to obtain video features, and the video features are vectorized to obtain video feature vectors, which are prepared for the subsequent training step. At the same time, speech extraction can be performed on the video data to obtain the monologue speech information, the monologue speech information is converted into text information through automatic speech recognition (ASR) technology, and the text information is then segmented to obtain multiple pieces of word information.
  • The video feature vectors and the multiple pieces of word information are then input into the transformer model for training, and the text summary result processed by the sub-layer that fuses image-class features and text-class features is obtained.
  • This sub-layer includes Text-vision fusion and Add&Norm, which makes the text summary result obtained by training the video feature vectors and the dictionary in the transformer model more accurate and can effectively improve the accuracy of the text summary content extracted from the video.
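As a hedged illustration of the ASR step (the patent does not name a particular ASR engine; the open-source SpeechRecognition package and the file name below are assumptions of this sketch, not part of the disclosure):

```python
# Hypothetical ASR step: the audio track is first pulled out of the video
# (e.g. with: ffmpeg -i video.mp4 monologue.wav), then transcribed. The
# SpeechRecognition package and Google's free web recognizer are stand-ins;
# any ASR engine fits the method.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("monologue.wav") as source:   # placeholder file name
    audio = recognizer.record(source)

# the language code assumes a Chinese-language monologue, as in this patent
text_information = recognizer.recognize_google(audio, language="zh-CN")
print(text_information)
```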
  • It should be noted that the 3D ResNet-101 model can be used to perform feature extraction on the video data to obtain the video feature vectors; other models can also be used to extract video features from the video data, which is not uniquely limited in this embodiment.
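A hedged sketch (not from the patent) of clip-level feature extraction with a 3D CNN in PyTorch follows. torchvision only ships an 18-layer 3D ResNet (r3d_18), used here as a stand-in for the 3D ResNet-101 named above; the input shape and the 512-dimensional output are properties of r3d_18, not of the patent's model.

```python
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")
model.fc = torch.nn.Identity()  # drop the classifier head, keep pooled features
model.eval()

# a clip of 16 RGB frames at 112x112: (batch, channels, time, height, width)
frames = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    video_feature_vector = model(frames)  # shape (1, 512)
```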
  • The step of segmenting the text information to obtain multiple pieces of word information may use the hanlp word segmentation tool to segment the text information directly into multiple pieces of word information; alternatively, the hanlp tool may first produce multiple pieces of word information arranged in the same line, and the multiple pieces of word information are then split line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure. This embodiment does not specifically limit this.
  • The dictionary in this embodiment includes multiple pieces of word information; each piece of word information occupies an independent line in the dictionary, and each piece of word information corresponds to a line position number.
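The following is a hedged sketch of this segmentation step; the HanLP 2.x tokenizer name (COARSE_ELECTRA_SMALL_ZH) and the sample sentence are illustrative choices, not taken from the patent.

```python
# Segment the ASR transcript with HanLP, then lay the words out one per line
# so each word's row index serves as its dictionary position number.
import hanlp

tokenizer = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
words = tokenizer("基于多模态模型的视频文本摘要方法")  # sample sentence

dictionary = {row: word for row, word in enumerate(words)}  # one word per row
for row, word in dictionary.items():
    print(row, word)
```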
  • It should be noted that, referring to FIG. 3, this embodiment adds a sub-layer to the Encoder Layer of the traditional transformer model; the sub-layer fuses image-class features and text-class features and includes Text-vision fusion and Add&Norm.
  • The improved transformer model includes an Encoder side (encoder) and a Decoder side (decoder).
  • The Encoder Layer on the Encoder side includes three sub-layers: the first sub-layer (multi-head self-attention and Add&Norm), the second sub-layer (FFN and Add&Norm) and the third sub-layer (Text-vision fusion and Add&Norm).
  • The Decoder Layer on the Decoder side includes three sub-layers: the fourth sub-layer (masked multi-head self-attention and Add&Norm), the fifth sub-layer (multi-head Enc-Dec attention and Add&Norm) and the sixth sub-layer (FFN and Add&Norm).
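To make the layer wiring concrete, here is a structural sketch of the modified Encoder Layer in PyTorch. It is an illustration under stated assumptions rather than the patent's implementation: d_model=512 follows the embedding size given below, the linear projection of the concatenated fusion output back to d_model (so that the residual Add&Norm type-checks) is an assumption the patent does not spell out, and TextVisionFusion is the module sketched after steps S610 to S650 later in this section.

```python
import torch
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """Encoder layer with the three sublayers described above:
    (1) multi-head self-attention + Add&Norm,
    (2) FFN + Add&Norm,
    (3) Text-vision fusion + Add&Norm (the newly added sublayer)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.fusion = TextVisionFusion(d_model)      # sketched after S610-S650 below
        self.proj = nn.Linear(2 * d_model, d_model)  # assumption: project concat back
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, z_t, z_v):
        attn_out, _ = self.self_attn(z_t, z_t, z_t)  # 1st sublayer
        z_t = self.norm1(z_t + attn_out)
        z_t = self.norm2(z_t + self.ffn(z_t))        # 2nd sublayer
        fused = self.proj(self.fusion(z_t, z_v))     # 3rd sublayer (new)
        return self.norm3(z_t + fused)
```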
  • It should be noted that Text Inputs is the input text that has already been segmented into words (the dictionary). In the traditional transformer model, these words correspond to three embeddings:
  • one is the token embedding (also called word embedding), whose function is to map human language into a geometric space; one is the segment embedding; and the other is the positional embedding. In this embodiment, only the token embedding and the segment embedding are used.
  • The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N*512 to obtain a token embedding with a vector length of 512, where the text inputs form a vector whose dimension is the number of dictionary lines N; each word in the text inputs corresponds to one position in the dictionary, and that position is marked with 1 while all other positions are marked with 0.
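As a worked illustration of this lookup (the dictionary size N and the row position below are arbitrary), multiplying a one-hot row vector by the N*512 weight matrix simply selects one row of that matrix:

```python
import numpy as np

N, d_model = 10000, 512
W = np.random.randn(N, d_model).astype(np.float32)  # learned weight matrix

def token_embedding(row_position: int) -> np.ndarray:
    one_hot = np.zeros(N, dtype=np.float32)
    one_hot[row_position] = 1.0        # 1 at the word's dictionary row, 0 elsewhere
    return one_hot @ W                 # length-512 embedding == W[row_position]

emb = token_embedding(42)
assert np.allclose(emb, W[42])         # the matmul is just a row lookup
```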
  • It can be understood that step S600 includes, but is not limited to, step S410 and step S420.
  • Step S410: inputting the video feature vectors and the dictionary into the encoder of the transformer model, which is provided with the sub-layer for fusing image features and text features, for fusion processing to obtain fusion encoding information;
  • Step S420: transferring the fusion encoding information to the decoder for decoding processing to generate a text summary result.
  • In one embodiment, the video feature vectors and the dictionary can be respectively input into the encoder of the transformer model provided with the sub-layer for fusing image features and text features for fusion processing to obtain the fusion encoding information, and the fusion encoding information is then passed to the decoder for decoding processing, thereby generating the text summary result.
  • Because the sub-layer for fusing image features and text features is newly provided in the encoder of the transformer model, rather than the model being trained solely on text feature vectors, the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
  • It can be understood that step S410 includes, but is not limited to, step S510 and step S520.
  • Step S510: the dictionary is input into the encoder of the transformer model, and feature extraction is performed on the dictionary sequentially through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm;
  • Step S520: the text feature vectors and the video feature vectors are input into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • In one embodiment, the dictionary can be input into the encoder of the transformer model for processing: the dictionary is processed through the first sub-layer (multi-head self-attention and Add&Norm) and the second sub-layer (FFN and Add&Norm) in the encoder to obtain the text feature vectors, and the text feature vectors and the video feature vectors are then input into the newly added sub-layer, where image features and text features are fused to obtain the fusion encoding information. Compared with training purely on text feature vectors, decoding this fusion encoding information through the decoder to generate a text summary is more accurate, which can effectively improve the accuracy of the text summary extracted from the video.
  • It can be understood that step S520 includes, but is not limited to, steps S610, S620, S630, S640 and S650.
  • Step S610: inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features;
  • Step S620: performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector;
  • Step S630: performing matrix transposition on Z′v to obtain (Z′v)ᵀ;
  • Step S640: multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight;
  • Step S650: multiplying A and Zv to obtain AZv, and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • In one embodiment, the text feature vectors and the video feature vectors are input into the sub-layer for fusing image features and text features.
  • Zv is first multiplied with the weight matrix to obtain Z′v; Z′v is then matrix-transposed to obtain (Z′v)ᵀ; (Z′v)ᵀ is multiplied with Zt and passed through the softmax function to obtain A; A is multiplied with Zv to obtain AZv; and AZv and Zt are vector-concatenated to obtain Z′t.
  • Here Zv is the video feature vector, A is the attention weight, and Z′t is the fusion encoding information.
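A hedged PyTorch sketch of steps S610 to S650 follows. The shapes (Zt as an (n_text, d) matrix of text features, Zv as an (n_vid, d_v) matrix of video features) and the multiplication order Zt x (Z′v)ᵀ are assumptions chosen so that the matrix dimensions are consistent; the patent's prose leaves them implicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVisionFusion(nn.Module):
    """Text-vision fusion sublayer (steps S610-S650)."""

    def __init__(self, d_model=512, d_video=512):
        super().__init__()
        self.W1 = nn.Linear(d_video, d_model, bias=False)  # the weight matrix W1

    def forward(self, z_t, z_v):
        z_v_prime = self.W1(z_v)                    # S620: Z'v = Zv x W1
        scores = z_t @ z_v_prime.transpose(-2, -1)  # S630/S640: Zt x (Z'v)^T
        A = F.softmax(scores, dim=-1)               # S640: attention weights A
        az_v = A @ z_v                              # S650: A x Zv
        return torch.cat([az_v, z_t], dim=-1)       # S650: concat -> Z't

# usage: 20 text tokens and 8 video clips, both 512-dimensional
fusion = TextVisionFusion()
z_t, z_v = torch.randn(20, 512), torch.randn(8, 512)
z_t_prime = fusion(z_t, z_v)  # shape (20, 1024) after concatenation
```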
  • For the structure of the Text-vision fusion in the sub-layer newly added to the encoder of the transformer model for fusing image-class features and text-class features, reference is made to FIG. 7, where W1 is the weight matrix, Zv is the video feature vector, A is the attention weight, and Z′t is the fusion encoding information.
  • It can be understood that, due to the introduction of image-class features, the method implemented in this embodiment can achieve better results in automatic text summarization, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
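For reference, a generated summary can be scored against a ground-truth summary with any ROUGE implementation; the sketch below uses the open-source rouge-score package (which the patent does not name) on made-up example sentences.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the model fuses video and text features to summarize the video"
generated = "video and text features are fused to summarize the video"

scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))  # F1 for ROUGE-1, ROUGE-2, ROUGE-L
```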
  • An embodiment of the present application also provides a video text summarization device based on a multimodal model, including:
  • a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
  • a vectorization module, configured to vectorize the video features to obtain video feature vectors;
  • a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
  • a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
  • a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
  • a training module, configured to input the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
  • The training module is further configured to input the video feature vectors and the dictionary into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information, and to transfer the fusion encoding information to the decoder for decoding processing to generate a text summary result.
  • The training module is further configured to input the dictionary into the encoder of the transformer model and to extract features from the dictionary through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm.
  • The training module is further configured to: input the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features; perform matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; perform matrix transposition on Z′v to obtain (Z′v)ᵀ; multiply (Z′v)ᵀ with Zt and apply the softmax function to obtain A, where A is the attention weight; multiply A and Zv to obtain AZv; and vector-concatenate AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • The first extraction module is further configured to extract features from the video data using the 3D ResNet-101 model to obtain the video feature vectors.
  • The word segmentation module is further configured to segment the text information using the hanlp word segmentation tool to obtain multiple pieces of word information.
  • The word segmentation module is further configured to segment the text information using the hanlp word segmentation tool to obtain multiple pieces of word information arranged in the same line, and to split the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • an embodiment of the present application provides a device, which includes: a memory, a processor, and a computer program stored on the memory and operable on the processor.
  • a video text summarization method based on a multimodal model is implemented.
  • the method includes: performing feature extraction on video data to obtain video features, and the video data is video data for which text summaries need to be extracted; vectorizing the video features to obtain video feature vectors; performing voice extraction on the video data to obtain monologue voice information;
  • the text information is subjected to word segmentation processing to obtain multiple word information; the video feature vector and multiple word information are input to the transformer model for training, and the text summary result is obtained.
  • the encoder of the transformer model is set with a sub-layer for fusing image-like features and text-like features.
  • the sub-layer includes Text-vision fusion and Add&Norm.
  • In the video text summarization method based on the multimodal model, inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result includes: inputting the video feature vectors and the multiple pieces of word information into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information; and transferring the fusion encoding information to the decoder for decoding processing to generate the text summary result.
  • Inputting the video feature vectors and the multiple pieces of word information into the encoder provided with the sub-layer for fusing image-class features and text-class features to perform fusion processing to obtain the fusion encoding information includes: inputting the multiple pieces of word information into the encoder of the transformer model, and sequentially extracting features from the multiple pieces of word information through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm; and inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • Inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image features and text features to perform fusion processing to obtain the fusion encoding information includes: inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image features and text features; performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; performing matrix transposition on Z′v to obtain (Z′v)ᵀ; multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight; multiplying A and Zv to obtain AZv; and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • Performing feature extraction on the video data to obtain the video feature vectors includes: using the 3D ResNet-101 model to perform feature extraction on the video data to obtain the video feature vectors.
  • Segmenting the text information to obtain the multiple pieces of word information includes: using the hanlp word segmentation tool to segment the text information to obtain multiple pieces of word information arranged in the same line, and splitting the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • The processor and the memory can be connected by a bus or other means.
  • The device in this embodiment may include the memory and the processor of the embodiment shown in FIG. 1 and can constitute part of the system architecture platform of the embodiment shown in FIG. 1.
  • The non-transitory software programs and instructions required to implement the device-side video text summarization method based on the multimodal model of the above embodiments are stored in the memory and, when executed by the processor, perform the video text summarization method based on the multimodal model of the above embodiments, for example the method steps S100 to S600 in FIG. 2, the method steps S410 to S420 in FIG. 4, the method steps S510 to S520 in FIG. 5, and the method steps S610 to S650 in FIG. 6.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed, perform the terminal-side video text summarization method based on the multimodal model. The method includes: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result.
  • The encoder of the transformer model is provided with a sub-layer for fusing image features and text features; the sub-layer includes Text-vision fusion and Add&Norm.
  • The computer-executable instructions, when executing the terminal-side video text summarization method based on the multimodal model, implement inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain the text summary result as follows: inputting the video feature vectors and the multiple pieces of word information into the encoder of the transformer model provided with the sub-layer for fusing image features and text features to perform fusion processing to obtain fusion encoding information; and transferring the fusion encoding information to the decoder for decoding processing to generate the text summary result.
  • The fusion processing includes: inputting the multiple pieces of word information into the encoder of the transformer model, and sequentially extracting features from the multiple pieces of word information through the first sub-layer and the second sub-layer in the encoder to obtain text feature vectors, where the first sub-layer includes multi-head self-attention and Add&Norm, and the second sub-layer includes FFN and Add&Norm; and inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features for fusion processing to obtain the fusion encoding information.
  • Inputting the text feature vectors and the video feature vectors into the sub-layer for fusing image-class features and text-class features to perform fusion processing to obtain the fusion encoding information includes: inputting the text feature vectors and the video feature vectors into the sub-layer; performing matrix multiplication on Zv and the weight matrix to obtain Z′v, where Zv is the video feature vector; performing matrix transposition on Z′v to obtain (Z′v)ᵀ; multiplying (Z′v)ᵀ with Zt and applying the softmax function to obtain A, where A is the attention weight; multiplying A and Zv to obtain AZv; and vector-concatenating AZv with Zt to obtain Z′t, where Z′t is the fusion encoding information.
  • Performing feature extraction on the video data to obtain the video feature vectors includes: using the 3D ResNet-101 model to perform feature extraction on the video data to obtain the video feature vectors.
  • Segmenting the text information to obtain the multiple pieces of word information includes: using the hanlp word segmentation tool to segment the text information to obtain multiple pieces of word information arranged in the same line, and splitting the multiple pieces of word information line by line to obtain multiple pieces of word information arranged in a line-per-word dictionary structure.
  • For example, the method steps S100 to S600 in FIG. 2, the method steps S410 to S420 in FIG. 4, the method steps S510 to S520 in FIG. 5, and the method steps S610 to S650 in FIG. 6 described above are performed.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as is known to those of ordinary skill in the art. It should be noted that the computer-readable storage medium may be non-volatile or volatile.

Abstract

A video text summarization method based on a multi-modal model, a device and a storage medium, relating to artificial intelligence technology. The method comprises the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted (S100); vectorizing the video features to obtain a video feature vector (S200); performing speech extraction on the video data to obtain monologue speech information (S300); by means of automatic speech recognition (ASR), converting the monologue speech information into text information (S400); performing word segmentation processing on the text information to obtain a plurality of pieces of word information (S500); and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, a sub-layer used for fusing image-class features and text-class features being arranged in an encoder of the transformer model (S600). The accuracy of text summary content extracted from a video can be improved.

Description

Video text summarization method, device and storage medium based on a multimodal model
This application claims priority to the Chinese patent application No. 202210056075.6, filed with the China Patent Office on January 18, 2022 and entitled "Video text summarization method, device and storage medium based on multimodal model", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multimodal model.
Background
At present, most intelligent video summarization methods in the industry are text-extractive: videos contain spoken monologues, and the summary is usually generated by converting the speech into text with automatic speech recognition (ASR) technology and then applying natural language processing techniques.
Technical Problem
The inventors have recognized the following technical problem in the prior art: the current processing method mainly scores each sentence for importance and extracts the highest-scoring sentences. Because these sentences are pulled out of the monologue in isolation, the resulting summary is incoherent from sentence to sentence.
技术解决方案technical solution
第一方面,本申请实施例提供了一种基于多模态模型的视频文本摘要方法,包括:In the first aspect, the embodiment of the present application provides a video text summarization method based on a multimodal model, including:
对视频数据进行特征提取,得到视频特征,所述视频数据为需要提取文本摘要的视频数据;Carry out feature extraction to video data, obtain video feature, described video data is the video data that needs to extract text summarization;
将所述视频特征进行向量化处理,得到视频特征向量;The video features are vectorized to obtain video feature vectors;
对所述视频数据进行语音提取,得到独白语音信息;Carry out speech extraction to described video data, obtain monologue speech information;
通过自动语音识别技术ASR将所述独白语音信息转换为文本信息;Converting the monologue voice information into text information through automatic speech recognition technology ASR;
将所述文本信息进行分词处理,得到多个词信息;performing word segmentation processing on the text information to obtain multiple word information;
将所述视频特征向量和多个所述词信息输入至transformer模型进行训练,得到文本摘要结果,所述transformer模型的编码器中设置用于将图像类特征和文本类特征进行融合的子层,所述子层包括Text-vision fusion和Add&Norm。The video feature vector and a plurality of the word information are input to the transformer model for training, and the text summary result is obtained. In the encoder of the transformer model, a sub-layer for merging the image class feature and the text class feature is set, and the sub-layer includes Text-vision fusion and Add&Norm.
In a second aspect, an embodiment of the present application provides a video text summarization device based on a multimodal model, including:
a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
a vectorization module, configured to vectorize the video features to obtain video feature vectors;
a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR) technology;
a word segmentation module, configured to segment the text information to obtain multiple pieces of word information;
a training module, configured to input the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
In a third aspect, an embodiment of the present application provides a device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method including:
performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
vectorizing the video features to obtain video feature vectors;
performing speech extraction on the video data to obtain monologue speech information;
converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
segmenting the text information to obtain multiple pieces of word information;
inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing a video text summarization method based on a multimodal model, the method including:
performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
vectorizing the video features to obtain video feature vectors;
performing speech extraction on the video data to obtain monologue speech information;
converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
segmenting the text information to obtain multiple pieces of word information;
inputting the video feature vectors and the multiple pieces of word information into a transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
Beneficial Effects
In the video text summarization method, device and storage medium based on a multimodal model proposed in the embodiments of the present application, a sub-layer for fusing image-class features and text-class features is newly provided in the encoder of the transformer model, so that the text summary result obtained by training the video feature vectors and the dictionary in the transformer model is more accurate; that is, the accuracy of the text summary content extracted from a video can be effectively improved.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a system architecture platform for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the improved transformer model in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 4 is a flowchart of generating a text summary result in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of generating fusion encoding information in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in the video text summarization method based on a multimodal model provided by an embodiment of the present application.
Embodiments of the Present Invention
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit it.
It should be noted that although functional modules are divided in the device schematic diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, the claims or the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
An embodiment of the present application provides a video text summarization method based on a multimodal model. The method includes the following steps: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain video feature vectors; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR) technology; segmenting the text information to obtain multiple pieces of word information; and inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm. In the technical solution of this embodiment, the sub-layer newly provided in the encoder of the transformer model makes the text summary result obtained by training the video feature vectors and the dictionary in the transformer model more accurate, which can effectively improve the accuracy of the text summary content extracted from the video.
The embodiments of the present application are further described below in conjunction with the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a schematic diagram of a system architecture platform 100 for executing the video text summarization method based on a multimodal model provided by an embodiment of the present application.
In the example of Fig. 1, the system architecture platform 100 is provided with a processor 110 and a memory 120, where the processor 110 and the memory 120 may be connected via a bus or in other ways; in Fig. 1, connection via a bus is taken as an example.
As a non-transitory computer-readable storage medium, the memory 120 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 120 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 120 may optionally include memories located remotely from the processor 110, and these remote memories may be connected to the system architecture platform through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art can understand that the system architecture platform can be applied to 5G communication network systems and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
Those skilled in the art can understand that the system architecture platform shown in Fig. 1 does not limit the embodiments of the present application; it may include more or fewer components than shown, combine some components, or use a different arrangement of components.
The system architecture platform 100 may be an independent platform, or it may be a cloud platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Based on the above system architecture platform, various embodiments of the video text summarization method based on a multimodal model of the present application are proposed below.
As shown in Fig. 2, Fig. 2 is a flowchart of the video text summarization method based on a multimodal model provided by an embodiment of the present application. The method is applied to the above architecture platform and includes, but is not limited to, steps S100 to S600.
Step S100: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
Step S200: vectorizing the video features to obtain video feature vectors;
Step S300: performing speech extraction on the video data to obtain monologue speech information;
Step S400: converting the monologue speech information into text information through automatic speech recognition (ASR) technology;
Step S500: segmenting the text information to obtain multiple pieces of word information;
Step S600: inputting the video feature vectors and the multiple pieces of word information into the transformer model for training to obtain a text summary result, wherein the encoder of the transformer model is provided with a sub-layer for fusing image-class features and text-class features, the sub-layer comprising Text-vision fusion and Add&Norm.
在一实施例中,获取需要提取文本摘要的视频数据,视频数据包括视频图像信息和语音信息,其中语音信息包括独白语音信息和背景语音信息,然后对视频数据进行特征提取得到视频特征,并将视频特征进行向量化处理得到视频特征向量,其中视频特征向量为后续训练步骤做准备;同时可以对视频数据进行语音提取得到独白语音信息,并通过自动语音识别技术ASR将独白语音信息转换为文本信息,再将文本信息进行分词处理得到多个词信息;接着将已经处理好的视频特征向量和多个词信息输入至transformer模型中进行训练,得到经过将图像类特征和文本类特征进行融合的子层训练处理后的文本摘要结果。由于在transformer模型的编码器中新设置用于将图像类特征和文本类特征进行融合的子层,该子层包括Text-vision fusion和Add&Norm,使得视频特征向量和字典在transformer模型训练得出的文本摘要结果更加准确,能够有效改善在视频中提取的文本摘要内容的准确性。需要说明的是,可以利用3D ResNet-101模型对视频数据进行特征提取得到视频特征向量,也可以利用其它模型对视频数据中的视频特征进行提取,本实施例对其不作唯一限定。In one embodiment, the video data that needs to be extracted from the text summary is obtained. The video data includes video image information and voice information, wherein the voice information includes monologue voice information and background voice information, then feature extraction is performed on the video data to obtain video features, and the video features are vectorized to obtain video feature vectors, wherein the video feature vectors are prepared for subsequent training steps; at the same time, voice extraction can be performed on the video data to obtain monologue voice information, and the monologue voice information is converted into text information by automatic speech recognition technology ASR, and then the text information is subjected to word segmentation processing to obtain multiple word information; The video feature vector and multiple word information are input into the transformer model for training, and the text summarization result after the sub-layer training processing of the fusion of image-like features and text-like features is obtained. Since the new sub-layer for fusing image-like features and text-like features is set in the encoder of the transformer model, this sub-layer includes Text-vision fusion and Add&Norm, which makes the text summary results obtained by training the video feature vector and dictionary in the transformer model more accurate, and can effectively improve the accuracy of the text summary content extracted from the video. It should be noted that the 3D ResNet-101 model can be used to extract features from video data to obtain video feature vectors, and other models can also be used to extract video features from video data, which is not uniquely limited in this embodiment.
It can be understood that the step of segmenting the text information into a plurality of pieces of word information may use the hanlp word segmentation tool to segment the text information directly into a plurality of pieces of word information, or may use the hanlp tool to obtain a plurality of pieces of word information arranged on a single line and then split them line by line to obtain word information arranged in a line-per-word dictionary structure; this embodiment does not specifically limit the approach. It can be understood that the dictionary in this embodiment includes a plurality of pieces of word information; each piece of word information occupies an independent line in the dictionary and corresponds to a line position number.
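For illustration only, the following sketch builds the line-per-word dictionary just described; it assumes the pyhanlp binding (whose HanLP.segment call returns terms with a word attribute), and the sample sentence and file name words.dict are invented:

```python
from pyhanlp import HanLP  # assumes the pyhanlp binding is installed

text = "今天的视频介绍了基于多模态模型的文本摘要方法"
words = [term.word for term in HanLP.segment(text)]  # word segmentation

# Line-per-word dictionary: each word on its own line, its line number
# serving as the word's position index, as described above.
with open("words.dict", "w", encoding="utf-8") as f:
    for line_no, word in enumerate(words):
        f.write(word + "\n")
        print(line_no, word)
```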
It should be noted that, referring to FIG. 3, this embodiment adds a sublayer to the Encoder Layer of the conventional transformer model; this sublayer fuses image-class features and text-class features and comprises Text-vision fusion and Add&Norm. The improved transformer model therefore includes an Encoder side (encoder) and a Decoder side (decoder). The Encoder Layer on the Encoder side includes three sublayers: a first sublayer (multi-head self-attention and Add&Norm), a second sublayer (FFN and Add&Norm) and a third sublayer (Text-vision fusion and Add&Norm). The Decoder Layer on the Decoder side includes three sublayers: a fourth sublayer (masked multi-head self-attention and Add&Norm), a fifth sublayer (multi-head Enc-Dec attention and Add&Norm) and a sixth sublayer (FFN and Add&Norm).
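As a structural sketch (the hyperparameters and the residual placement of the third Add&Norm are assumptions, since the text does not fix them), the modified Encoder Layer could be composed in PyTorch as follows, with the fusion math matching the formulas given later in this document:

```python
import torch
import torch.nn as nn

class TextVisionFusion(nn.Module):
    """Sketch of the fusion sublayer: Z'_t = Concat(Z_t, A Z_v) W_2."""
    def __init__(self, d_model):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)      # W_1
        self.w2 = nn.Linear(2 * d_model, d_model, bias=False)  # W_2

    def forward(self, z_t, z_v):
        a = torch.softmax(z_t @ self.w1(z_v).transpose(-2, -1), dim=-1)
        return self.w2(torch.cat([z_t, a @ z_v], dim=-1))

class ModifiedEncoderLayer(nn.Module):
    """Encoder layer with the extra text-vision fusion sublayer (sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # First sublayer: multi-head self-attention + Add&Norm
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Second sublayer: FFN + Add&Norm
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # Third sublayer (new here): Text-vision fusion + Add&Norm
        self.fusion = TextVisionFusion(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, video):
        attn_out, _ = self.self_attn(text, text, text)
        z = self.norm1(text + attn_out)                # Add&Norm
        z = self.norm2(z + self.ffn(z))                # Add&Norm
        return self.norm3(z + self.fusion(z, video))   # assumed residual

layer = ModifiedEncoderLayer()
text = torch.randn(2, 20, 512)   # (batch, text tokens, d_model)
video = torch.randn(2, 8, 512)   # (batch, video clips, d_model)
print(layer(text, video).shape)  # torch.Size([2, 20, 512])
```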
It should be noted that Text Inputs is the segmented input text (the dictionary). In the conventional transformer model these words correspond to three embeddings: a token embedding (also called a word embedding), whose role is to map human language into a geometric space; a segment embedding; and a positional embedding. This embodiment uses only the token embedding and the segment embedding. The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N*512 to obtain a token embedding of vector length 512, where the text inputs form a vector whose dimension is the number of dictionary lines N, each word in the text inputs corresponds to one position in the dictionary, that position is marked with 1, and the remaining positions are marked with 0.
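Worked as a minimal numpy example (N, the random weights and the line number are invented for illustration), the token-embedding lookup described above is a one-hot vector of length N multiplied by the N*512 weight matrix, which simply selects one row of that matrix:

```python
import numpy as np

N, d = 10000, 512                 # dictionary lines, embedding width
rng = np.random.default_rng(0)
W = rng.normal(size=(N, d))       # the N*512 token-embedding weight matrix

word_line_no = 42                 # the word's line position in the dictionary
one_hot = np.zeros(N)
one_hot[word_line_no] = 1.0       # mark the word's position with 1, rest 0

token_embedding = one_hot @ W     # length-512 token embedding
assert np.allclose(token_embedding, W[word_line_no])  # selects row 42
```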
Referring to FIG. 4, in an embodiment, step S600 includes, but is not limited to, step S410 and step S420.
Step S410: input the video feature vector and the dictionary into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information;
Step S420: pass the fused encoding information to the decoder for decoding to generate the text summarization result.
Specifically, the video feature vector and the dictionary may each be input into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; the fused encoding information is then passed to the decoder for decoding, thereby generating the text summarization result. Because the sublayer that fuses image-class features and text-class features is newly added to the encoder, the result is not trained purely from text feature vectors, so the summary obtained from the video feature vector and the dictionary is more accurate, effectively improving the accuracy of the text summary extracted from the video.
Referring to FIG. 5, in an embodiment, step S410 includes, but is not limited to, step S510 and step S520.
Step S510: input the dictionary into the encoder of the transformer model and pass it in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm;
Step S520: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
Specifically, the dictionary may first be input into the encoder of the transformer model for processing: it passes in turn through the first sublayer (multi-head self-attention and Add&Norm) and the second sublayer (FFN and Add&Norm) to obtain the text feature vector, and the text feature vector and the video feature vector are then input into the newly added sublayer, where image-class features and text-class features are fused to obtain the fused encoding information. Because the fused encoding information is formed by fusing the text feature vector with the video feature vector rather than being trained purely from the text feature vector, the text summary generated by decoding it is more accurate, effectively improving the accuracy of the text summary content extracted from the video.
Referring to FIG. 6, in an embodiment, step S520 includes, but is not limited to, steps S610, S620, S630, S640 and S650.
Step S610: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
Step S620: matrix-multiply Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector;
Step S630: transpose Z′_v to obtain (Z′_v)^T;
Step S640: matrix-multiply (Z′_v)^T by Z_t and process the product through the softmax function to obtain A, where A is the attention weight;
Step S650: multiply A by Z_v to obtain AZ_v, and concatenate AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
Specifically, the text feature vector and the video feature vector are input into the sublayer for fusing image-class features and text-class features. In this sublayer, Z_v is first matrix-multiplied by the weight matrix to obtain Z′_v; Z′_v is then transposed to obtain (Z′_v)^T; (Z′_v)^T is matrix-multiplied by Z_t and processed through the softmax function to obtain A; A is multiplied by Z_v to obtain AZ_v; and AZ_v is concatenated with Z_t to obtain Z′_t. Here Z_v is the video feature vector, A is the attention weight and Z′_t is the fused encoding information.
In an embodiment, the structure of the Text-vision fusion within the sublayer newly added to the encoder of the transformer model for fusing image-class features and text-class features is shown in FIG. 7. Based on this structure, the fusion of the text feature vector and the video feature vector in Text-vision fusion can be expressed mathematically as follows:

Z′_v = Z_v W_1

A = softmax(Z_t (Z′_v)^T)

Z′_t = Concat(Z_t, AZ_v) W_2

where W_1 and W_2 are weight matrices, Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight and Z′_t is the fused encoding information.
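Translated directly into numpy (a minimal sketch; all shapes are invented for illustration, with the text and video features assumed to share width d), the three formulas above read:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_t, n_v, d = 20, 8, 512          # text tokens, video clips, feature width
rng = np.random.default_rng(0)
Z_t = rng.normal(size=(n_t, d))   # text feature vectors
Z_v = rng.normal(size=(n_v, d))   # video feature vectors
W_1 = rng.normal(size=(d, d))     # weight matrix W_1
W_2 = rng.normal(size=(2 * d, d)) # weight matrix W_2

Z_v_p = Z_v @ W_1                                      # Z'_v = Z_v W_1
A = softmax(Z_t @ Z_v_p.T)                             # softmax(Z_t (Z'_v)^T)
Z_t_p = np.concatenate([Z_t, A @ Z_v], axis=-1) @ W_2  # Z'_t
print(Z_t_p.shape)                                     # (20, 512)
```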
Because the method of this embodiment introduces image-class features, it achieves better automatic text summarization performance, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
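For reference, such scores can be computed with the open-source rouge-score package; the reference and generated summaries below are invented purely to show the call:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "the speaker introduces a multimodal summarization method"
generated = "the speaker presents a multimodal video summarization method"
scores = scorer.score(reference, generated)  # target first, then prediction
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))
```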
Based on the above video text summarization method based on a multimodal model, embodiments of the present application's video text summarization apparatus, controller and computer-readable storage medium are respectively presented below.
An embodiment of the present application further provides a video text summarization apparatus based on a multimodal model, including:
a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
a vectorization module, configured to vectorize the video features to obtain a video feature vector;
a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR);
a word segmentation module, configured to perform word segmentation on the text information to obtain a plurality of pieces of word information;
a training module, configured to input the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information, and to pass the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the transformer model and pass it in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm, and to input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, the training module is further configured to: input the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiply Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transpose Z′_v to obtain (Z′_v)^T; matrix-multiply (Z′_v)^T by Z_t and process the product through the softmax function to obtain A, where A is the attention weight; and multiply A by Z_v to obtain AZ_v and concatenate AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, the extraction module is further configured to perform feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line, and to split the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
It should be noted that the embodiments of the above video text summarization apparatus based on a multimodal model use the same technical means, solve the same technical problems and achieve the same technical effects as the embodiments of the video text summarization method based on a multimodal model; details are not repeated here and can be found in the method embodiments.
In addition, an embodiment of the present application provides a device, including a memory, a processor and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, it implements a video text summarization method based on a multimodal model, the method including: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of pieces of word information; and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result includes: inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and passing the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information includes: inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, when the processor executes the computer program, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information includes: inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiplying Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transposing Z′_v to obtain (Z′_v)^T; matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, where A is the attention weight; and multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, when the processor executes the computer program, performing feature extraction on the video data to obtain the video feature vector includes: performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, when the processor executes the computer program, performing word segmentation on the text information to obtain the plurality of pieces of word information includes: performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
The processor and the memory may be connected by a bus or in other ways.
It should be noted that the device in this embodiment may correspond to a device including the memory and the processor of the embodiment shown in FIG. 1 and can form part of the system architecture platform of that embodiment; the two belong to the same inventive concept and therefore share the same implementation principles and beneficial effects, which are not detailed here.
The non-transitory software programs and instructions required to implement the device-side video text summarization method based on a multimodal model of the above embodiments are stored in the memory and, when executed by the processor, carry out that method, for example performing method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5 and method steps S610 to S650 in FIG. 6 described above.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions for executing the above terminal-side video text summarization method based on a multimodal model, the method including: performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing speech extraction on the video data to obtain monologue speech information; converting the monologue speech information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of pieces of word information; and inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, where the encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result, which includes: inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and passing the fused encoding information to the decoder for decoding to generate the text summarization result.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information, which includes: inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through the first sublayer and the second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information, which includes: inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features; matrix-multiplying Z_v by a weight matrix to obtain Z′_v, where Z_v is the video feature vector; transposing Z′_v to obtain (Z′_v)^T; matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, where A is the attention weight; and multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, where Z′_t is the fused encoding information.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, performing feature extraction on the video data to obtain the video feature vector, which includes: performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the computer-executable instructions are used to execute, in the above terminal-side method, performing word segmentation on the text information to obtain the plurality of pieces of word information, which includes: performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
For example, method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5 and method steps S610 to S650 in FIG. 6 described above are performed.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. It should be noted that the computer-readable storage medium may be non-volatile or volatile.
The preferred implementations of the present application have been described above, but the present application is not limited to the above embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present application, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A video text summarization method based on a multimodal model, comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  2. The video text summarization method based on a multimodal model according to claim 1, wherein inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  3. The video text summarization method based on a multimodal model according to claim 2, wherein inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  4. The video text summarization method based on a multimodal model according to claim 3, wherein inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  5. The video text summarization method based on a multimodal model according to claim 1, wherein performing feature extraction on the video data to obtain the video feature vector comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  6. The video text summarization method based on a multimodal model according to claim 1, wherein performing word segmentation on the text information to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information.
  7. The video text summarization method based on a multimodal model according to claim 6, wherein performing word segmentation on the text information using the hanlp word segmentation tool to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
  8. A video text summarization apparatus based on a multimodal model, comprising:
    a first extraction module, configured to perform feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    a vectorization module, configured to vectorize the video features to obtain a video feature vector;
    a second extraction module, configured to perform speech extraction on the video data to obtain monologue speech information;
    a conversion module, configured to convert the monologue speech information into text information through automatic speech recognition (ASR);
    a word segmentation module, configured to perform word segmentation on the text information to obtain a plurality of pieces of word information; and
    a training module, configured to input the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  9. A device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the computer program, a video text summarization method based on a multimodal model is implemented, the method comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  10. The device according to claim 9, wherein, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  11. The device according to claim 10, wherein, when the processor executes the computer program, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  12. The device according to claim 11, wherein, when the processor executes the computer program, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  13. The device according to claim 9, wherein, when the processor executes the computer program, performing feature extraction on the video data to obtain the video feature vector comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  14. The device according to claim 9, wherein, when the processor executes the computer program, performing word segmentation on the text information to obtain the plurality of pieces of word information comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute a video text summarization method based on a multimodal model, the method comprising:
    performing feature extraction on video data to obtain video features, the video data being video data from which a text summary needs to be extracted;
    vectorizing the video features to obtain a video feature vector;
    performing speech extraction on the video data to obtain monologue speech information;
    converting the monologue speech information into text information through automatic speech recognition (ASR);
    performing word segmentation on the text information to obtain a plurality of pieces of word information; and
    inputting the video feature vector and the plurality of pieces of word information into a transformer model for training to obtain a text summarization result, wherein an encoder of the transformer model is provided with a sublayer for fusing image-class features and text-class features, the sublayer comprising Text-vision fusion and Add&Norm.
  16. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the video feature vector and the plurality of pieces of word information into the transformer model for training to obtain the text summarization result, which comprises:
    inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image features and text features, for fusion processing to obtain fused encoding information; and
    passing the fused encoding information to a decoder for decoding to generate the text summarization result.
  17. The computer-readable storage medium according to claim 16, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the video feature vector and the plurality of pieces of word information into the encoder of the transformer model, which is provided with the sublayer for fusing image-class features and text-class features, for fusion processing to obtain the fused encoding information, which comprises:
    inputting the plurality of pieces of word information into the encoder of the transformer model and passing them in turn through a first sublayer and a second sublayer of the encoder for extraction to obtain a text feature vector, the first sublayer comprising multi-head self-attention and Add&Norm and the second sublayer comprising FFN and Add&Norm; and
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information.
  18. The computer-readable storage medium according to claim 17, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features for fusion processing to obtain the fused encoding information, which comprises:
    inputting the text feature vector and the video feature vector into the sublayer for fusing image-class features and text-class features;
    matrix-multiplying Z_v by a weight matrix to obtain Z′_v, wherein Z_v is the video feature vector;
    transposing Z′_v to obtain (Z′_v)^T;
    matrix-multiplying (Z′_v)^T by Z_t and processing the product through the softmax function to obtain A, wherein A is the attention weight; and
    multiplying A by Z_v to obtain AZ_v and concatenating AZ_v with Z_t to obtain Z′_t, wherein Z′_t is the fused encoding information.
  19. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, performing feature extraction on the video data to obtain the video feature vector, which comprises:
    performing feature extraction on the video data using a 3D ResNet-101 model to obtain the video feature vector.
  20. The computer-readable storage medium according to claim 15, wherein the computer-executable instructions are used to execute, in the video text summarization method based on a multimodal model, performing word segmentation on the text information to obtain the plurality of pieces of word information, which comprises:
    performing word segmentation on the text information using the hanlp word segmentation tool to obtain a plurality of pieces of word information arranged on a single line; and
    splitting the plurality of pieces of word information line by line to obtain word information arranged in a line-per-word dictionary structure.
PCT/CN2022/090712 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium WO2023137913A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model
CN202210056075.6 2022-01-18

Publications (1)

Publication Number Publication Date
WO2023137913A1 true WO2023137913A1 (en) 2023-07-27

Family

ID=81231310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090712 WO2023137913A1 (en) 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium

Country Status (2)

Country Link
CN (1) CN114398889A (en)
WO (1) WO2023137913A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117521017B (en) * 2024-01-03 2024-04-05 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN111767461B (en) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 Data processing method and device
CN113159010B (en) * 2021-03-05 2022-07-22 北京百度网讯科技有限公司 Video classification method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417134A (en) * 2020-10-30 2021-02-26 同济大学 Automatic abstract generation system and method based on voice text deep fusion features
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium
CN113052149A (en) * 2021-05-20 2021-06-29 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113609285A (en) * 2021-08-09 2021-11-05 福州大学 Multi-mode text summarization system based on door control fusion mechanism
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIEZHENG YU; WENLIANG DAI; ZIHAN LIU; PASCALE FUNG: "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization", arXiv.org, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 1 October 2021 (2021-10-01), XP091069943 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474094A (en) * 2023-12-22 2024-01-30 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer
CN117474094B (en) * 2023-12-22 2024-04-09 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer

Also Published As

Publication number Publication date
CN114398889A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
WO2023137913A1 (en) Video text summarization method based on multi-modal model, device and storage medium
US20240054767A1 (en) Multi-modal Model Training Method, Apparatus and Device, and Storage Medium
CN109388793B (en) Entity marking method, intention identification method, corresponding device and computer storage medium
WO2019200923A1 (en) Pinyin-based semantic recognition method and device and human-machine conversation system
KR102124466B1 (en) Apparatus and method for generating conti for webtoon
CN111709243A (en) Knowledge extraction method and device based on deep learning
WO2023134088A1 (en) Video summary generation method and apparatus, electronic device, and storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN114092930B (en) Character recognition method and system
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
WO2022141864A1 (en) Conversation intent recognition model training method, apparatus, computer device, and medium
US20220310070A1 (en) Artificial Intelligence System for Capturing Context by Dilated Self-Attention
CN112016271A (en) Language style conversion model training method, text processing method and device
CN113157959A (en) Cross-modal retrieval method, device and system based on multi-modal theme supplement
CN116434752A (en) Speech recognition error correction method and device
CN117010500A (en) Visual knowledge reasoning question-answering method based on multi-source heterogeneous knowledge joint enhancement
CN111930894A (en) Long text matching method and device, storage medium and electronic equipment
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
CN114973229A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
WO2022095370A1 (en) Text matching method and apparatus, terminal device, and storage medium
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
WO2023178802A1 (en) Named entity recognition method and apparatus, device, and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921344

Country of ref document: EP

Kind code of ref document: A1