CN114398889A - Video text summarization method, device and storage medium based on multi-modal model - Google Patents

Video text summarization method, device and storage medium based on multi-modal model

Info

Publication number
CN114398889A
Authority
CN
China
Prior art keywords
text
video
information
features
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210056075.6A
Other languages
Chinese (zh)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210056075.6A priority Critical patent/CN114398889A/en
Publication of CN114398889A publication Critical patent/CN114398889A/en
Priority to PCT/CN2022/090712 priority patent/WO2023137913A1/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology. Embodiments of the invention provide a video text summarization method, device and storage medium based on a multi-modal model, wherein the method comprises the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model. This can improve the accuracy of the text summary content extracted from the video.

Description

Video text summarization method, device and storage medium based on multi-modal model
Technical Field
Embodiments of the invention relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multi-modal model.
Background
At present, most industrial approaches to intelligent video summarization are text-extraction methods. Because a video contains a speech monologue, Automatic Speech Recognition (ASR) is usually used to convert the speech into text, and natural language processing techniques are then applied, for example, to compute the importance of each sentence and extract the sentences with the highest importance scores.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The main object of the embodiments of the invention is to provide a video text summarization method based on a multi-modal model, which can effectively improve the accuracy of the text summary content extracted from a video.
In a first aspect, an embodiment of the present invention provides a video text summarization method based on a multi-modal model, including:
performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
vectorizing the video features to obtain a video feature vector;
performing voice extraction on the video data to obtain monologue voice information;
converting the monologue voice information into text information through automatic speech recognition (ASR);
performing word segmentation on the text information to obtain a plurality of word information;
and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the inputting the video feature vector and the dictionary into a Transformer model for training to obtain a text summarization result includes:
inputting the video feature vector and the dictionary into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
and transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the inputting the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing and obtain fused encoding information includes:
inputting the dictionary into the encoder of the Transformer model, and extracting the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
and inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain the fused encoding information.
In an embodiment, the inputting the text feature vector and the video feature vector into a sub-layer for fusing image features and text features to obtain fused encoding information includes:
inputting the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features;
multiplying Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
transposing Z'_v to obtain Z'_v^T;
matrix-multiplying Z'_v^T with Z_t and performing calculation through the softmax function to obtain A, wherein A is the attention weight and Z_t is the text feature vector;
and multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the performing feature extraction on the video data to obtain the video feature vector includes:
performing feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the performing word segmentation on the text information to obtain a plurality of word information includes:
performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information includes:
performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and performing line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
In a second aspect, an embodiment of the present invention provides a video text summarization apparatus based on a multi-modal model, including:
a first extraction module, used for performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
a vectorization module, used for vectorizing the video features to obtain a video feature vector;
a second extraction module, used for performing voice extraction on the video data to obtain monologue voice information;
a conversion module, used for converting the monologue voice information into text information through automatic speech recognition (ASR);
a word segmentation module, used for performing word segmentation on the text information to obtain a plurality of word information;
and a training module, used for inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing, so as to obtain fused encoding information; and transmit the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the Transformer model, and extract the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm; and input the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain fused encoding information.
In an embodiment, the training module is further configured to input the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features; multiply Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector; transpose Z'_v to obtain Z'_v^T; matrix-multiply Z'_v^T with Z_t and perform calculation through the softmax function to obtain A, wherein A is the attention weight; multiply A with Z_v to obtain AZ_v, and perform vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the first extraction module is further configured to perform feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and perform line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
In a third aspect, an embodiment of the present invention provides a device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video text summarization method based on a multi-modal model according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions for performing the video text summarization method based on a multi-modal model according to the first aspect.
The embodiments of the invention include the following: a video text summarization method based on a multi-modal model, comprising the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm. In the technical solution of this embodiment, a sub-layer for fusing image features and text features is newly provided in the encoder of the Transformer model, so that the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate; that is, the accuracy of the text summary content extracted from the video can be effectively improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
FIG. 1 is a schematic diagram of a system architecture platform for performing a multimodal model-based video text summarization method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for summarizing text in a video based on a multimodal model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the improved Transformer model in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a text summarization result in a multi-modal model-based video text summarization method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of generating fusion coded information in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in a video text summarization method based on a multi-modal model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
An embodiment of the present invention provides a video text summarization method based on a multi-modal model, including the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm. In the technical solution of this embodiment, a sub-layer for fusing image features and text features is newly arranged in the encoder of the Transformer model, so that the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in FIG. 1, FIG. 1 is a schematic diagram of a system architecture platform 100 for executing the video text summarization method based on a multi-modal model according to an embodiment of the present invention.
In the example of fig. 1, the system architecture platform 100 is provided with a processor 110 and a memory 120, wherein the processor 110 and the memory 120 may be connected by a bus or other means, and fig. 1 illustrates the connection by the bus as an example.
The memory 120, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 120 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to the system architecture platform via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood by those skilled in the art that the system architecture platform may be applied to a 5G communication network system, a mobile communication network system evolved later, and the like, and the embodiment is not limited thereto.
Those skilled in the art will appreciate that the system architecture platform illustrated in FIG. 1 does not constitute a limitation on embodiments of the invention, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The system architecture platform 100 may be an independent system architecture platform, or may be a cloud system architecture platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms.
Based on the system architecture platform, the following provides various embodiments of the video text summarization method based on the multi-modal model.
As shown in FIG. 2, FIG. 2 is a flowchart of a video text summarization method based on a multi-modal model according to an embodiment of the present invention. The method is applied to the above-mentioned system architecture platform and includes, but is not limited to, step S100, step S200, step S300, step S400, step S500, and step S600.
Step S100, performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
Step S200, vectorizing the video features to obtain a video feature vector;
Step S300, performing voice extraction on the video data to obtain monologue voice information;
Step S400, converting the monologue voice information into text information through automatic speech recognition (ASR);
Step S500, performing word segmentation on the text information to obtain a plurality of word information;
Step S600, inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In one embodiment, video data from which a text summary needs to be extracted is obtained. The video data includes video image information and voice information, and the voice information includes monologue voice information and background voice information. Feature extraction is then performed on the video data to obtain video features, and the video features are vectorized to obtain a video feature vector, in preparation for the subsequent training step. Meanwhile, voice extraction can be performed on the video data to obtain the monologue voice information, the monologue voice information is converted into text information through automatic speech recognition (ASR), and word segmentation is then performed on the text information to obtain a plurality of word information. The processed video feature vector and the plurality of word information are then input into the Transformer model for training, and a text summarization result is obtained after the training processing of the sub-layer that fuses image features and text features. Because the sub-layer for fusing image-class features and text-class features, comprising Text-vision fusion and Add & Norm, is newly arranged in the encoder of the Transformer model, the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved. It should be noted that a 3D ResNet-101 model may be used to perform feature extraction on the video data to obtain the video feature vector, or other models may be used to extract the video features in the video data, which is not limited in this embodiment.
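For orientation, the following is a minimal Python sketch of the end-to-end flow just described. It is illustrative only: the helper callables (extract_frames, backbone, asr_transcribe, segment, model) are hypothetical placeholders standing in for the frame loader, the 3D convolutional feature extractor, the ASR engine, the word segmenter, and the modified Transformer, none of which are named as concrete software in this disclosure.

    # Illustrative sketch; every helper passed in is a hypothetical placeholder.
    def summarize_video(video_path, extract_frames, backbone, asr_transcribe, segment, model):
        frames = extract_frames(video_path)   # step S100: load clip tensor(s) from the video
        z_v = backbone(frames)                # steps S100/S200: video feature vectors, e.g. via 3D ResNet-101
        text = asr_transcribe(video_path)     # steps S300/S400: monologue speech -> text via ASR
        words = segment(text)                 # step S500: word segmentation -> list of word information
        return model(z_v, words)              # step S600: modified Transformer -> text summary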
It can be understood that the step of performing word segmentation on the text information to obtain a plurality of word information may be: performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information; or performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line, and then performing line-splitting on the word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure, which is not specifically limited in this embodiment. It can be understood that the dictionary in this embodiment includes a plurality of word information, each word information occupies its own line in the dictionary, and each word information corresponds to a line position number.
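As an illustration of the segment-then-line-split step, the sketch below uses the open-source jieba segmenter; the machine translation of this disclosure does not clearly preserve the name of the segmentation tool, so jieba is an assumed stand-in, and any Chinese word segmenter yields the same one-word-per-line dictionary structure with line position numbers.

    import jieba  # assumed stand-in for the word segmentation tool named in the original

    text = "今天天气很好我们去公园散步"
    words = list(jieba.cut(text))                      # word information arranged in the same line
    dictionary = {i: w for i, w in enumerate(words)}   # line-split: one word per line position number

    for line_no, word in dictionary.items():
        print(line_no, word)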
It should be noted that, referring to FIG. 3, in this embodiment a sub-layer is added to the Encoder Layer of the conventional Transformer model; this sub-layer is used for fusing image-class features and text-class features and comprises Text-vision fusion and Add & Norm. The improved Transformer model thus includes an encoder (Encoder) and a decoder (Decoder). Each Encoder Layer in the encoder includes three sub-layers: a first sub-layer (multi-head self-attention and Add & Norm), a second sub-layer (FFN and Add & Norm), and a third sub-layer (Text-vision fusion and Add & Norm). Each Decoder Layer in the decoder includes three sub-layers: a fourth sub-layer (masked multi-head self-attention and Add & Norm), a fifth sub-layer (multi-head encoder-decoder attention and Add & Norm), and a sixth sub-layer (FFN and Add & Norm).
It should be noted that Text Inputs is the segmented text (the dictionary) fed to the model. In the conventional Transformer model, each word corresponds to three embeddings: token embedding, which maps the word into a geometric vector space; segment embedding, which marks the segment the word belongs to; and position embedding, which encodes the word's position in the sequence. In this embodiment, only token embedding and segment embedding are used. The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N × 512 to obtain a token embedding of vector length 512, where the text inputs are vectors whose dimension is the number of dictionary lines N, each word in the text inputs corresponds to one line position in the dictionary, that position is marked 1, and the remaining positions are marked 0.
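The token-embedding computation described above reduces to a one-hot lookup into the N × 512 weight matrix. A minimal numpy sketch, where the dictionary size N = 5 and the random weights are placeholder assumptions:

    import numpy as np

    N, d = 5, 512                        # N = number of dictionary lines, d = 512 per the description
    W = np.random.randn(N, d)            # the N x 512 weight matrix (random placeholder)

    def token_embedding(line_pos):
        one_hot = np.zeros(N)            # text input: vector whose dimension is the dictionary size N
        one_hot[line_pos] = 1.0          # the word's line position is marked 1, the rest 0
        return one_hot @ W               # multiplication yields the 512-dimensional token embedding

    emb = token_embedding(2)             # embedding for the word on dictionary line 2
    assert emb.shape == (d,)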
Referring to FIG. 4, in an embodiment, step S600 includes, but is not limited to, step S410 and step S420.
Step S410, inputting the video feature vector and the dictionary into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
Step S420, transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
Specifically, the video feature vector and the dictionary may be respectively input into the encoder of the Transformer model, which is provided with a sub-layer for fusing image features and text features, to perform fusion processing and obtain fused encoding information; the fused encoding information is then transmitted to the decoder for decoding, generating the text summarization result. Because a sub-layer for fusing image features and text features is newly arranged in the encoder of the Transformer model, rather than training on text feature vectors alone, the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved.
Referring to FIG. 5, in one embodiment, step S410 includes, but is not limited to, step S510 and step S520.
Step S510, inputting the dictionary into the encoder of the Transformer model, and extracting the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
step S520, inputting the text feature vector and the video feature vector into a sublayer for fusing the image feature and the text feature to obtain fused coding information.
Specifically, the dictionary may be input into the encoder of the Transformer model for processing: the dictionary is extracted sequentially through the first sub-layer (multi-head self-attention and Add & Norm) and the second sub-layer (FFN and Add & Norm) in the encoder to obtain a text feature vector, and the text feature vector and the video feature vector are then input into the newly added sub-layer for image-feature and text-feature fusion processing to obtain fused encoding information.
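A compact PyTorch sketch of such a modified encoder layer follows. The hyperparameters (d_model = 512, 8 heads, a 4× FFN) and the residual connection around the fusion block are assumptions implied by "Add & Norm"; the fusion callable itself is the Text-vision fusion defined by the formulas given further below.

    import torch
    import torch.nn as nn

    class EncoderLayerWithFusion(nn.Module):
        """Sketch of the modified encoder layer: (1) multi-head self-attention + Add & Norm,
        (2) FFN + Add & Norm, (3) the newly added Text-vision fusion + Add & Norm."""
        def __init__(self, fusion, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norm2 = nn.LayerNorm(d_model)
            self.fusion = fusion   # callable (z_t, z_v) -> fused encoding, see formulas below
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, z_t, z_v):
            a, _ = self.attn(z_t, z_t, z_t)                 # first sub-layer: self-attention
            z_t = self.norm1(z_t + a)                       # Add & Norm
            z_t = self.norm2(z_t + self.ffn(z_t))           # second sub-layer: FFN + Add & Norm
            return self.norm3(z_t + self.fusion(z_t, z_v))  # third sub-layer: fusion + Add & Norm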
Referring to FIG. 6, in an embodiment, step S520 includes, but is not limited to, step S610, step S620, step S630, step S640, and step S650.
Step S610, inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features;
Step S620, multiplying Z_v by the weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
Step S630, transposing Z'_v to obtain Z'_v^T;
Step S640, matrix-multiplying Z'_v^T with Z_t and performing calculation through the softmax function to obtain A, wherein A is the attention weight;
Step S650, multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
Specifically, the text feature vector and the video feature vector are input into the sub-layer for fusing image features and text features. Z_v is first multiplied by the weight matrix to obtain Z'_v, Z'_v is transposed to obtain Z'_v^T, Z'_v^T is matrix-multiplied with Z_t and passed through the softmax function to obtain A, A is multiplied with Z_v to obtain AZ_v, and AZ_v is concatenated with Z_t to obtain Z'_t. Here Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight, and Z'_t is the fused encoding information.
In an embodiment, referring to FIG. 7, based on the structure of Text-vision fusion in the sub-layer added to the encoder of the Transformer model for fusing image-class features and text-class features, the sub-layer can be expressed by the following formulas:

Z'_v = Z_v · W_1

A = softmax(Z'_v^T · Z_t)

Z'_t = Concat(Z_t, A · Z_v) · W_2

wherein W_1 and W_2 are weight matrices, Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight, and Z'_t is the fused encoding information.
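Read with rows as tokens (Z_t of shape (n_t, d) for text and Z_v of shape (n_v, d) for video, an assumed convention chosen so the dimensions are consistent), the three formulas translate into the PyTorch sketch below; under this convention the patent's Z'_v^T · Z_t is computed as Z_t · (Z_v W_1)^T.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextVisionFusion(nn.Module):
        """Z'_v = Z_v W_1;  A = softmax(Z'_v^T Z_t);  Z'_t = Concat(Z_t, A Z_v) W_2."""
        def __init__(self, d=512):
            super().__init__()
            self.W1 = nn.Linear(d, d, bias=False)       # weight matrix W_1
            self.W2 = nn.Linear(2 * d, d, bias=False)   # weight matrix W_2

        def forward(self, z_t, z_v):
            scores = z_t @ self.W1(z_v).transpose(-2, -1)   # (n_t, n_v) similarity scores
            a = F.softmax(scores, dim=-1)                   # attention weight A
            az_v = a @ z_v                                  # A Z_v: video context per text token
            return self.W2(torch.cat([z_t, az_v], dim=-1))  # Z'_t: fused encoding information

    z_t, z_v = torch.randn(10, 512), torch.randn(6, 512)    # 10 text tokens, 6 video features
    print(TextVisionFusion()(z_t, z_v).shape)               # torch.Size([10, 512])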
Because image-class features are introduced, the method can achieve a better effect in automatic text summarization, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
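For reference, ROUGE scores of a generated summary against a reference summary can be computed with the open-source rouge-score package; this tooling choice is an assumption for illustration, as the patent does not name an evaluation implementation.

    # pip install rouge-score   (assumed evaluation tooling, not specified in this disclosure)
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "the model fuses video and text features to produce a summary"
    generated = "the model fuses video features and text to generate a summary"
    scores = scorer.score(reference, generated)
    print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)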
Based on the above video text summarization method based on a multi-modal model, various embodiments of the video text summarization apparatus, the device, and the computer-readable storage medium of the present invention are respectively proposed below.
An embodiment of the present invention further provides a video text summarization apparatus based on a multi-modal model, including:
the first extraction module is used for extracting the characteristics of video data to obtain video characteristics, wherein the video data is the video data needing to extract the text abstract;
the vectorization module is used for vectorizing the video features to obtain video feature vectors;
the second extraction module is used for carrying out voice extraction on the video data to obtain monologue voice information;
the conversion module is used for converting the uniwhite voice information into text information through an automatic voice recognition (ASR) technology;
the word segmentation module is used for carrying out word segmentation processing on the text information to obtain a plurality of word information;
and the training module is used for inputting the video feature vector and the word information into a transform model for training to obtain a Text summarization result, a sub-layer for fusing image features and Text features is arranged in an encoder of the transform model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing, so as to obtain fused encoding information; and transmit the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the Transformer model, and extract the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm; and input the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain fused encoding information.
In an embodiment, the training module is further configured to input the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features; multiply Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector; transpose Z'_v to obtain Z'_v^T; matrix-multiply Z'_v^T with Z_t and perform calculation through the softmax function to obtain A, wherein A is the attention weight; multiply A with Z_v to obtain AZ_v, and perform vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the first extraction module is further configured to perform feature extraction on the video data by using the 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and perform line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
It should be noted that the technical means, the technical problems solved, and the technical effects achieved in the embodiments of the apparatus for video text summarization based on a multi-modal model are the same as those in the embodiments of the method for video text summarization based on a multi-modal model, so the details are not repeated here.
Additionally, one embodiment of the present invention provides an apparatus comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
It should be noted that the apparatus in this embodiment may include a memory and a processor as in the embodiment shown in FIG. 1, and can form a part of the system architecture platform in the embodiment shown in FIG. 1; the two share the same inventive concept, and therefore the same implementation principle and beneficial effects, which are not described in detail here.
Non-transitory software programs and instructions required to implement the device-side video text summarization method based on a multi-modal model of the above-described embodiments are stored in a memory and, when executed by a processor, perform the video text summarization method based on a multi-modal model of the above-described embodiments, for example, performing the above-described method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, and method steps S610 to S650 in FIG. 6.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the above device-side video text summarization method based on a multi-modal model, for example, performing the above-described method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, and method steps S610 to S650 in FIG. 6.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A video text summarization method based on a multi-modal model, comprising the following steps:
performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
vectorizing the video features to obtain a video feature vector;
performing voice extraction on the video data to obtain monologue voice information;
converting the monologue voice information into text information through automatic speech recognition (ASR);
performing word segmentation on the text information to obtain a plurality of word information;
and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
2. The video text summarization method based on a multi-modal model according to claim 1, wherein the inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result comprises:
inputting the video feature vector and the plurality of word information into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
and transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
3. The video text summarization method based on a multi-modal model according to claim 2, wherein the inputting the video feature vector and the plurality of word information into the encoder of the Transformer model provided with a sub-layer for fusing image-class features and text-class features to perform fusion processing and obtain fused encoding information comprises:
inputting the plurality of word information into the encoder of the Transformer model, and extracting the plurality of word information sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
and inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain the fused encoding information.
4. The video text summarization method based on a multi-modal model according to claim 3, wherein the inputting the text feature vector and the video feature vector into a sub-layer for fusing image-class features and text-class features to obtain fused encoding information comprises:
inputting the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features;
multiplying Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
transposing Z'_v to obtain Z'_v^T;
matrix-multiplying Z'_v^T with Z_t and performing calculation through a softmax function to obtain A, wherein A is the attention weight;
and multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
5. The video text summarization method based on a multi-modal model according to claim 1, wherein the performing feature extraction on the video data to obtain the video feature vector comprises:
performing feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
6. The video text summarization method based on a multi-modal model according to claim 1, wherein the performing word segmentation on the text information to obtain a plurality of word information comprises:
performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
7. The video text summarization method based on a multi-modal model according to claim 6, wherein the performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information comprises:
performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and performing line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
8. A video text summarization apparatus based on a multi-modal model, comprising:
a first extraction module, used for performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
a vectorization module, used for vectorizing the video features to obtain a video feature vector;
a second extraction module, used for performing voice extraction on the video data to obtain monologue voice information;
a conversion module, used for converting the monologue voice information into text information through automatic speech recognition (ASR);
a word segmentation module, used for performing word segmentation on the text information to obtain a plurality of word information;
and a training module, used for inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
9. A device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video text summarization method based on a multi-modal model according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing computer-executable instructions for performing the video text summarization method based on a multi-modal model according to any one of claims 1 to 7.
CN202210056075.6A 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model Withdrawn CN114398889A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model
PCT/CN2022/090712 WO2023137913A1 (en) 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model

Publications (1)

Publication Number Publication Date
CN114398889A (en) 2022-04-26

Family

ID=81231310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210056075.6A Withdrawn CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model

Country Status (2)

Country Link
CN (1) CN114398889A (en)
WO (1) WO2023137913A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474094B (en) * 2023-12-22 2024-04-09 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052149B (en) * 2021-05-20 2021-08-13 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113609285B (en) * 2021-08-09 2024-05-14 福州大学 Multimode text abstract system based on dependency gating fusion mechanism
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN111767461A (en) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 Data processing method and device
CN112417134A (en) * 2020-10-30 2021-02-26 同济大学 Automatic abstract generation system and method based on voice text deep fusion features
CN113159010A (en) * 2021-03-05 2021-07-23 北京百度网讯科技有限公司 Video classification method, device, equipment and storage medium
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIEZHENG YU et al.: "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization", https://doi.org/10.48550/arXiv.2109.02401

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023137913A1 (en) * 2022-01-18 2023-07-27 平安科技(深圳)有限公司 Video text summarization method based on multi-modal model, device and storage medium
CN115544244A (en) * 2022-09-06 2022-12-30 内蒙古工业大学 Cross fusion and reconstruction-based multi-mode generative abstract acquisition method
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN117094367A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117521017A (en) * 2024-01-03 2024-02-06 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics
CN117521017B (en) * 2024-01-03 2024-04-05 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics

Also Published As

Publication number Publication date
WO2023137913A1 (en) 2023-07-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220426)