CN114398889A - Video text summarization method, device and storage medium based on multi-modal model - Google Patents

Video text summarization method, device and storage medium based on multi-modal model

Info

Publication number
CN114398889A
Authority
CN
China
Prior art keywords
text
video
information
features
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210056075.6A
Other languages
Chinese (zh)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210056075.6A priority Critical patent/CN114398889A/en
Publication of CN114398889A publication Critical patent/CN114398889A/en
Priority to PCT/CN2022/090712 priority patent/WO2023137913A1/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology. Embodiments of the invention provide a video text summarization method, device and storage medium based on a multi-modal model, wherein the method comprises the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model. This can improve the accuracy of the text summary content extracted from the video.

Description

Video text summarization method, device and storage medium based on multi-modal model
Technical Field
Embodiments of the invention relate to the field of artificial intelligence, and in particular to a video text summarization method, device and storage medium based on a multi-modal model.
Background
At present, most industrial approaches to intelligent video summarization are text-extraction methods. Because a video contains a speech monologue, Automatic Speech Recognition (ASR) is usually used to convert the speech into text, and natural language processing techniques are then applied, for example, to compute the importance of each sentence and extract the sentences with the highest importance scores.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The main object of the embodiments of the invention is to provide a video text summarization method based on a multi-modal model, which can effectively improve the accuracy of the text summary content extracted from a video.
In a first aspect, an embodiment of the present invention provides a video text summarization method based on a multi-modal model, including:
performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
vectorizing the video features to obtain a video feature vector;
performing voice extraction on the video data to obtain monologue voice information;
converting the monologue voice information into text information through automatic speech recognition (ASR);
performing word segmentation on the text information to obtain a plurality of word information;
and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the inputting the video feature vector and the dictionary into a Transformer model for training to obtain a text summarization result includes:
inputting the video feature vector and the dictionary into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
and transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the inputting the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing and obtain fused encoding information includes:
inputting the dictionary into the encoder of the Transformer model, and extracting the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
and inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain the fused encoding information.
In an embodiment, the inputting the text feature vector and the video feature vector into a sub-layer for fusing image features and text features to obtain fused encoding information includes:
inputting the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features;
multiplying Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
transposing Z'_v to obtain Z'_v^T;
matrix-multiplying Z'_v^T with Z_t and performing calculation through the softmax function to obtain A, wherein A is the attention weight and Z_t is the text feature vector;
and multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the performing feature extraction on the video data to obtain the video feature vector includes:
performing feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the performing word segmentation on the text information to obtain a plurality of word information includes:
performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information includes:
performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and performing line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
In a second aspect, an embodiment of the present invention provides a video text summarization apparatus based on a multi-modal model, including:
a first extraction module, used for performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
a vectorization module, used for vectorizing the video features to obtain a video feature vector;
a second extraction module, used for performing voice extraction on the video data to obtain monologue voice information;
a conversion module, used for converting the monologue voice information into text information through automatic speech recognition (ASR);
a word segmentation module, used for performing word segmentation on the text information to obtain a plurality of word information;
and a training module, used for inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing, so as to obtain fused encoding information; and transmit the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the Transformer model, and extract the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm; and input the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain fused encoding information.
In an embodiment, the training module is further configured to input the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features; multiply Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector; transpose Z'_v to obtain Z'_v^T; matrix-multiply Z'_v^T with Z_t and perform calculation through the softmax function to obtain A, wherein A is the attention weight; multiply A with Z_v to obtain AZ_v, and perform vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the first extraction module is further configured to perform feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and perform line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
In a third aspect, an embodiment of the present invention provides a device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video text summarization method based on a multi-modal model according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions for performing the video text summarization method based on a multi-modal model according to the first aspect.
The embodiments of the invention include the following: a video text summarization method based on a multi-modal model, comprising the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm. In the technical solution of this embodiment, a sub-layer for fusing image features and text features is newly provided in the encoder of the Transformer model, so that the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate; that is, the accuracy of the text summary content extracted from the video can be effectively improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
FIG. 1 is a schematic diagram of a system architecture platform for performing a multimodal model-based video text summarization method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for summarizing text in a video based on a multimodal model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the improved Transformer model in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a text summarization result in a multi-modal model-based video text summarization method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of generating fusion coded information in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 6 is a flowchart of the fusion processing in the newly added sub-layer of the encoder in a video text summarization method based on a multi-modal model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the newly added sub-layer of the encoder in a video text summarization method based on a multi-modal model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
An embodiment of the present invention provides a video text summarization method based on a multi-modal model, including the following steps: performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted; vectorizing the video features to obtain a video feature vector; performing voice extraction on the video data to obtain monologue voice information; converting the monologue voice information into text information through automatic speech recognition (ASR); performing word segmentation on the text information to obtain a plurality of word information; and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm. In the technical solution of this embodiment, a sub-layer for fusing image features and text features is newly arranged in the encoder of the Transformer model, so that the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in FIG. 1, FIG. 1 is a schematic diagram of a system architecture platform 100 for executing the video text summarization method based on a multi-modal model according to an embodiment of the present invention.
In the example of fig. 1, the system architecture platform 100 is provided with a processor 110 and a memory 120, wherein the processor 110 and the memory 120 may be connected by a bus or other means, and fig. 1 illustrates the connection by the bus as an example.
The memory 120, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 120 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to the system architecture platform via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood by those skilled in the art that the system architecture platform may be applied to a 5G communication network system, a mobile communication network system evolved later, and the like, and the embodiment is not limited thereto.
Those skilled in the art will appreciate that the system architecture platform illustrated in FIG. 1 does not constitute a limitation on embodiments of the invention, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The system architecture platform 100 may be an independent system architecture platform, or may be a cloud system architecture platform that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms.
Based on the system architecture platform, the following provides various embodiments of the video text summarization method based on the multi-modal model.
As shown in FIG. 2, FIG. 2 is a flowchart of a video text summarization method based on a multi-modal model according to an embodiment of the present invention. The method is applied to the above-mentioned system architecture platform and includes, but is not limited to, step S100, step S200, step S300, step S400, step S500, and step S600.
Step S100, performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
Step S200, vectorizing the video features to obtain a video feature vector;
Step S300, performing voice extraction on the video data to obtain monologue voice information;
Step S400, converting the monologue voice information into text information through automatic speech recognition (ASR);
Step S500, performing word segmentation on the text information to obtain a plurality of word information;
Step S600, inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In one embodiment, video data from which a text summary needs to be extracted is obtained. The video data includes video image information and voice information, and the voice information includes monologue voice information and background voice information. Feature extraction is then performed on the video data to obtain video features, and the video features are vectorized to obtain a video feature vector, in preparation for the subsequent training step. Meanwhile, voice extraction can be performed on the video data to obtain the monologue voice information, the monologue voice information is converted into text information through automatic speech recognition (ASR), and word segmentation is then performed on the text information to obtain a plurality of word information. The processed video feature vector and the plurality of word information are then input into the Transformer model for training, and a text summarization result is obtained after the training processing of the sub-layer that fuses image features and text features. Because the sub-layer for fusing image-class features and text-class features, comprising Text-vision fusion and Add & Norm, is newly arranged in the encoder of the Transformer model, the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved. It should be noted that a 3D ResNet-101 model may be used to perform feature extraction on the video data to obtain the video feature vector, or other models may be used to extract the video features in the video data, which is not limited in this embodiment.
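For orientation, the following is a minimal Python sketch of the end-to-end flow just described. It is illustrative only: the helper callables (extract_frames, backbone, asr_transcribe, segment, model) are hypothetical placeholders standing in for the frame loader, the 3D convolutional feature extractor, the ASR engine, the word segmenter, and the modified Transformer, none of which are named as concrete software in this disclosure.

    # Illustrative sketch; every helper passed in is a hypothetical placeholder.
    def summarize_video(video_path, extract_frames, backbone, asr_transcribe, segment, model):
        frames = extract_frames(video_path)   # step S100: load clip tensor(s) from the video
        z_v = backbone(frames)                # steps S100/S200: video feature vectors, e.g. via 3D ResNet-101
        text = asr_transcribe(video_path)     # steps S300/S400: monologue speech -> text via ASR
        words = segment(text)                 # step S500: word segmentation -> list of word information
        return model(z_v, words)              # step S600: modified Transformer -> text summary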
It can be understood that the step of performing word segmentation on the text information to obtain a plurality of word information may be: performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information; or performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line, and then performing line-splitting on the word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure, which is not specifically limited in this embodiment. It can be understood that the dictionary in this embodiment includes a plurality of word information, each word information occupies its own line in the dictionary, and each word information corresponds to a line position number.
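As an illustration of the segment-then-line-split step, the sketch below uses the open-source jieba segmenter; the machine translation of this disclosure does not clearly preserve the name of the segmentation tool, so jieba is an assumed stand-in, and any Chinese word segmenter yields the same one-word-per-line dictionary structure with line position numbers.

    import jieba  # assumed stand-in for the word segmentation tool named in the original

    text = "今天天气很好我们去公园散步"
    words = list(jieba.cut(text))                      # word information arranged in the same line
    dictionary = {i: w for i, w in enumerate(words)}   # line-split: one word per line position number

    for line_no, word in dictionary.items():
        print(line_no, word)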
It should be noted that, referring to FIG. 3, in this embodiment a sub-layer is added to the Encoder Layer of the conventional Transformer model; this sub-layer is used for fusing image-class features and text-class features and comprises Text-vision fusion and Add & Norm. The improved Transformer model thus includes an encoder (Encoder) and a decoder (Decoder). Each Encoder Layer in the encoder includes three sub-layers: a first sub-layer (multi-head self-attention and Add & Norm), a second sub-layer (FFN and Add & Norm), and a third sub-layer (Text-vision fusion and Add & Norm). Each Decoder Layer in the decoder includes three sub-layers: a fourth sub-layer (masked multi-head self-attention and Add & Norm), a fifth sub-layer (multi-head encoder-decoder attention and Add & Norm), and a sixth sub-layer (FFN and Add & Norm).
It should be noted that Text Inputs is the segmented text (the dictionary) fed to the model. In the conventional Transformer model, each word corresponds to three embeddings: token embedding, which maps the word into a geometric vector space; segment embedding, which marks the segment the word belongs to; and position embedding, which encodes the word's position in the sequence. In this embodiment, only token embedding and segment embedding are used. The token embedding is obtained as follows: the text inputs are multiplied by a weight matrix of size N × 512 to obtain a token embedding of vector length 512, where the text inputs are vectors whose dimension is the number of dictionary lines N, each word in the text inputs corresponds to one line position in the dictionary, that position is marked 1, and the remaining positions are marked 0.
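The token-embedding computation described above reduces to a one-hot lookup into the N × 512 weight matrix. A minimal numpy sketch, where the dictionary size N = 5 and the random weights are placeholder assumptions:

    import numpy as np

    N, d = 5, 512                        # N = number of dictionary lines, d = 512 per the description
    W = np.random.randn(N, d)            # the N x 512 weight matrix (random placeholder)

    def token_embedding(line_pos):
        one_hot = np.zeros(N)            # text input: vector whose dimension is the dictionary size N
        one_hot[line_pos] = 1.0          # the word's line position is marked 1, the rest 0
        return one_hot @ W               # multiplication yields the 512-dimensional token embedding

    emb = token_embedding(2)             # embedding for the word on dictionary line 2
    assert emb.shape == (d,)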
Referring to FIG. 4, in an embodiment, step S600 includes, but is not limited to, step S410 and step S420.
Step S410, inputting the video feature vector and the dictionary into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
Step S420, transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
Specifically, the video feature vector and the dictionary may be respectively input into the encoder of the Transformer model, which is provided with a sub-layer for fusing image features and text features, to perform fusion processing and obtain fused encoding information; the fused encoding information is then transmitted to the decoder for decoding, generating the text summarization result. Because a sub-layer for fusing image features and text features is newly arranged in the encoder of the Transformer model, rather than training on text feature vectors alone, the text summarization result obtained by training the video feature vector and the dictionary in the Transformer model is more accurate, and the accuracy of the text summary content extracted from the video can be effectively improved.
Referring to FIG. 5, in one embodiment, step S410 includes, but is not limited to, step S510 and step S520.
Step S510, inputting the dictionary into the encoder of the Transformer model, and extracting the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
step S520, inputting the text feature vector and the video feature vector into a sublayer for fusing the image feature and the text feature to obtain fused coding information.
Specifically, the dictionary may be input into the encoder of the Transformer model for processing: the dictionary is extracted sequentially through the first sub-layer (multi-head self-attention and Add & Norm) and the second sub-layer (FFN and Add & Norm) in the encoder to obtain a text feature vector, and the text feature vector and the video feature vector are then input into the newly added sub-layer for image-feature and text-feature fusion processing to obtain fused encoding information.
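A compact PyTorch sketch of such a modified encoder layer follows. The hyperparameters (d_model = 512, 8 heads, a 4× FFN) and the residual connection around the fusion block are assumptions implied by "Add & Norm"; the fusion callable itself is the Text-vision fusion defined by the formulas given further below.

    import torch
    import torch.nn as nn

    class EncoderLayerWithFusion(nn.Module):
        """Sketch of the modified encoder layer: (1) multi-head self-attention + Add & Norm,
        (2) FFN + Add & Norm, (3) the newly added Text-vision fusion + Add & Norm."""
        def __init__(self, fusion, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norm2 = nn.LayerNorm(d_model)
            self.fusion = fusion   # callable (z_t, z_v) -> fused encoding, see formulas below
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, z_t, z_v):
            a, _ = self.attn(z_t, z_t, z_t)                 # first sub-layer: self-attention
            z_t = self.norm1(z_t + a)                       # Add & Norm
            z_t = self.norm2(z_t + self.ffn(z_t))           # second sub-layer: FFN + Add & Norm
            return self.norm3(z_t + self.fusion(z_t, z_v))  # third sub-layer: fusion + Add & Norm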
Referring to FIG. 6, in an embodiment, step S520 includes, but is not limited to, step S610, step S620, step S630, step S640, and step S650.
Step S610, inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features;
Step S620, multiplying Z_v by the weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
Step S630, transposing Z'_v to obtain Z'_v^T;
Step S640, matrix-multiplying Z'_v^T with Z_t and performing calculation through the softmax function to obtain A, wherein A is the attention weight;
Step S650, multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
Specifically, the text feature vector and the video feature vector are input into the sub-layer for fusing image features and text features. Z_v is first multiplied by the weight matrix to obtain Z'_v, Z'_v is transposed to obtain Z'_v^T, Z'_v^T is matrix-multiplied with Z_t and passed through the softmax function to obtain A, A is multiplied with Z_v to obtain AZ_v, and AZ_v is concatenated with Z_t to obtain Z'_t. Here Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight, and Z'_t is the fused encoding information.
In an embodiment, referring to FIG. 7, based on the structure of Text-vision fusion in the sub-layer added to the encoder of the Transformer model for fusing image-class features and text-class features, the sub-layer can be expressed by the following formulas:

Z'_v = Z_v · W_1

A = softmax(Z'_v^T · Z_t)

Z'_t = Concat(Z_t, A · Z_v) · W_2

wherein W_1 and W_2 are weight matrices, Z_v is the video feature vector, Z_t is the text feature vector, A is the attention weight, and Z'_t is the fused encoding information.
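Read with rows as tokens (Z_t of shape (n_t, d) for text and Z_v of shape (n_v, d) for video, an assumed convention chosen so the dimensions are consistent), the three formulas translate into the PyTorch sketch below; under this convention the patent's Z'_v^T · Z_t is computed as Z_t · (Z_v W_1)^T.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextVisionFusion(nn.Module):
        """Z'_v = Z_v W_1;  A = softmax(Z'_v^T Z_t);  Z'_t = Concat(Z_t, A Z_v) W_2."""
        def __init__(self, d=512):
            super().__init__()
            self.W1 = nn.Linear(d, d, bias=False)       # weight matrix W_1
            self.W2 = nn.Linear(2 * d, d, bias=False)   # weight matrix W_2

        def forward(self, z_t, z_v):
            scores = z_t @ self.W1(z_v).transpose(-2, -1)   # (n_t, n_v) similarity scores
            a = F.softmax(scores, dim=-1)                   # attention weight A
            az_v = a @ z_v                                  # A Z_v: video context per text token
            return self.W2(torch.cat([z_t, az_v], dim=-1))  # Z'_t: fused encoding information

    z_t, z_v = torch.randn(10, 512), torch.randn(6, 512)    # 10 text tokens, 6 video features
    print(TextVisionFusion()(z_t, z_v).shape)               # torch.Size([10, 512])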
Because image-class features are introduced, the method can achieve a better effect in automatic text summarization, for example higher scores under evaluation metrics such as ROUGE-1, ROUGE-2 and ROUGE-L.
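For reference, ROUGE scores of a generated summary against a reference summary can be computed with the open-source rouge-score package; this tooling choice is an assumption for illustration, as the patent does not name an evaluation implementation.

    # pip install rouge-score   (assumed evaluation tooling, not specified in this disclosure)
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "the model fuses video and text features to produce a summary"
    generated = "the model fuses video features and text to generate a summary"
    scores = scorer.score(reference, generated)
    print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)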
Based on the above video text summarization method based on a multi-modal model, various embodiments of the video text summarization apparatus, the device, and the computer-readable storage medium of the present invention are respectively proposed below.
An embodiment of the present invention further provides a video text summarization apparatus based on a multi-modal model, including:
the first extraction module is used for extracting the characteristics of video data to obtain video characteristics, wherein the video data is the video data needing to extract the text abstract;
the vectorization module is used for vectorizing the video features to obtain video feature vectors;
the second extraction module is used for carrying out voice extraction on the video data to obtain monologue voice information;
the conversion module is used for converting the uniwhite voice information into text information through an automatic voice recognition (ASR) technology;
the word segmentation module is used for carrying out word segmentation processing on the text information to obtain a plurality of word information;
and the training module is used for inputting the video feature vector and the word information into a transform model for training to obtain a Text summarization result, a sub-layer for fusing image features and Text features is arranged in an encoder of the transform model, and the sub-layer comprises Text-vision fusion and Add & Norm.
In an embodiment, the training module is further configured to input the video feature vector and the dictionary into the encoder of the Transformer model provided with a sub-layer for fusing image features and text features to perform fusion processing, so as to obtain fused encoding information; and transmit the fused encoding information to a decoder for decoding to generate a text summarization result.
In an embodiment, the training module is further configured to input the dictionary into the encoder of the Transformer model, and extract the dictionary sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm; and input the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain fused encoding information.
In an embodiment, the training module is further configured to input the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features; multiply Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector; transpose Z'_v to obtain Z'_v^T; matrix-multiply Z'_v^T with Z_t and perform calculation through the softmax function to obtain A, wherein A is the attention weight; multiply A with Z_v to obtain AZ_v, and perform vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
In an embodiment, the first extraction module is further configured to perform feature extraction on the video data by using the 3D ResNet-101 model to obtain the video feature vector.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
In an embodiment, the word segmentation module is further configured to perform word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and perform line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
It should be noted that the technical means, the technical problems solved, and the technical effects achieved in the embodiments of the apparatus for video text summarization based on a multi-modal model are the same as those in the embodiments of the method for video text summarization based on a multi-modal model, so the details are not repeated here.
Additionally, one embodiment of the present invention provides an apparatus comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
It should be noted that the apparatus in this embodiment may include a memory and a processor as in the embodiment shown in FIG. 1, and can form a part of the system architecture platform in the embodiment shown in FIG. 1; the two share the same inventive concept, and therefore the same implementation principle and beneficial effects, which are not described in detail here.
Non-transitory software programs and instructions required to implement the device-side video text summarization method based on a multi-modal model of the above-described embodiments are stored in a memory and, when executed by a processor, perform the video text summarization method based on a multi-modal model of the above-described embodiments, for example, performing the above-described method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, and method steps S610 to S650 in FIG. 6.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the above device-side video text summarization method based on a multi-modal model, for example, performing the above-described method steps S100 to S600 in FIG. 2, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, and method steps S610 to S650 in FIG. 6.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A video text summarization method based on a multi-modal model, comprising the following steps:
performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
vectorizing the video features to obtain a video feature vector;
performing voice extraction on the video data to obtain monologue voice information;
converting the monologue voice information into text information through automatic speech recognition (ASR);
performing word segmentation on the text information to obtain a plurality of word information;
and inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
2. The video text summarization method based on a multi-modal model according to claim 1, wherein the inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result comprises:
inputting the video feature vector and the plurality of word information into an encoder of the Transformer model, the encoder being provided with a sub-layer for fusing image features and text features, and performing fusion processing to obtain fused encoding information;
and transmitting the fused encoding information to a decoder for decoding to generate a text summarization result.
3. The video text summarization method based on a multi-modal model according to claim 2, wherein the inputting the video feature vector and the plurality of word information into the encoder of the Transformer model provided with a sub-layer for fusing image-class features and text-class features to perform fusion processing and obtain fused encoding information comprises:
inputting the plurality of word information into the encoder of the Transformer model, and extracting the plurality of word information sequentially through a first sub-layer and a second sub-layer in the encoder to obtain a text feature vector, wherein the first sub-layer comprises multi-head self-attention and Add & Norm, and the second sub-layer comprises FFN and Add & Norm;
and inputting the text feature vector and the video feature vector into the sub-layer for fusing image features and text features to obtain the fused encoding information.
4. The video text summarization method based on a multi-modal model according to claim 3, wherein the inputting the text feature vector and the video feature vector into a sub-layer for fusing image-class features and text-class features to obtain fused encoding information comprises:
inputting the text feature vector and the video feature vector into the sub-layer for fusing image-class features and text-class features;
multiplying Z_v by a weight matrix to obtain Z'_v, wherein Z_v is the video feature vector;
transposing Z'_v to obtain Z'_v^T;
matrix-multiplying Z'_v^T with Z_t and performing calculation through a softmax function to obtain A, wherein A is the attention weight;
and multiplying A with Z_v to obtain AZ_v, and performing vector concatenation of AZ_v and Z_t to obtain Z'_t, wherein Z'_t is the fused encoding information.
5. The video text summarization method based on a multi-modal model according to claim 1, wherein the performing feature extraction on the video data to obtain the video feature vector comprises:
performing feature extraction on the video data by using a 3D ResNet-101 model to obtain the video feature vector.
6. The video text summarization method based on a multi-modal model according to claim 1, wherein the performing word segmentation on the text information to obtain a plurality of word information comprises:
performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information.
7. The video text summarization method based on a multi-modal model according to claim 6, wherein the performing word segmentation on the text information by using a word segmentation tool to obtain a plurality of word information comprises:
performing word segmentation on the text information by using the word segmentation tool to obtain a plurality of word information arranged in the same line;
and performing line-splitting on the plurality of word information to obtain a plurality of word information arranged in a one-word-per-line dictionary structure.
8. A video text summarization apparatus based on a multi-modal model, comprising:
a first extraction module, used for performing feature extraction on video data to obtain video features, wherein the video data is video data from which a text summary needs to be extracted;
a vectorization module, used for vectorizing the video features to obtain a video feature vector;
a second extraction module, used for performing voice extraction on the video data to obtain monologue voice information;
a conversion module, used for converting the monologue voice information into text information through automatic speech recognition (ASR);
a word segmentation module, used for performing word segmentation on the text information to obtain a plurality of word information;
and a training module, used for inputting the video feature vector and the plurality of word information into a Transformer model for training to obtain a text summarization result, wherein a sub-layer for fusing image features and text features is arranged in an encoder of the Transformer model, and the sub-layer comprises Text-vision fusion and Add & Norm.
9. A device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video text summarization method based on a multi-modal model according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing computer-executable instructions for performing the video text summarization method based on a multi-modal model according to any one of claims 1 to 7.
CN202210056075.6A 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model Withdrawn CN114398889A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model
PCT/CN2022/090712 WO2023137913A1 (en) 2022-01-18 2022-04-29 Video text summarization method based on multi-modal model, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210056075.6A CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model

Publications (1)

Publication Number Publication Date
CN114398889A (en) 2022-04-26

Family

ID=81231310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210056075.6A Withdrawn CN114398889A (en) 2022-01-18 2022-01-18 Video text summarization method, device and storage medium based on multi-modal model

Country Status (2)

Country Link
CN (1) CN114398889A (en)
WO (1) WO2023137913A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474094B (en) * 2023-12-22 2024-04-09 云南师范大学 Knowledge tracking method based on fusion domain features of Transformer


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052149B (en) * 2021-05-20 2021-08-13 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113609285B (en) * 2021-08-09 2024-05-14 福州大学 Multimode text abstract system based on dependency gating fusion mechanism
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN111767461A (en) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 Data processing method and device
CN112417134A (en) * 2020-10-30 2021-02-26 同济大学 Automatic abstract generation system and method based on voice text deep fusion features
CN113159010A (en) * 2021-03-05 2021-07-23 北京百度网讯科技有限公司 Video classification method, device, equipment and storage medium
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIEZHENG YU et al.: "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization", https://doi.org/10.48550/arXiv.2109.02401

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023137913A1 (en) * 2022-01-18 2023-07-27 平安科技(深圳)有限公司 Video text summarization method based on multi-modal model, device and storage medium
CN115544244A (en) * 2022-09-06 2022-12-30 内蒙古工业大学 Cross fusion and reconstruction-based multi-mode generative abstract acquisition method
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN117094367A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117521017A (en) * 2024-01-03 2024-02-06 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics
CN117521017B (en) * 2024-01-03 2024-04-05 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics

Also Published As

Publication number Publication date
WO2023137913A1 (en) 2023-07-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220426)