CN117688943A - Audio and video title generation method, device, equipment and storage medium


Info

Publication number: CN117688943A
Authority: CN (China)
Prior art keywords: title, audio, sample, video, generation model
Legal status: Pending
Application number: CN202311432807.8A
Other languages: Chinese (zh)
Inventor: Chen Chunquan (陈春全)
Assignee (current and original): Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses an audio and video title generation method, apparatus, device and storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: performing first-stage pre-training on a text generation model based on a text corpus; performing second-stage pre-training on the first-stage pre-trained text generation model based on a title corpus to obtain a title generation model; inputting sample audio and video text information of sample audio and video content into the title generation model, and outputting a first sample title corresponding to the sample audio and video content through the title generation model; performing model fine-tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio and video title generation model; and inputting the audio and video text information of target audio and video content into the audio and video title generation model, and outputting the target title corresponding to the target audio and video content through the audio and video title generation model. The scheme optimizes the generation efficiency of audio and video titles and improves their quality.

Description

Audio and video title generation method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, and in particular to an audio and video title generation method, apparatus, device and storage medium.
Background
When a video creator uploads an edited video to a content sharing platform, a corresponding video title is generally added to the video in order to improve its exposure and view count.
In the related art, to reduce the creator's workload and improve the efficiency of editing video titles, template- and rule-based methods are adopted: a video title matching the current video is selected, according to the video content, from a large number of predefined and maintained title templates.
However, videos produced by creators often have rich and diverse content, while rule-generated title templates have a fixed style and lack flexibility. This easily leads to repetitive and formulaic video titles, greatly reducing their creativity and uniqueness.
Disclosure of Invention
The embodiment of the application provides an audio and video title generation method, apparatus, device and storage medium, which can optimize the generation efficiency of audio and video titles and improve their quality. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a method for generating an audio/video title, where the method includes:
performing first-stage pre-training on a text generation model based on a text corpus;
performing second-stage pre-training on the first-stage pre-trained text generation model based on a title corpus to obtain a title generation model;
inputting sample audio and video text information of sample audio and video contents into the title generation model, and outputting a first sample title corresponding to the sample audio and video contents through the title generation model, wherein the sample audio and video text information comprises at least one of sample descriptive text information, sample audio identification text information and sample image identification text information;
performing model fine-tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio/video title generation model;
and inputting the audio and video text information of the target audio and video content into the audio and video title generation model, and outputting a target title corresponding to the target audio and video content through the audio and video title generation model.
On the other hand, an embodiment of the present application provides an audio/video title generating device, where the device includes:
The first training module is used for carrying out first-stage pre-training on the text generation model based on the text corpus;
the second training module is used for performing second-stage pre-training on the first-stage pre-trained text generation model based on the title corpus to obtain a title generation model;
the first output module is used for inputting sample audio and video text information of sample audio and video contents into the title generation model, outputting a first sample title corresponding to the sample audio and video contents through the title generation model, wherein the sample audio and video text information comprises at least one of sample description text information, sample audio identification text information and sample image identification text information;
the fine-tuning module is used for performing model fine-tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio and video title generation model;
and the second output module is used for inputting the audio and video text information of the target audio and video content into the audio and video title generation model, and outputting the target title corresponding to the target audio and video content through the audio and video title generation model.
In another aspect, embodiments of the present application provide a computer device comprising a processor and a memory; the memory stores at least one computer instruction for execution by the processor to implement the audio video title generation method as described in the above aspects.
In another aspect, embodiments of the present application provide a computer readable storage medium having stored therein at least one computer instruction that is loaded and executed by a processor to implement the method for generating an audio video title as described in the above aspect.
In another aspect, embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform the audio video title generation method as described in the above aspect.
In the embodiment of the application, first-stage pre-training is performed on a text generation model using a text corpus, so that the text generation model learns the basic grammar, syntax and semantic knowledge of natural language. Second-stage pre-training is then performed on the first-stage pre-trained text generation model using a title corpus to obtain a title generation model, so that the title generation model learns the language characteristics, styles and common expressions of audio and video titles. After the two-stage pre-training is completed, the sample audio and video text information of the sample audio and video content is input into the title generation model, and a first sample title is output through the title generation model; the title generation model is then fine-tuned according to the title prediction loss between the first sample title and the title true value to obtain the audio and video title generation model, so that the audio and video title generation model can better understand audio and video content and its generalization capability is improved. Finally, the audio and video text information of the target audio and video content is input into the audio and video title generation model to obtain the target title corresponding to the target audio and video content.
By adopting the scheme provided by the embodiment of the application, the title generation model is obtained through two-stage pre-training, and the title generation model is then fine-tuned using the sample audio and video text information of the sample audio and video content to obtain the audio and video title generation model. The audio and video title generation model can thus better understand the semantic information of audio and video content, which improves the accuracy and diversity of the generated titles, optimizes their generation efficiency, and improves title quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
fig. 2 is a flowchart of an audio/video title generation method according to an exemplary embodiment of the present application;
FIG. 3 illustrates a schematic diagram of a first level of pre-training of a text generation model provided in an exemplary embodiment of the present application;
fig. 4 is a flowchart of an audio/video title generation method according to another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the structure of model tuning of a title generation model according to an exemplary embodiment of the present application;
fig. 6 is a flowchart of an audio/video title generation method according to another exemplary embodiment of the present application;
FIG. 7 illustrates a flowchart of title generation model training provided by an exemplary embodiment of the present application;
fig. 8 is a flowchart illustrating an audio/video title generation method according to still another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram illustrating the structure of an output target title using an audio/video title generation model according to an exemplary embodiment of the present application;
fig. 10 is a block diagram showing the structure of an audio/video title generating apparatus according to an exemplary embodiment of the present application;
fig. 11 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to the technology of artificial intelligence such as machine learning, and the like, and is specifically described through the following embodiment.
Referring to fig. 1, a schematic diagram of an implementation environment provided in one embodiment of the present application is shown. The implementation environment includes a terminal 120 and a server 140. The data communication between the terminal 120 and the server 140 is performed through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 120 is a computer device in which an application program having an audio/video title generation function is installed. The audio/video title generation function may be a function of an original application in the terminal 120, or a function of a third party application; the terminal 120 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a wearable device, a vehicle-mounted terminal, or the like, and in fig. 1, the terminal 120 is taken as an example of a smart phone, but the present invention is not limited thereto.
The server 140 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like. In the embodiment of the present application, the server 140 may be a background server of an application having an audio/video title generation function.
In one possible implementation, as shown in fig. 1, there is data interaction between the server 140 and the terminal 120. The terminal 120 sends the text corpus and the title corpus to the server 140; the server 140 performs first-stage pre-training on the text generation model according to the text corpus, and performs second-stage pre-training on the first-stage pre-trained text generation model according to the title corpus, obtaining a title generation model after two-stage pre-training. Further, the terminal 120 sends sample audio and video text information of sample audio and video content to the server 140; the server 140 outputs a first sample title corresponding to the sample audio and video content through the title generation model, and performs model fine-tuning on the title generation model according to the title prediction loss between the first sample title and the title true value, obtaining an audio and video title generation model. In the model application stage, the terminal 120 sends the audio and video text information of the target audio and video content to the server 140; the server 140 inputs the audio and video text information into the audio and video title generation model, obtains the target title corresponding to the target audio and video content output by the model, and returns the target title to the terminal 120.
Referring to fig. 2, a flowchart of an audio/video title generating method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device, and the computer device may be the terminal 120 or the server 140 shown in fig. 1, and the method includes the following steps:
step 201, performing a first-stage pre-training on a text generation model based on a text corpus.
In some embodiments, to enable an audio video title generation model to have better language understanding capabilities, a computer device may first pre-train the text generation model with a large number of text corpora, thereby enabling the text generation model to learn basic grammatical, syntactic, and semantic knowledge of natural language.
Alternatively, the text corpus may be news articles, encyclopedia, web text, etc., such as news, novels, articles, conversations, chats, comments, critique, etc., which are not limited in this embodiment.
Optionally, in order to enable the text generation model to understand language expressions in different fields, the text corpus also needs to cover each field as much as possible, so that the text generation model has better generalization capability.
Optionally, after a large amount of general text corpus is obtained, the computer device may further perform preprocessing and data cleaning on the text corpus to improve its data quality. In one possible implementation, the computer device first cleans the large amount of collected text corpus, removing irrelevant content such as HTML tags, JavaScript code and special symbols that may exist in it, and retaining only the plain-text content. Furthermore, to avoid garbled characters during data processing, the computer device may also apply unified encoding to the text corpus, for example converting all text corpora to UTF-8.
In one possible implementation, to increase the training efficiency of the text generation model, the computer device may also delete low quality text in the text corpus, such as deleting text that contains too many errors, nonsensical, or duplicates. Furthermore, considering that the collection sources of the text corpus are wider, a large number of repeated text corpora may exist between different content sources, so that the computer equipment can also perform de-duplication processing on the text corpus, and redundant information in the text corpus is reduced by recognizing the text corpus and deleting the repeated or highly similar text corpus. For example, the computer device may employ a hash deduplication algorithm to perform text deduplication on the text corpus.
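For illustration only, the cleaning and hash-based deduplication described above might look like the following minimal Python sketch; the regular expressions and the choice of MD5 as the hash function are assumptions, not details of the application:

    import hashlib
    import re

    def clean_text(raw: str) -> str:
        # Strip script blocks and HTML tags, keep the plain text.
        text = re.sub(r"<script.*?</script>", " ", raw, flags=re.S | re.I)
        text = re.sub(r"<[^>]+>", " ", text)
        # Drop special symbols, collapse whitespace.
        text = re.sub(r"[^\w\s.,!?;:'\"-]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def deduplicate(corpora):
        # Hash deduplication: identical cleaned texts share one digest.
        seen, unique = set(), []
        for raw in corpora:
            cleaned = clean_text(raw)
            digest = hashlib.md5(cleaned.encode("utf-8")).hexdigest()  # unified UTF-8 encoding
            if digest not in seen:
                seen.add(digest)
                unique.append(cleaned)
        return unique

A fuzzier notion of "highly similar" corpora would call for near-duplicate detection (e.g. shingling), which this exact-match sketch deliberately omits.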
In some embodiments, the text generation model may adopt a Transformer model structure. The Transformer model is formed by stacking a plurality of identical layers, each layer comprising two sub-layers: a multi-head self-attention mechanism (Multi-head Self-Attention) and a feed-forward neural network (Feed-Forward Neural Network). Each sub-layer is followed by a residual connection (Residual Connection) and layer normalization (Layer Normalization). The multi-head self-attention mechanism calculates the degree of association between each token and the other tokens in the input sequence, thereby capturing long-distance dependencies in sentences, and allows the model to attend to information at different positions simultaneously. The feed-forward neural network is used to extract local features of the input sequence, and typically comprises two fully connected layers and an activation function.
In one possible implementation, the text generation model employs a unidirectional attention mechanism, i.e., left to right attention, such that each word in the text is focused on only the other words preceding the word and not on the words following the word in outputting the text.
Optionally, before inputting the text corpus into the text generation model, the computer device further needs to perform data conversion on the text corpus, and convert the text corpus into a data form that can be understood by the text generation model. In one possible implementation, the computer device may first add identifiers before and after each text corpus, such as "Bos" before each text corpus, to represent the beginning of the text corpus; "Eos" is added after each text corpus, indicating termination of the text corpus. Furthermore, the computer device also needs to perform word segmentation and indexing processing on each group of text corpus, and in order to enable the text generation model to learn the position information of each natural word in the text corpus, the computer device can also perform position coding on each word segmentation in the text corpus.
In one possible implementation, after word segmentation and encoding, a group of text corpus contains N tokens, including the starter Bos and the terminator Eos. The input of the text generation model is then the N-1 tokens excluding the terminator Eos, and the output is the N-1 tokens excluding the starter Bos.
In some embodiments, after inputting the real text sequence into the text generation model and outputting the predicted text sequence through the text generation model, the computer device may further obtain a prediction penalty by means of cross entropy penalty calculation from the real text sequence and the predicted text sequence, so as to perform a first level of pre-training on the text generation model with the prediction penalty.
Illustratively, as shown in fig. 3, a text corpus sample is 'This ancient building has a long culture and history'. The computer device converts it into the token sequence 'Bos', 'this (x0)', 'ancient building (x1)', 'has (x2)', 'long (x3)', 'culture (x4)', 'and (x5)', 'history (x6)', and takes this sequence as the input of the text generation model 310. The model predicts each token from the tokens preceding it: it predicts 'this (x0)' based on 'Bos'; predicts 'ancient building (x1)' based on 'Bos' and 'this (x0)'; predicts 'has (x2)' based on 'Bos' through 'ancient building (x1)'; and so on, until it predicts 'Eos' based on the full sequence ending with 'history (x6)'. The computer device then obtains the prediction loss from the real text sequence and the predicted text sequence through cross entropy loss calculation.
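A minimal PyTorch-style sketch of this first-stage next-token objective, assuming the corpus has already been indexed as described; the toy ids, vocabulary size and the stand-in model are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Indexed sequence 'Bos x0 x1 ... x6 Eos' for one corpus sample (toy ids).
    tokens = torch.tensor([[0, 11, 12, 13, 14, 15, 16, 17, 1]])

    inputs = tokens[:, :-1]    # the N-1 tokens excluding the terminator Eos
    targets = tokens[:, 1:]    # the N-1 tokens excluding the starter Bos

    vocab_size, dim = 1000, 64
    # Stand-in for the Transformer text generation model.
    model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

    logits = model(inputs)     # (batch, N-1, vocab): one next-token distribution per position
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()            # one first-stage pre-training step

With a unidirectional (left-to-right) attention mask, each position's prediction depends only on the tokens before it, matching the shift between inputs and targets above.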
Step 202, performing second-stage pre-training on the first-stage pre-trained text generation model based on the title corpus to obtain a title generation model.
In some embodiments, after the text generation model is first pre-trained with the text corpus so that it has good text understanding capability, the computer device further performs second-stage pre-training on the first-stage pre-trained text generation model with the title corpus, in order to improve the model's learning and understanding of title expressions, thereby obtaining the title generation model.
Optionally, to improve the richness of the title corpus, the title corpus may include titles from various fields, such as news titles, column titles, image-text titles, audio titles and video titles, which is not limited in the embodiment of the present application.
In one possible implementation, to improve data quality, after a large amount of title corpus is collected, the computer device further performs data processing on the title corpus, including data cleaning, unified encoding, removal of special characters, retention of plain text, deduplication and the like, to obtain the processed title corpus.
In one possible implementation, the computer device inputs the real title corpus, after word segmentation and indexing, into the first-stage pre-trained text generation model, and outputs a predicted title corpus through the text generation model, so that second-stage pre-training is performed on the first-stage pre-trained text generation model based on the prediction loss between the real title corpus and the predicted title corpus, thereby obtaining the title generation model.
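Continuing the sketch above, second-stage pre-training might simply resume from the first-stage weights with the same next-token objective on title data; the checkpoint path and title_corpus_loader are hypothetical:

    # Second stage reuses the first-stage weights and next-token objective;
    # only the data changes from general text to title corpora.
    model.load_state_dict(torch.load("stage1_checkpoint.pt"))  # hypothetical path
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for batch in title_corpus_loader:  # assumed DataLoader over indexed title sequences
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()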
Step 203, inputting sample audio/video text information of the sample audio/video content into a title generation model, and outputting a first sample title corresponding to the sample audio/video content through the title generation model, wherein the sample audio/video text information comprises at least one of sample description text information, sample audio identification text information and sample image identification text information.
In some embodiments, after the two-stage pre-training with the text corpus and the title corpus yields the title generation model, the computer device further performs model fine-tuning on the title generation model using audio and video text information, so that the title generation model can better understand audio and video related semantic information.
In one possible implementation, the computer device inputs sample audiovisual text information of the sample audiovisual content into the title generation model, such that a first sample title corresponding to the sample audiovisual content is output by the title generation model, wherein the sample audiovisual text information includes at least one of sample descriptive text information, sample audio identifying text information, and sample image identifying text information.
Optionally, the sample audio-video text information refers to various types of text information related to the sample audio-video content. The sample description text information refers to text information for describing sample audio-video contents, such as a text introduction of a creator to the sample audio-video contents, tag information added in the sample audio-video contents, and the like; the sample audio recognition text information refers to text information obtained by carrying out audio recognition on sample audio and video contents through an audio recognition technology; the sample image recognition text information refers to text information obtained by performing image recognition on sample audio and video contents through an image recognition technology.
And 204, performing model fine-tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
In some embodiments, after obtaining the first sample title corresponding to the sample audio/video content output by the title generation model, the computer device may determine the title prediction loss between the first sample title and the title true value, so as to fine-tune the title generation model according to the title prediction loss and obtain the audio/video title generation model.
In one possible implementation, after obtaining the first sample title, the computer device may calculate the title prediction loss between the first sample title and the title true value using a negative log likelihood function (Negative Log Likelihood Loss, NLL Loss) as the loss function.
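A sketch of this NLL computation, assuming the model yields per-position log-probabilities over the vocabulary (the tensor shapes are assumptions):

    import torch

    def title_nll_loss(log_probs: torch.Tensor, title_ids: torch.Tensor) -> torch.Tensor:
        # log_probs: (seq_len, vocab) predicted log-probabilities for each title position
        # title_ids: (seq_len,) indices of the true title tokens y_1..y_n
        picked = log_probs.gather(1, title_ids.unsqueeze(1)).squeeze(1)
        return -picked.sum()  # L0 = -sum_i log p(y_i)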
Step 205, inputting the audio and video text information of the target audio and video content into an audio and video title generation model, and outputting the target title corresponding to the target audio and video content through the audio and video title generation model.
In some embodiments, after the model training and fine tuning are performed to obtain the audio/video title generation model, the computer device may input the audio/video text information of the target audio/video content into the audio/video title generation model, so as to output the target title corresponding to the target audio/video content through the audio/video title generation model.
Optionally, the audio-visual text information of the target audio-visual content includes at least one of descriptive text information, audio identification text information, and image identification text information.
In summary, in the embodiment of the present application, first-stage pre-training is performed on a text generation model using a text corpus, so that the text generation model learns the basic grammar, syntax and semantic knowledge of natural language. Second-stage pre-training is then performed on the first-stage pre-trained text generation model using a title corpus to obtain a title generation model, so that the title generation model learns the language characteristics, styles and common expressions of audio and video titles. After the two-stage pre-training is completed, the sample audio and video text information of the sample audio and video content is input into the title generation model, and a first sample title is output through the title generation model; the title generation model is then fine-tuned according to the title prediction loss between the first sample title and the title true value to obtain the audio and video title generation model, so that the audio and video title generation model can better understand audio and video content and its generalization capability is improved. Finally, the audio and video text information of the target audio and video content is input into the audio and video title generation model to obtain the target title corresponding to the target audio and video content.
By adopting the scheme provided by the embodiment of the application, the title generation model is obtained through two-stage pre-training, and the title generation model is then fine-tuned using the sample audio and video text information of the sample audio and video content to obtain the audio and video title generation model. The audio and video title generation model can thus better understand the semantic information of audio and video content, which improves the accuracy and diversity of the generated titles, optimizes their generation efficiency, and improves title quality.
In some embodiments, to improve the understanding ability of the audio-video title generation model with respect to the sample audio-video text information, the computer device may further perform classification prediction on the sample audio-video content using the title generation model during its training, and perform model fine-tuning on the title generation model using the classification prediction loss.
Referring to fig. 4, a flowchart of an audio/video title generating method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device, and the computer device may be the terminal 120 or the server 140 shown in fig. 1, and the method includes the following steps:
Step 401, performing a first level of pre-training on a text generation model based on a text corpus.
And step 402, performing second-level pre-training on the text generation model subjected to the first-level pre-training based on the topic corpus to obtain a topic generation model.
Specific embodiments of steps 401 to 402 can refer to steps 201 to 202, and this embodiment is not described herein.
Step 403, inputting the sample audio/video text information of the sample audio/video content into the title generation model, and outputting the sample audio/video features through the Transformer layers of the title generation model.
In some embodiments, the title generation model includes Transformer layers comprising N Transformer blocks. In one possible implementation, the computer device inputs sample audio/video text information of the sample audio/video content into the title generation model and outputs the sample audio/video features through the Transformer layers of the title generation model.
Step 404, generating a first sample title corresponding to the sample audio-video content based on the sample audio-video feature.
In some embodiments, after obtaining the sample audio/video features output by the last Transformer layer in the title generation model, the computer device may generate the first sample title corresponding to the sample audio/video content according to the sample audio/video features.
In one possible implementation, after obtaining the sample audio/video features output by the Transformer layers of the title generation model, the computer device may decode the sample audio/video features, thereby obtaining the first sample title corresponding to the sample audio/video content from the decoding result.
And step 405, inputting the sample audio and video characteristics into a linear layer of the title generation model, and outputting sample classification corresponding to the sample audio and video contents through the linear layer.
In some embodiments, a linear layer is further included in the title generation model, and the linear layer is used for performing classification prediction on the sample audio-video content. Alternatively, multiple classifications are included in the linear layer, and the computer device can output probability distributions over each classification through the linear layer.
Optionally, in order to improve the classification prediction efficiency of the linear layer, the computer device further needs to perform corresponding data processing on the sample audio/video features before inputting the sample audio/video features into the linear layer. In one possible implementation manner, the computer device may perform pooling processing on the sample audio and video features, so as to obtain pooled sample audio and video features, and then the computer device inputs the pooled sample audio and video features into a linear layer of the title generation model, and outputs a sample classification corresponding to the sample audio and video content through the linear layer.
In one possible implementation, the computer device may pool the sample audio-video features in an average pooling manner.
Optionally, to further improve classification prediction efficiency, a plurality of linear layers may be set in the title generation model, with different linear layers corresponding to different classification scales. In one possible implementation, the computer device may input the pooled sample audio and video features into at least one linear layer of the title generation model, so that at least one sample classification corresponding to the sample audio and video content is output through the at least one linear layer, where the sample classifications output by the respective linear layers correspond to different classification scales.
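A sketch of the average pooling and two classification heads of different scales; the feature size and the class counts are illustrative assumptions:

    import torch
    import torch.nn as nn

    dim, n1, n2 = 64, 20, 200            # feature size and per-head class counts (assumed)
    head1 = nn.Linear(dim, n1)           # first-level classification head
    head2 = nn.Linear(dim, n2)           # second-level classification head

    features = torch.randn(1, 8, dim)    # sample audio/video features from the Transformer layers
    pooled = features.mean(dim=1)        # average pooling over the token dimension

    probs1 = head1(pooled).softmax(dim=-1)   # distribution over the first-level classes
    probs2 = head2(pooled).softmax(dim=-1)   # distribution over the second-level classes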
Step 406, determining a classification prediction loss by cross entropy loss calculation based on the sample classification and the corresponding classification truth.
In some embodiments, after obtaining the sample classification of the linear layer output, the computer device may obtain the classification prediction loss by cross entropy loss calculation according to the sample classification and the corresponding classification truth value.
In one possible implementation, since the linear layer outputs probability distributions over the respective classifications, the computer device also needs to normalize the output data of the linear layer with a softmax function in order to achieve computation by cross entropy loss.
In some embodiments, when the title generation model includes at least two linear layers, considering that different linear layers correspond to different classification scales, different classification weights may be set for each linear layer when calculating the classification prediction loss, in order to improve model fine-tuning efficiency.
In one possible implementation, the computer device obtains the first classification weights corresponding to the at least two linear layers respectively, determines, based on the at least two sample classifications and the corresponding classification truth values, a classification predictor loss for each sample classification through cross entropy loss calculation, and then determines the classification prediction loss based on the classification predictor losses and the first classification weights.
In one possible implementation, there are two linear layers in the title generation model, a first-level linear layer and a second-level linear layer, and the classification scale of the second-level linear layer is smaller than that of the first-level linear layer. The computer device inputs the pooled sample audio and video features into the first-level linear layer and the second-level linear layer respectively, obtaining a probability distribution over the $N_1$ classes of the first-level linear layer and a probability distribution over the $N_2$ classes of the second-level linear layer, then obtains the first-level and second-level classification predictor losses through cross entropy loss calculation, and further obtains the classification prediction loss according to the classification weights corresponding to the linear layers.
Optionally, the process of determining the first-level classification predictor loss may be expressed as: $L_1 = -\sum_{i=1}^{N_1} y_{1,i} \log p_{1,i}$, where $N_1$ represents the number of classes in the first-level linear layer, $p_1$ represents the probability distribution over the $N_1$ classes obtained through the first-level linear layer, $y_1$ is the one-hot first-level classification truth value, and $L_1$ is the first-level classification predictor loss.
Optionally, the process of determining the second-level classification predictor loss may be expressed as: $L_2 = -\sum_{i=1}^{N_2} y_{2,i} \log p_{2,i}$, where $N_2$ represents the number of classes in the second-level linear layer, $p_2$ represents the probability distribution over the $N_2$ classes obtained through the second-level linear layer, $y_2$ is the one-hot second-level classification truth value, and $L_2$ is the second-level classification predictor loss.
Illustratively, as shown in fig. 5, the computer device inputs the sample audio/video text information 'Bos', 'x0', 'x1', 'x2', 'x3', 'Sep', 'y0', 'y1' into the title generation model 510, where 'Bos' is the start token, 'Sep' is the separator between different types of text, 'x0' to 'x3' may be the sample description text information, sample audio recognition text information, sample image recognition text information, first-level classification truth value and second-level classification truth value, and 'y0', 'y1' is the title truth value. The title generation model 510 outputs the sample audio/video features, from which 'y0', 'y1', 'Eos' are generated, where 'y0', 'y1' form the first sample title and 'Eos' is the end token. The computer device determines the title prediction loss 501 from the first sample title and the title truth value, performs average pooling on the sample audio/video features, inputs the pooled sample audio/video features into linear layer 1 and linear layer 2 of the title generation model 510, and outputs the first-level sample classification and the second-level sample classification through linear layer 1 and linear layer 2 respectively, so as to calculate the first-level classification loss 502 and the second-level classification loss 503.
And step 407, performing model fine tuning on the title generation model based on the classification prediction loss and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
In some embodiments, after obtaining the classification prediction loss and the title prediction loss, respectively, the computer device may then use the classification prediction loss and the title prediction loss to perform model fine tuning on the title generation model, thereby obtaining the audio/video title generation model.
Alternatively, the process of determining the title prediction loss may be expressed as: $L_0 = -\sum_{i=1}^{n} \log p(y_i)$, where the title sequence corresponding to the first sample title is $Y = \{y_1, y_2, \ldots, y_n\}$, $y_i$ represents each token in the title sequence, $p(y_i)$ represents the predicted probability of token $y_i$, and $L_0$ denotes the title prediction loss.
Alternatively, the process of determining the total loss may be expressed as: $L = L_0 + \lambda_1 L_1 + \lambda_2 L_2$, where $\lambda_1$ is the weight hyperparameter of the first-level classification loss and $\lambda_2$ is the weight hyperparameter of the second-level classification loss.
In some embodiments, considering that the classification prediction performed by the title generation model serves only to improve its understanding of the audio and video content, while the final purpose of training is to make the model output more accurate titles, the computer device may also set different loss weights for the classification prediction loss and the title prediction loss at different stages of fine-tuning, in order to further improve fine-tuning efficiency.
In one possible implementation, the computer device determines a second classification weight corresponding to the classification prediction loss and a title weight corresponding to the title prediction loss, where the second classification weight is negatively correlated with the model fine-tuning round and the title weight is positively correlated with it; that is, as the fine-tuning rounds increase, the second classification weight of the classification prediction loss gradually decreases while the title weight of the title prediction loss gradually increases. Further, the computer device performs model fine-tuning on the title generation model according to the classification prediction loss, the second classification weight, the title prediction loss and the title weight, to obtain the audio/video title generation model.
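A sketch of one way to combine the three losses with round-dependent weights; the application only fixes the directions of change, so the linear schedule and the lambda values below are assumptions:

    def total_loss(l0, l1, l2, round_idx: int, max_rounds: int) -> float:
        # Classification weight shrinks and title weight grows with the fine-tuning round;
        # the linear schedule here is one assumed instantiation of that rule.
        progress = round_idx / max_rounds
        cls_weight = 1.0 - 0.5 * progress      # second classification weight: negatively correlated
        title_weight = 0.5 + 0.5 * progress    # title weight: positively correlated
        lam1, lam2 = 0.5, 0.5                  # per-head weight hyperparameters (illustrative)
        return title_weight * l0 + cls_weight * (lam1 * l1 + lam2 * l2)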
Step 408, inputting the audio and video text information of the target audio and video content into the audio and video title generation model, and outputting the target title corresponding to the target audio and video content through the audio and video title generation model.
For the specific implementation of step 408, reference may be made to step 205, and this embodiment is not described herein.
In the above embodiment, a multitask learning method is adopted, and in the process of performing title prediction on sample audio and video contents by using a title generation model, classification prediction on the sample audio and video contents is added, so that the title prediction loss and the classification prediction loss of the title generation model are determined, and the title generation model is subjected to model fine tuning by using the two losses, thereby improving the efficiency of model fine tuning.
In addition, under the condition of multi-stage classification prediction, the classification prediction loss is optimized by setting different classification weights for different classification scales; and the classification weight and the title weight are dynamically adjusted at different stages of model fine adjustment, so that the model fine adjustment is performed on the title generation model by combining the weight and the loss, the efficiency of model fine adjustment is further optimized, and the output accuracy of the audio and video title generation model is improved.
In some embodiments, in the process of training and fine tuning the model, the computer device may also perform data processing on the input data of the model first in order to improve the processing efficiency of the model on the data.
Referring to fig. 6, a flowchart of an audio/video title generating method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device, and the computer device may be the terminal 120 or the server 140 shown in fig. 1, and the method includes the following steps:
step 601, performing a first level of pre-training on a text generation model based on a text corpus.
Step 602, performing data cleaning and deduplication processing on the second sample titles in the title corpus to obtain processed second sample titles.
In some embodiments, considering that the collected title corpus may include information other than the title text, such as source information of the title text, network tags and pictures, the computer device may perform data cleaning and deduplication processing on the large number of second sample titles contained in the title corpus in order to optimize its data quality, thereby obtaining the processed second sample titles.
In one possible implementation, the computer device may first perform data cleaning on the title corpus, removing irrelevant content such as special characters, HTML tags and emoticons, apply unified encoding to the second sample titles, and delete repeated titles, titles containing too many errors, and nonsensical titles, so that the processed title corpus contains only plain-text, non-repeated second sample titles.
And 603, screening the processed second sample title based on a title screening rule to obtain a screened second sample title, wherein the title screening rule comprises at least one of a title length threshold, a title type and a title browsing amount threshold.
In some embodiments, considering that the second sample titles in the title corpus are diverse, low-quality sample titles are likely to exist, such as titles that are too long or too short, titles of uncommon types, and unpopular titles. Therefore, to improve the training quality of the second-stage pre-training, the computer device may further screen the processed second sample titles based on the title screening rule to obtain the screened second sample titles.
Optionally, the title screening rule includes at least one of a title length threshold, a title type, and a title browsing volume threshold.
In one possible implementation, the computer device may filter the second sample titles based on the title length threshold, considering that too long a title may contain redundant information and increase the difficulty of model training, while too short a title may lack valid information. For example, if the title length threshold is 5 to 20 words, the computer device deletes second sample titles shorter than 5 words and longer than 20 words.
In one possible implementation, the computer device may also filter the second sample title based on the title browsing amount threshold, considering that sample titles with too low a title browsing amount are typically low quality titles. For example, the title browsing amount threshold is 100, so that the computer device can filter the second sample title with the title browsing amount below 100.
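A sketch of the screening rules using the example thresholds above (length 5 to 20, browsing amount 100); the record field names are assumptions:

    def screen_titles(titles):
        # titles: iterable of records like {"text": str, "views": int}; field names assumed
        kept = []
        for t in titles:
            n = len(t["text"])
            if n < 5 or n > 20:       # title length threshold from the example above
                continue
            if t["views"] < 100:      # title browsing amount threshold from the example above
                continue
            kept.append(t)
        return kept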
Step 604, performing second-stage pre-training on the text generation model subjected to the first-stage pre-training based on the screened second sample title, so as to obtain a title generation model.
In some embodiments, after obtaining the screened second sample titles, the computer device further needs to perform data conversion on the second sample titles, converting them into a data form that the text generation model can understand.
In one possible implementation, the computer device may first add identifiers before and after the second sample title, respectively, such as "Bos" before the second sample title, indicating the start of the title; "Eos" is added after the second sample title, indicating the termination of the title. Furthermore, the computer device also needs to perform word segmentation and indexing processing on the second sample title, and in order to enable the text generation model to learn the position information of each word in the second sample title, the computer device may also perform position coding on each word in the second sample title.
In one possible implementation, the second sample title includes N segmented words after being segmented and encoded, and the N segmented words include a starter Bos and a terminator Eos, and then the input of the text generation model is N-1 words except the terminator Eos and output is N-1 words except the starter Bos.
In one illustrative example, the second sample title is 'Making a cake with an electric cooker', whose token sequence is 'use', 'electric cooker', 'make', 'cake'. The title input sequence is then 'Bos', 'use', 'electric cooker', 'make', 'cake', and the output title prediction sequence is 'use', 'electric cooker', 'make', 'cake', 'Eos'.
In some embodiments, after inputting the second sample title into the text generation model and outputting the sample prediction title through the text generation model, the computer device may determine the prediction loss through cross entropy loss calculation according to the second sample title and the sample prediction title, so as to perform second-stage pre-training on the first-stage pre-trained text generation model with this prediction loss, obtaining the title generation model.
Step 605, performing audio recognition on the sample audio-video content to obtain sample original audio recognition text information corresponding to the sample audio-video content.
In some embodiments, using only the descriptive text information of the sample audio and video content as input to the title generation model is insufficient for efficient model fine-tuning. To enable the title generation model to learn richer context information and thus generate titles more accurately, the computer device may further perform audio recognition on the sample audio and video content to obtain the sample original audio recognition text information corresponding to the sample audio and video content.
Optionally, the sample original audio recognition text information characterizes the speech-side features of the sample audio and video content, and may include information such as the spoken content of persons, background music, and narration in the sample audio and video content, which is not limited in the embodiment of the present application.
In one possible implementation, the computer device may use ASR (Automatic Speech Recognition ) technology to perform audio recognition on the sample audiovisual content to obtain sample raw audio recognition text information.
And step 606, performing image recognition on the sample audio/video content to obtain sample original image recognition text information corresponding to the sample audio/video content.
In some embodiments, in addition to the audio identifying text information, in the case that the sample audio-video content is video, the computer device may further perform image identification on the sample audio-video content, so as to obtain sample original image identifying text information corresponding to the sample audio-video content.
Optionally, the sample original image recognition text information characterizes the image-side features of the sample audio and video content, and may include information such as subtitles, bullet comments and on-screen text in video frames, which is not limited in this embodiment of the present application.
In one possible implementation, the computer device may perform image recognition on each video frame in the sample video using OCR (Optical Character Recognition ) techniques to obtain sample raw image recognition text information.
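One possible frame-sampling OCR pass, sketched with OpenCV and pytesseract; the library choice, sampling stride and language pack are assumptions, since the application only names OCR:

    import cv2
    import pytesseract

    def ocr_video(path: str, every_n: int = 30):
        # Sample one frame out of every_n and OCR it; returns raw recognized lines.
        cap = cv2.VideoCapture(path)
        texts, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                text = pytesseract.image_to_string(gray, lang="chi_sim")  # assumed language pack
                if text.strip():
                    texts.append(text.strip())
            idx += 1
        cap.release()
        return texts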
In step 607, filtering and de-duplication processing are performed on the sample original audio recognition text information and the sample original image recognition text information based on the sample description text information and the text association degree of the sample audio/video content, so as to obtain the sample audio recognition text information and the sample image recognition text information.
In some embodiments, various types of audio contained in the sample audio/video content, such as background music, are recognized during the audio recognition process, so the sample original audio recognition text information may contain information irrelevant to the sample audio/video content. Likewise, various characters contained in the image frames, such as background pictures and pattern logos, may be recognized during the image recognition process, so the sample original image recognition text information may also contain information irrelevant to the sample audio/video content. Therefore, in order to optimize model fine-tuning efficiency and improve the accuracy of the model output, the computer device also needs to perform data processing on the sample original audio recognition text information and the sample original image recognition text information.
In one possible implementation, considering that the sample description text information added by the creator is usually a subject description of the sample audio/video content and is therefore strongly associated with it, the computer device may perform filtering, screening and de-duplication processing on the sample original audio recognition text information and the sample original image recognition text information according to their degree of text association with the sample description text information, thereby obtaining the sample audio recognition text information and the sample image recognition text information.
In one possible implementation, the computer device may delete the information in the sample original audio identifying text information that has a low degree of association with the sample audio-video content, and delete the information in the sample original image identifying text information that has a low degree of association with the sample audio-video content.
In one possible implementation, the computer device may also delete duplicate information that exists between the sample original audio identifying text information and the sample original image identifying text information.
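A minimal sketch of such filtering and de-duplication, assuming a simple character-overlap ratio as the text association measure (the embodiment does not specify the measure or the thresholds used here):

```python
from difflib import SequenceMatcher

def association(text, description):
    """Crude text association degree with the creator's description."""
    return SequenceMatcher(None, text, description).ratio()

def filter_and_dedupe(asr_lines, ocr_lines, description,
                      min_assoc=0.2, dup_ratio=0.9):
    """Drop lines weakly associated with the sample description text
    information, then drop near-duplicates across the two modalities."""
    kept = []
    for line in asr_lines + ocr_lines:
        if association(line, description) < min_assoc:
            continue  # e.g. background-music lyrics or watermark text
        if any(SequenceMatcher(None, line, k).ratio() > dup_ratio for k in kept):
            continue  # duplicate between ASR and OCR text
        kept.append(line)
    return kept
```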
Alternatively, the sample description text information may include a simple description, a content tag, a content classification, etc. that the creator adds for the sample audio-video content, which the embodiments of the present application do not limit.
Illustratively, table 1 shows sample video text information corresponding to one sample video.
TABLE 1
And 608, performing word segmentation and position coding on the sample audio/video text information to obtain word vectors corresponding to the sample audio/video text information.
In some embodiments, after obtaining the sample audio/video text information of the sample audio/video content, the computer device also needs to perform data format conversion on the sample audio/video text information.
In one possible implementation manner, the computer device performs word segmentation processing on the sample audio/video text information, and performs position coding according to the position information of each word segment in the text information, so as to obtain a word vector corresponding to the sample audio/video text information, namely determining a word (token) and a position (position) corresponding to each word segment in the sample audio/video text information.
And step 609, text marking is carried out on each word vector based on the text type of the sample audio/video text information, so that word vectors subjected to text marking are obtained, and different text types correspond to different text marks.
In some embodiments, the sample audio/video text information includes sample description text information, sample audio recognition text information and sample image recognition text information, that is, several different text types, which may for example be divided into sample audio/video description, sample audio/video tag, sample audio/video classification, OCR text and ASR text. To improve the efficiency with which the model learns text expression and to optimize model fine-tuning efficiency, the computer device may therefore perform text marking on each word vector according to the text type of the sample audio/video text information, obtaining text-marked word vectors, where different text types correspond to different text marks.
In one possible implementation, the computer device may add a text label (Segment) for each word vector, thereby helping the model distinguish between different text types. For example, a tag of seg=1 is added to a word vector corresponding to a sample audio-video description, a tag of seg=2 is added to a word vector corresponding to a sample audio-video tag, a tag of seg=3 is added to a word vector corresponding to a sample audio-video classification, a tag of seg=4 is added to a word vector corresponding to OCR text, and a tag of seg=5 is added to a word vector corresponding to ASR text.
In one possible implementation, the computer device may further add a start identifier Bos and a stop identifier Eos before and after the sample audio-video text information after the word segmentation process, and add an interval identifier Sep between the word segments of different text types.
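Putting steps 608 and 609 together, a sketch of how the token, position and segment sequences might be assembled; the identifier strings and segment numbering follow the marks described above, while the tokenizer interface and everything else are assumptions:

```python
BOS, EOS, SEP = "Bos", "Eos", "Sep"

# Assumed segment ids per text type, mirroring the Seg marks described above.
SEGMENT_IDS = {"description": 1, "tag": 2, "category": 3, "ocr": 4, "asr": 5}

def build_model_input(fields, tokenize):
    """fields: list of (text_type, text); tokenize: str -> list of tokens.

    Returns parallel token / position / segment sequences in the
    Bos ... Sep ... Eos layout described above.
    """
    tokens, segments = [BOS], [0]        # 0 reserved for special identifiers
    for i, (text_type, text) in enumerate(fields):
        for tok in tokenize(text):
            tokens.append(tok)
            segments.append(SEGMENT_IDS[text_type])
        if i < len(fields) - 1:
            tokens.append(SEP)           # interval identifier between types
            segments.append(0)
    tokens.append(EOS)
    segments.append(0)
    positions = list(range(len(tokens)))
    return tokens, positions, segments
```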
In step 610, the text-tagged word vector is input into a title generation model, and a first sample title corresponding to the sample audio/video content is output through the title generation model.
In some embodiments, after obtaining the text-tagged word vectors, the computer device may input the text-tagged word vectors into a title generation model, and output a first sample title corresponding to the sample audio-video content through the title generation model.
In step 611, based on the title prediction loss between the first sample title and the title true value, the title generation model is subjected to model fine tuning to obtain the audio/video title generation model.
Step 612, inputting the audio/video text information of the target audio/video content into the audio/video title generation model, and outputting the target title corresponding to the target audio/video content through the audio/video title generation model.
Specific embodiments of steps 611 to 612 can refer to steps 204 to 205, and this embodiment is not described herein.
In the above embodiment, during the second-stage pre-training of the text generation model, the second sample titles in the title corpus are screened by applying the title screening rule, which optimizes the model input quality in the second-stage pre-training process and improves model training efficiency.
In addition, in the process of fine tuning a title generation model by using sample audio and video text information, the sample original audio recognition text information and the sample original image recognition text information are filtered and de-duplicated according to the text association degree, so that the model input quality in the fine tuning process is optimized.
In addition, the text information of different text types is marked, so that the model can distinguish information of different text types during fine-tuning, which improves model fine-tuning efficiency and further improves the accuracy and diversity of the titles output by the audio/video title generation model.
In summary, it can be seen from the above embodiments that, in the embodiment of the present application, the process of training the audio/video title generation model to output titles is divided into three stages, namely a first-stage pre-training stage, a second-stage pre-training stage and a model fine-tuning stage. Referring to fig. 7, a flowchart of title generation model training provided in one exemplary embodiment of the present application is shown.
Step 701, obtaining text corpus.
First, the computer device obtains a large amount of text corpus, which covers fields such as news, novels, articles, dialogue, chat, comments and reviews as far as possible; the computer device then performs data cleaning, unified encoding, de-duplication, word segmentation and indexing on the text corpus to obtain the processed text corpus.
Step 702, a first level of pre-training is performed on a text generation model.
The computer equipment inputs the processed text corpus into a text generation model, performs text prediction through the text generation model, outputs a predicted text, calculates a prediction loss according to the text corpus and the predicted text through cross entropy loss, and performs first-stage pre-training on the text generation model through the prediction loss.
In step 703, a topic corpus is obtained.
The computer device acquires the title corpus, which may be collected from audio/video sharing platforms, and then performs data cleaning, unified encoding, de-duplication, word segmentation and indexing on it to obtain the processed title corpus.
Step 704, performing a second level of pre-training on the text generation model.
The computer equipment inputs the processed topic corpus into a text generation model subjected to first-stage pre-training, performs topic prediction through the text generation model, so as to obtain a predicted topic, calculates a prediction loss according to the topic corpus and the predicted topic through cross entropy loss, and performs second-stage pre-training on the text generation model subjected to the first-stage pre-training through the prediction loss, so as to obtain the topic generation model.
Step 705, fine tuning the title generation model.
The computer equipment inputs sample audio and video text information of sample audio and video content into a title generation model, and generates a sample title through the title generation model, so that title prediction loss is determined according to the sample title and a title true value, and the title generation model is subjected to model fine tuning by the title prediction loss, so that the audio and video title generation model is obtained.
In some embodiments, different creators may have different title requirements when sharing audio/video content, for example setting a content title intended to attract more clicks or more praise. To further optimize the output quality of the audio/video title generation model, the computer device may therefore add data of the sample title information dimension in the process of fine-tuning the title generation model, where the sample title information may include at least one of a sample audio/video click amount, a sample audio/video praise amount and a sample audio/video play amount.
Referring to fig. 8, a flowchart of an audio/video title generating method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device, and the computer device may be the terminal 120 or the server 140 shown in fig. 1, and the method includes the following steps:
step 801, performing a first level of pre-training on a text generation model based on a text corpus.
Step 802, performing second-level pre-training on the text generation model subjected to the first-level pre-training based on the topic corpus to obtain a topic generation model.
Specific embodiments of steps 801 to 802 may refer to steps 201 to 202, and this embodiment is not described herein.
Step 803, sample title information corresponding to the sample audio/video content is obtained, where the sample title information includes at least one of a sample audio/video click amount, a sample audio/video praise amount, and a sample audio/video play amount.
In some embodiments, in order for the model to learn, during fine-tuning, how strongly the title corresponding to each title truth value attracts viewers and followers, and to determine which types of titles can attract more clicks or views, when obtaining the sample audio/video text information of the sample audio/video content the computer device may further obtain sample title information corresponding to the sample audio/video content, where the sample title information includes at least one of a sample audio/video click amount, a sample audio/video praise amount and a sample audio/video play amount.
In step 804, the sample title information and the sample audio/video text information are input into the title generation model, and the first sample title corresponding to the sample audio/video content is output through the title generation model.
In some embodiments, after the sample title information is acquired, the computer device also needs to perform data processing and encoding on the sample title information, so that the sample title information and the sample audio/video text information are input into a title generation model, and a first sample title corresponding to the sample audio/video content is output through the title generation model.
And step 805, performing model fine tuning on the title generation model based on the sample title information and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
In some embodiments, in order for the audio/video title generation model to be able to output titles according to different title generation requirements, the computer device needs to perform model fine tuning on the title generation model based on sample title information in addition to the title prediction loss between the first sample title and the title true value, to obtain the audio/video title generation model.
In one possible implementation, the computer device may tag the sample audio-video text information and the title prediction loss based on the sample title information, e.g., tag the title prediction loss corresponding to the sample audio-video with the sample audio-video click volume.
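One way such tagging could look in practice (the bucket boundaries and tag tokens below are purely illustrative assumptions, not specified by the embodiment):

```python
def click_tag(clicks):
    """Bucket the sample audio/video click amount into a coarse tag token."""
    if clicks >= 1_000_000:
        return "[CLICK_HIGH]"
    if clicks >= 10_000:
        return "[CLICK_MID]"
    return "[CLICK_LOW]"

def tag_training_sample(tokens, sample_title_info):
    """Prepend the popularity tag to the sample audio/video text tokens so
    the title prediction loss is associated with the engagement that the
    ground-truth title actually attracted."""
    return [click_tag(sample_title_info["clicks"])] + tokens
```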
Step 806, obtaining a title generation target corresponding to the target audio/video content, where the title generation target includes at least one of a target audio/video click amount, a target audio/video praise amount and a target audio/video play amount.
In some embodiments, after obtaining the trained audio and video title generation model, in order to generate a target title according to different needs of the creator, the computer device may further obtain a title generation target corresponding to the target audio and video content, where the title generation target includes at least one of a target audio and video click amount, a target audio and video praise amount and a target audio and video play amount.
Step 807, inputting the audio/video text information and the title generation target of the target audio/video content into an audio/video title generation model, and outputting the target title corresponding to the target audio/video content through the audio/video title generation model.
In some embodiments, after obtaining the title generation target corresponding to the target audio/video content, the computer device may input the audio/video text information of the target audio/video content and the title generation target into the audio/video title generation model, which then outputs target titles matching the different title generation targets.
In one possible implementation manner, the computer device may perform word segmentation and indexing processing on the audio and video text information of the target audio and video content, so as to obtain a word vector, a position code, a type mark and the like corresponding to the audio and video text information, and may further perform target marking on the word vector based on the title generation target, further input a series of text data into the audio and video title generation model, and output a target title through the audio and video title generation model.
In one possible implementation, where the title generation target is to obtain more clicks, the computer device may output, through the audio/video title generation model, a target title expected to achieve a higher click amount. In another possible implementation, the title generation targets include a target audio/video click amount, a target audio/video praise amount and a target audio/video play amount, and the computer device may output, through the audio/video title generation model, a target title oriented to higher clicks, a target title oriented to more praise and a target title oriented to more plays respectively, for the creator to select according to the current title requirement.
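Continuing the tagging sketch above, generation conditioned on a title generation target could look like this (model.generate and the tag vocabulary are assumptions for illustration):

```python
TARGET_TAGS = {
    "clicks": "[CLICK_HIGH]",   # ask for a title expected to draw clicks
    "praise": "[LIKE_HIGH]",    # ask for a title expected to draw praise
    "plays":  "[PLAY_HIGH]",    # ask for a title expected to draw plays
}

def generate_target_titles(model, tokens, targets):
    """Return one candidate title per requested title generation target."""
    return {t: model.generate([TARGET_TAGS[t]] + tokens) for t in targets}

# e.g. generate_target_titles(model, tokens, ["clicks", "praise", "plays"])
```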
In the above embodiment, in the process of performing model fine tuning on the title generation model, by acquiring the sample title information, the title generation model learns the click quantity, praise quantity, and the like of the audio and video contents of each sample in the process of learning the sample audio and video text information, so that in the process of generating the target title by applying the model, the corresponding target title can be output according to the needs of the creator, and the title output quality of the audio and video title generation model is optimized.
Referring to fig. 9, a schematic diagram of a structure of an output target title using an audio/video title generation model according to an exemplary embodiment of the present application is shown.
First, the computer device acquires the audio/video text information of the target audio/video content and performs word segmentation and encoding processing on it. The resulting sequence "Bos, x0, x1, x2, x3, Sep, x4, x5" is then input into the audio/video title generation model 910, where "Bos" is the start identifier, "Sep" is the interval identifier separating different text types, "x0, x1, x2, x3" belong to one text type (one of the target audio/video description, target audio/video tag, target audio/video classification, target OCR text and target ASR text), and "x4, x5" belong to a different text type; for simplicity, only two text types are shown in fig. 9. The audio/video title generation model 910 then decodes token by token until the terminator is produced, and outputs the generated titles, for example title 1 and title 2 in fig. 9.
It should be noted that, in the embodiments of the present application, the steps of each embodiment may be combined with each other, for example, the computer device may perform model fine adjustment on the title generation model based on the title prediction loss, the classification prediction loss and the sample title information at the same time; for another example, in the embodiment of performing model fine tuning on the title generation model based on the title prediction loss and the classification prediction loss, the computer device may also perform data processing on the sample audio/video text information, and other possible embodiment step combinations are not limited in this application.
Referring to fig. 10, a block diagram of an audio/video title generating apparatus according to an exemplary embodiment of the present application is shown, where the apparatus includes:
a first training module 1001, configured to perform a first level of pre-training on a text generation model based on a text corpus;
a second training module 1002, configured to perform second-level pre-training on the text generation model after the first-level pre-training based on the topic corpus, to obtain a topic generation model;
a first output module 1003, configured to input sample audio/video text information of sample audio/video content into the title generation model, and output, through the title generation model, a first sample title corresponding to the sample audio/video content, where the sample audio/video text information includes at least one of sample descriptive text information, sample audio recognition text information, and sample image recognition text information;
The fine tuning module 1004 is configured to perform model fine tuning on the title generation model based on the title prediction loss between the first sample title and the title truth value, so as to obtain an audio/video title generation model;
the second output module 1005 is configured to input the audio/video text information of the target audio/video content into the audio/video title generation model, and output, through the audio/video title generation model, a target title corresponding to the target audio/video content.
Optionally, the first output module 1003 includes:
the characteristic output unit is used for inputting the sample audio/video text information of the sample audio/video content into the title generation model and outputting sample audio/video characteristics through a conversion layer of the title generation model;
the title generation unit is used for generating a first sample title corresponding to the sample audio/video content based on the sample audio/video characteristics;
and the classification output unit is used for inputting the sample audio and video characteristics into a linear layer of the title generation model, and outputting sample classification corresponding to the sample audio and video contents through the linear layer.
Optionally, the classification output unit is configured to:
pooling the sample audio and video features to obtain pooled sample audio and video features;
And inputting the pooled sample audio and video characteristics into at least one linear layer of the title generation model, and outputting at least one sample classification corresponding to the sample audio and video contents through the at least one linear layer, wherein the classification scales of the sample classifications corresponding to different linear layers are different.
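For illustration only, a sketch of this pooling and multi-scale linear-layer arrangement; the hidden size, the number of heads and the choice of mean pooling are assumptions, not specified by the embodiment:

```python
import torch.nn as nn

class MultiScaleClassifier(nn.Module):
    """Mean-pool the conversion-layer features, then classify at two
    scales (e.g. coarse categories and fine-grained sub-categories)."""
    def __init__(self, hidden, n_coarse, n_fine):
        super().__init__()
        self.coarse_head = nn.Linear(hidden, n_coarse)
        self.fine_head = nn.Linear(hidden, n_fine)

    def forward(self, features):         # features: [batch, seq_len, hidden]
        pooled = features.mean(dim=1)    # pooled sample audio/video feature
        return self.coarse_head(pooled), self.fine_head(pooled)
```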
Optionally, the fine tuning module 1004 includes:
the loss calculation unit is used for determining classification prediction loss through cross entropy loss calculation based on the sample classification and the corresponding classification true value;
and the model fine tuning unit is used for carrying out model fine tuning on the title generation model based on the classification prediction loss and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
Optionally, in the case that the title generation model includes at least two linear layers, the model fine-tuning unit is configured to:
acquiring first classification weights corresponding to at least two linear layers respectively;
determining the classification predictor loss corresponding to each sample classification through cross entropy loss calculation based on at least two sample classifications and corresponding classification truth values;
The classification predictor penalty is determined based on the classification predictor penalty and the first classification weight.
Optionally, the model fine tuning unit is configured to:
determining a second classification weight corresponding to the classification prediction loss and a title weight corresponding to the title prediction loss, wherein the second classification weight and a model fine tuning round form a negative correlation, and the title weight and the model fine tuning round form a positive correlation;
and performing model fine adjustment on the title generation model based on the classification prediction loss, the second classification weight, the title prediction loss and the title weight to obtain the audio/video title generation model.
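For illustration only, a sketch of how the weighted losses described by this unit could be combined; the linear schedule below is just one assumption satisfying the requirement that the second classification weight decrease, and the title weight increase, with the fine-tuning round:

```python
def fine_tune_loss(title_loss, cls_losses, cls_head_weights, epoch, total_epochs):
    """Combine title and classification losses with round-dependent weights.

    cls_losses / cls_head_weights: per-linear-layer classification losses
    and their first classification weights.
    """
    cls_loss = sum(w * l for w, l in zip(cls_head_weights, cls_losses))
    progress = epoch / max(1, total_epochs - 1)
    second_cls_weight = 1.0 - progress   # decays as fine-tuning proceeds
    title_weight = progress              # grows as fine-tuning proceeds
    return title_weight * title_loss + second_cls_weight * cls_loss
```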
Optionally, before the sample audio/video text information of the sample audio/video content is input into the title generation model and the first sample title corresponding to the sample audio/video content is output through the title generation model, the apparatus further includes:
the audio recognition module is used for carrying out audio recognition on the sample audio-video content to obtain sample original audio recognition text information corresponding to the sample audio-video content;
the image recognition module is used for carrying out image recognition on the sample audio and video content to obtain sample original image recognition text information corresponding to the sample audio and video content;
And the information processing module is used for filtering, screening and de-duplicating the sample original audio identification text information and the sample original image identification text information based on the sample description text information and the text association degree of the sample audio and video content to obtain the sample audio identification text information and the sample image identification text information.
Optionally, the first output module 1003 is configured to:
performing word segmentation and position coding on the sample audio and video text information to obtain word vectors corresponding to the sample audio and video text information;
text marking is carried out on each word vector based on the text type of the sample audio/video text information, the word vector marked by the text is obtained, and different text types correspond to different text marks;
and inputting the word vector subjected to text marking into the title generation model, and outputting the first sample title corresponding to the sample audio and video content through the title generation model.
Optionally, the first output module 1003 is further configured to:
acquiring sample title information corresponding to sample audio and video content, wherein the sample title information comprises at least one of sample audio and video click quantity, sample audio and video praise quantity and sample audio and video play quantity;
Inputting the sample title information and the sample audio/video text information into the title generation model, and outputting the first sample title corresponding to the sample audio/video content through the title generation model;
the fine tuning module 1004 is configured to:
and performing model fine tuning on the title generation model based on the sample title information and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
Optionally, the second output module 1005 is configured to:
acquiring a title generation target corresponding to the target audio and video content, wherein the title generation target comprises at least one of target audio and video click quantity, target audio and video praise quantity and target audio and video play quantity;
and inputting the audio and video text information and the title generation targets of the target audio and video contents into the audio and video title generation model, and outputting the target titles corresponding to the target audio and video contents through the audio and video title generation model.
Optionally, the second training module 1002 is configured to:
performing data cleaning and de-duplication processing on a second sample title in the title corpus to obtain the processed second sample title;
Screening the processed second sample title based on a title screening rule to obtain the screened second sample title, wherein the title screening rule comprises at least one of a title length threshold, a title type and a title browsing amount threshold;
and performing second-stage pre-training on the text generation model subjected to the first-stage pre-training based on the screened second sample title, so as to obtain the title generation model.
In summary, in the embodiment of the present application, first-stage pre-training is performed on a text generation model by using a text corpus, so that the text generation model can learn basic grammar, syntax and semantic knowledge of natural language; secondly, performing second-level pre-training on the text generation model subjected to the first-level pre-training by using the title corpus to obtain a title generation model, so that the title generation model can learn the language characteristics, styles and common expression modes of the audio and video title; after two-stage pre-training is completed, the sample audio and video text information of the sample audio and video content is input into a title generation model, and a first sample title is output through the title generation model, so that the title generation model is subjected to model fine adjustment according to the title prediction loss between the first sample title and a title true value, and the audio and video title generation model is obtained, so that the audio and video content can be better understood by the audio and video title generation model, and the generalization capability is improved; and finally, inputting the audio and video text information of the target audio and video content into the audio and video title generation model to obtain the target title corresponding to the target audio and video content output by the audio and video title generation model.
By adopting the scheme provided by the embodiment of the application, the title generation model is obtained through two-stage pre-training, and then the title generation model is subjected to model fine adjustment by utilizing the sample audio and video text information of the sample audio and video content, so that the audio and video title generation model is obtained, the audio and video title generation model can better understand the semantic information of the audio and video, the accuracy and the diversity of the audio and video title generation model in the process of generating the audio and video title are improved, the generation efficiency of the audio and video title is optimized, and the title quality of the audio and video title is improved.
It should be noted that: the apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and detailed implementation processes of the method embodiments are described in the method embodiments, which are not repeated herein.
It should be noted that, before and during the process of collecting relevant user data such as target audio/video content, a prompt interface, a popup window or output voice prompt information may be displayed, where the prompt interface, popup window or voice prompt information is used to prompt the user to collect relevant data currently, so that the present application only starts to execute the relevant step of obtaining relevant data of the user after obtaining the confirmation operation of the user to the prompt interface or popup window, otherwise (i.e. when the confirmation operation of the user to the prompt interface or popup window is not obtained), ends the relevant step of obtaining relevant data of the user, i.e. does not obtain relevant data of the user. In other words, the information (including but not limited to user equipment information, user personal information, etc., user corresponding operation data), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to herein are all user authorized or fully authorized by the parties, and the collection, use, and processing of relevant data requires compliance with relevant laws and regulations and standards of the relevant country and region. For example, the data such as the target audio/video content referred to in the present application is acquired under the condition of sufficient authorization.
Referring to fig. 11, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer device 1100 includes a central processing unit (Central Processing Unit, CPU) 1101, a system memory 1104 including a random access memory 1102 and a read-only memory 1103, and a system bus 1105 connecting the system memory 1104 and the central processing unit 1101. The computer device 1100 also includes a basic input/output system (I/O system) 1106, which helps to transfer information between the various devices within the computer, and a mass storage device 1107 for storing an operating system 1113, application programs 1114, and other program modules 1115.
The basic input/output system 1106 includes a display 1108 for displaying information and an input device 1109, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1108 and the input device 1109 are both coupled to the central processing unit 1101 through an input-output controller 1110 coupled to the system bus 1105. The basic input/output system 1106 may also include an input/output controller 1110 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1110 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1107 is connected to the central processing unit 1101 through a mass storage controller (not shown) connected to the system bus 1105. The mass storage device 1107 and its associated computer-readable media provide non-volatile storage for the computer device 1100. That is, the mass storage device 1107 may include a computer-readable medium (not shown), such as a hard disk or an optical disc.
The computer-readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes random access memory (RAM), read-only memory (ROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the foregoing. The system memory 1104 and mass storage device 1107 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1101, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1101 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1100 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1100 may be connected to the network 1111 via a network interface unit 1112 coupled to the system bus 1105, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1112.
The embodiment of the application also provides a computer readable storage medium, and at least one computer instruction is stored in the readable storage medium, and the at least one computer instruction is loaded and executed by a processor to implement the audio/video title generation method described in the above embodiment.
Alternatively, the computer-readable storage medium may include: ROM, RAM, a solid-state drive (SSD, Solid State Drive), an optical disc, or the like. The RAM may include, among other things, resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
Embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio/video title generating method described in the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (15)

1. An audio/video title generation method, the method comprising:
performing first-stage pre-training on a text generation model based on text corpus;
Performing second-stage pre-training on the text generation model subjected to the first-stage pre-training based on the topic corpus to obtain a topic generation model;
inputting sample audio and video text information of sample audio and video contents into the title generation model, and outputting a first sample title corresponding to the sample audio and video contents through the title generation model, wherein the sample audio and video text information comprises at least one of sample descriptive text information, sample audio identification text information and sample image identification text information;
performing model fine adjustment on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio/video title generation model;
and inputting the audio and video text information of the target audio and video content into the audio and video title generation model, and outputting a target title corresponding to the target audio and video content through the audio and video title generation model.
2. The method according to claim 1, wherein inputting sample audio-visual text information of the sample audio-visual content into the title generation model, outputting a first sample title corresponding to the sample audio-visual content through the title generation model, comprises:
Inputting the sample audio/video text information of the sample audio/video content into the title generation model, and outputting sample audio/video characteristics through a conversion layer of the title generation model;
generating a first sample title corresponding to the sample audio-video content based on the sample audio-video characteristics;
and inputting the sample audio and video characteristics into a linear layer of the title generation model, and outputting sample classification corresponding to the sample audio and video contents through the linear layer.
3. The method of claim 2, wherein inputting the sample audio-video features into the linear layer of the title generation model, outputting the sample classification corresponding to the sample audio-video content through the linear layer, comprises:
pooling the sample audio and video features to obtain pooled sample audio and video features;
and inputting the pooled sample audio and video characteristics into at least one linear layer of the title generation model, and outputting at least one sample classification corresponding to the sample audio and video contents through the at least one linear layer, wherein the classification scales of the sample classifications corresponding to different linear layers are different.
4. The method of claim 2, wherein performing model fine-tuning on the title generation model based on the title prediction loss between the first sample title and the title truth value to obtain an audio-video title generation model comprises:
determining a classification prediction loss through cross entropy loss calculation based on the sample classification and the corresponding classification truth value;
and performing model fine tuning on the title generation model based on the classification prediction loss and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
5. The method of claim 4, wherein, in the case where at least two linear layers are included in the title generation model, the determining a classification prediction loss by cross entropy loss calculation based on the sample classification and the corresponding classification truth value comprises:
acquiring first classification weights corresponding to at least two linear layers respectively;
determining the classification predictor loss corresponding to each sample classification through cross entropy loss calculation based on at least two sample classifications and corresponding classification truth values;
The classification predictor penalty is determined based on the classification predictor penalty and the first classification weight.
6. The method of claim 4, wherein the performing model fine-tuning on the title generation model based on the classification prediction loss and the title prediction loss between the first sample title and the title truth value comprises:
determining a second classification weight corresponding to the classification prediction loss and a title weight corresponding to the title prediction loss, wherein the second classification weight and a model fine tuning round form a negative correlation, and the title weight and the model fine tuning round form a positive correlation;
and performing model fine adjustment on the title generation model based on the classification prediction loss, the second classification weight, the title prediction loss and the title weight to obtain the audio/video title generation model.
7. The method according to claim 1, wherein before inputting the sample audio-visual text information of the sample audio-visual content into the title generation model and outputting the first sample title corresponding to the sample audio-visual content through the title generation model, the method further comprises:
Performing audio recognition on the sample audio-video content to obtain sample original audio recognition text information corresponding to the sample audio-video content;
performing image recognition on the sample audio and video content to obtain sample original image recognition text information corresponding to the sample audio and video content;
and filtering, screening and de-duplicating the sample original audio identification text information and the sample original image identification text information based on the sample description text information and the text association degree of the sample audio and video content to obtain the sample audio identification text information and the sample image identification text information.
8. The method according to claim 1, wherein inputting sample audio-visual text information of the sample audio-visual content into the title generation model, outputting a first sample title corresponding to the sample audio-visual content through the title generation model, comprises:
performing word segmentation and position coding on the sample audio and video text information to obtain word vectors corresponding to the sample audio and video text information;
text marking is carried out on each word vector based on the text type of the sample audio/video text information, the word vector marked by the text is obtained, and different text types correspond to different text marks;
And inputting the word vector subjected to text marking into the title generation model, and outputting the first sample title corresponding to the sample audio and video content through the title generation model.
9. The method according to claim 1, wherein inputting sample audio-visual text information of the sample audio-visual content into the title generation model, outputting a first sample title corresponding to the sample audio-visual content through the title generation model, comprises:
acquiring sample title information corresponding to sample audio and video content, wherein the sample title information comprises at least one of sample audio and video click quantity, sample audio and video praise quantity and sample audio and video play quantity;
inputting the sample title information and the sample audio/video text information into the title generation model, and outputting the first sample title corresponding to the sample audio/video content through the title generation model;
and performing model fine tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio/video title generation model, wherein the method comprises the following steps:
and performing model fine tuning on the title generation model based on the sample title information and the title prediction loss between the first sample title and the title true value to obtain the audio/video title generation model.
10. The method according to claim 9, wherein inputting the audio/video text information of the target audio/video content into the audio/video title generation model, and outputting the target title corresponding to the target audio/video content through the audio/video title generation model, comprises:
acquiring a title generation target corresponding to the target audio and video content, wherein the title generation target comprises at least one of target audio and video click quantity, target audio and video praise quantity and target audio and video play quantity;
and inputting the audio and video text information and the title generation targets of the target audio and video contents into the audio and video title generation model, and outputting the target titles corresponding to the target audio and video contents through the audio and video title generation model.
11. The method according to claim 1, wherein the performing the second-stage pre-training on the text generation model subjected to the first-stage pre-training based on the topic corpus to obtain a topic generation model comprises:
performing data cleaning and de-duplication processing on a second sample title in the title corpus to obtain the processed second sample title;
screening the processed second sample title based on a title screening rule to obtain the screened second sample title, wherein the title screening rule comprises at least one of a title length threshold, a title type and a title browsing amount threshold;
And performing second-stage pre-training on the text generation model subjected to the first-stage pre-training based on the screened second sample title, so as to obtain the title generation model.
12. An audio/video title generation apparatus, the apparatus comprising:
the first training module is used for carrying out first-stage pre-training on the text generation model based on the text corpus;
the second training module is used for carrying out second-level pre-training on the text generation model subjected to the first-level pre-training based on the topic corpus to obtain a topic generation model;
the first output module is used for inputting sample audio and video text information of sample audio and video contents into the title generation model, outputting a first sample title corresponding to the sample audio and video contents through the title generation model, wherein the sample audio and video text information comprises at least one of sample description text information, sample audio identification text information and sample image identification text information;
the fine tuning module is used for carrying out model fine tuning on the title generation model based on the title prediction loss between the first sample title and the title true value to obtain an audio and video title generation model;
And the second output module is used for inputting the audio and video text information of the target audio and video content into the audio and video title generation model, and outputting the target title corresponding to the target audio and video content through the audio and video title generation model.
13. A computer device, the computer device comprising a processor and a memory; the memory stores at least one computer instruction for execution by the processor to implement the audio/video title generation method of any one of claims 1 to 11.
14. A computer readable storage medium having stored therein at least one computer instruction that is loaded and executed by a processor to implement the audio/video title generation method of any one of claims 1 to 11.
15. A computer program product, the computer program product comprising computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform the audio/video title generation method according to any one of claims 1 to 11.
CN202311432807.8A 2023-10-30 2023-10-30 Audio and video title generation method, device, equipment and storage medium Pending CN117688943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311432807.8A CN117688943A (en) 2023-10-30 2023-10-30 Audio and video title generation method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117688943A 2024-03-12
