CN114611498A - Title generation method, model training method and device - Google Patents

Title generation method, model training method and device Download PDF

Info

Publication number
CN114611498A
Authority
CN
China
Prior art keywords
information
text
feature
title
target
Prior art date
Legal status
Pending
Application number
CN202210271572.8A
Other languages
Chinese (zh)
Inventor
徐鲁辉
熊鹏飞
陈宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210271572.8A
Publication of CN114611498A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a title generation method, a model training method, and an apparatus, belonging to the technical field of computers. The method includes: acquiring a target multimedia object; determining target feature information corresponding to media information in the target multimedia object and text feature information corresponding to text information in the target multimedia object; and performing cross-modal semantic analysis processing on the target feature information and the text feature information based on a cross-modal information processing model, and outputting a title text corresponding to the target multimedia object. According to the technical solution provided by the embodiments of the application, the target feature information corresponding to the media information of the target modality in the target multimedia object and the text feature information corresponding to the text information in the target multimedia object are determined, the cross-modal information processing model performs cross-modal semantic analysis processing on the feature information corresponding to the target modality and the text modality, and the title text of the target multimedia object is output automatically, which improves title generation efficiency and title accuracy.

Description

Title generation method, model training method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a title generation method, a model training method, and an apparatus.
Background
With the rapid development of internet technology, a large amount of multimedia content is generated on the internet every day. Faced with such a wide variety of multimedia content, users usually decide which content to browse according to its title.
In the related art, when a content creator publishes multimedia content, a corresponding content title is filled in manually on the content publishing page for the multimedia content to be published.
Because the title of the multimedia content in the related art depends on manual filling, title generation efficiency and title accuracy are low.
Disclosure of Invention
The embodiment of the application provides a title generation method, a model training method and a device, which can improve the title generation efficiency and the title accuracy of a multimedia object.
According to an aspect of an embodiment of the present application, there is provided a title generation method, including:
acquiring a target multimedia object, wherein the target multimedia object comprises media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
determining target characteristic information corresponding to the media information and text characteristic information corresponding to the text information;
performing cross-modal semantic analysis processing on the target characteristic information and the text characteristic information based on a cross-modal information processing model, and outputting a title text corresponding to the target multimedia object;
the cross-modal information processing model is a machine learning model obtained by training with the feature information of a multimedia sample object corresponding to the target modality and the text modality as sample data.
According to an aspect of an embodiment of the present application, there is provided a model training method, including:
acquiring a first multimedia sample object, wherein the first multimedia sample object comprises first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
determining target characteristic information corresponding to the first media information and text characteristic information corresponding to the text information;
acquiring a cross-modal information processing model to be trained;
performing model training on the cross-modal information processing model to be trained based on the target characteristic information and the text characteristic information, and outputting a title text corresponding to the first multimedia sample object;
determining first model loss information based on the title text and the text information, the first model loss information being used to characterize a semantic matching degree between the title text and the first multimedia sample object;
and under the condition that the first model loss information meets a first loss condition, obtaining a trained cross-modal information processing model.
According to an aspect of an embodiment of the present application, there is provided a title generation apparatus, including:
the system comprises an object acquisition module, a display module and a display module, wherein the object acquisition module is used for acquiring a target multimedia object, the target multimedia object comprises media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
the characteristic determining module is used for determining target characteristic information corresponding to the media information and text characteristic information corresponding to the text information;
the title output module is used for performing cross-modal semantic analysis processing on the target characteristic information and the text characteristic information based on a cross-modal information processing model and outputting a title text corresponding to the target multimedia object;
the cross-modal information processing model is a machine learning model obtained by training with the feature information of a multimedia sample object corresponding to the target modality and the text modality as sample data.
According to an aspect of an embodiment of the present application, there is provided a model training apparatus, including:
the multimedia display device comprises a sample object acquisition module, a display module and a display module, wherein the sample object acquisition module is used for acquiring a first multimedia sample object, the first multimedia sample object comprises first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
the characteristic determining module is used for determining target characteristic information corresponding to the first media information and text characteristic information corresponding to the text information;
the model acquisition module is used for acquiring a cross-modal information processing model to be trained;
a title output module, configured to perform model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and output a title text corresponding to the first multimedia sample object;
a loss information determination module, configured to determine first model loss information based on the title text and the text information, where the first model loss information is used to characterize a semantic matching degree between the title text and the first multimedia sample object;
and the model determining module is used for obtaining a trained cross-modal information processing model under the condition that the first model loss information meets a first loss condition.
According to an aspect of embodiments of the present application, there is provided a computer device including a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the above-mentioned title generation method, or the above-mentioned model training method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the above-described title generation method or the above-described model training method.
According to an aspect of embodiments herein, there is provided a computer program product comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to implement the above-described title generation method or the above-described model training method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
and training a cross-modal information processing model based on the characteristic information of the multimedia sample object corresponding to the target modality and the text modality, so that the trained cross-modal information processing model can perform cross-modal semantic analysis processing. For a target multimedia object needing to generate a title text, feature information corresponding to media information of a target mode in the multimedia object and feature information corresponding to text information in the multimedia object can be respectively determined, then cross-mode semantic analysis processing is carried out on the feature information corresponding to the target mode and the text mode based on the cross-mode information processing model, the title text of the target multimedia object is automatically output, and title generation efficiency and title accuracy are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a first flowchart of a title generation method provided by an embodiment of the present application;
FIG. 3 is a second flowchart of a title generation method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a model structure of a cross-modal information processing model;
FIG. 5 is a third flowchart of a title generation method provided by an embodiment of the present application;
FIG. 6 is a fourth flowchart of a title generation method provided by an embodiment of the present application;
FIG. 7(a) is a first schematic diagram of an exemplary video cover;
FIG. 7(b) is a second schematic diagram of an exemplary video cover;
FIG. 8 is a fifth flowchart of a title generation method provided by an embodiment of the present application;
FIG. 9 is a sixth flowchart of a title generation method provided by an embodiment of the present application;
FIG. 10(a) is a first schematic diagram of an exemplary video;
FIG. 10(b) is a second schematic diagram of an exemplary video;
FIG. 11 is a seventh flowchart of a title generation method provided by an embodiment of the present application;
FIG. 12 is a first flowchart of a model training method provided by an embodiment of the present application;
FIG. 13 is a second flowchart of a model training method provided by an embodiment of the present application;
FIG. 14 is a third flowchart of a model training method provided by an embodiment of the present application;
FIG. 15 is a fourth flowchart of a model training method provided by an embodiment of the present application;
FIG. 16 is a block diagram of a title generation apparatus provided by an embodiment of the present application;
FIG. 17 is a block diagram of a title generation apparatus provided by an embodiment of the present application;
FIG. 18 is a block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The title generation method provided by the embodiment of the application relates to artificial intelligence technology, and is briefly described below to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify and measure targets and perform other machine-vision tasks, and then performs further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, a game console, an electronic book reader, a multimedia playing device, a wearable device, an aircraft, and other electronic devices. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a streaming content service. Typically, the application is a video application. Of course, streaming content services may also be provided in other types of applications besides video applications. For example, the application may be a news application, a social application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, or the like, which is not limited in the embodiments of the present application. In some embodiments, the above streaming content service covers many vertical content categories such as art, movies, news, finance, sports, entertainment, and games. Optionally, the streaming content service includes multimedia objects in many forms, such as articles, pictures, videos, short videos, live broadcasts, titles, and columns.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Before describing the method embodiments provided in the present application, a brief description is given to the application scenarios, related terms, or terms that may be involved in the method embodiments of the present application, so as to facilitate understanding by those skilled in the art of the present application.
A Transformer is a model structure based on self-attention that includes an encoder and a decoder.
OCR (Optical Character Recognition) refers to the process by which an electronic device examines characters and translates their shapes into computer text using character recognition methods.
In a Unified Language Model (UniLM), the model parameters are shared among the language model task objectives (i.e., a bidirectional language model task, a unidirectional language model task, and a sequence-to-sequence language model task). Optionally, the access of each corpus unit to its context is controlled by different self-attention mask information (self-attention masks).
VUniLM (Video Unified Language Model) is a unified language model that supports cross-modal semantic analysis processing and supports inputting feature data of non-text modalities, such as video feature data. The cross-modal information processing model in the embodiments of the present application may be a VUniLM.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model that can generate deep bidirectional language representations.
GPT (Generative Pre-Training model) is a language model that predicts the next word based on the preceding context.
CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text segment.
CLS is a special token used in Transformer-like models, originating from the start token in the BERT model.
Please refer to fig. 2, which illustrates a first flowchart of a title generation method according to an embodiment of the present application. The method can be applied to a computer device which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method can comprise the following steps (210-230).
Step 210, a target multimedia object is obtained.
Optionally, the target multimedia object is a multimedia object having media information of at least two modalities, including but not limited to a video object, an audio object, a graphics object, and the like. The target multimedia object may be a multimedia object without a title or a multimedia object with a title. Optionally, for a titled object, the length of the existing title corresponding to the target multimedia object is greater than an upper title-length threshold or less than a lower title-length threshold.
Optionally, the target multimedia object includes media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality. In some possible scenarios, the target modalities include, but are not limited to, visual modalities, audio modalities, and the like.
In an exemplary embodiment, the text information includes at least one text corpus unit. Optionally, the text information includes content text information and title text information corresponding to the target multimedia object. Correspondingly, the text information includes text corpus units in the content text information and text corpus units in the title text information. A text corpus unit is a character or a word.
Optionally, the content text information includes video text information, audio text information, and the like. The video text information includes identification text information, subtitle text information, voice-over text information, transcribed text information, and the like corresponding to the video. The identification text information refers to text information obtained by performing text recognition on video frames in the video, and the transcribed text information refers to text information obtained by performing speech recognition on the audio in the video. The audio text information includes transcribed text information, voice-over text information, and the like corresponding to the audio. The embodiment of the present application does not limit the text information.
In an exemplary embodiment, the media information includes at least one image corresponding to the target multimedia object. In some application scenarios, the target multimedia object comprises at least one image. For example, if the target multimedia object is a video, the at least one image may be a video frame in the video. For another example, if the target multimedia object is a text object, the at least one image may be an image in the text object or a video frame of a video included in the text object.
In a possible implementation, the target multimedia object is a target video, and frame extraction processing is performed on the target video to obtain the at least one image. Optionally, the frame extraction frequency corresponding to the frame extraction processing is 1 FPS (one frame per second).
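For illustration only, the frame extraction at 1 FPS described above could be sketched as follows in Python, assuming the target video is available as a local file and OpenCV is used for decoding; the function name, the file path, and the fallback frame rate are assumptions and are not part of the patent.

```python
import cv2  # OpenCV, assumed available for video decoding


def extract_frames_at_1fps(video_path: str):
    """Sample roughly one frame per second from the target video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS metadata is missing (assumption)
    step = max(int(round(fps)), 1)               # keep every `step`-th decoded frame
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                 # BGR ndarray for this sampled video frame
        index += 1
    capture.release()
    return frames


# Hypothetical usage: frames = extract_frames_at_1fps("target_video.mp4")
```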
Optionally, the media information may further include at least one audio frame corresponding to the target multimedia object.
In a possible implementation, the target multimedia object includes a target video. The method also comprises the following steps: acquiring the first N video frames in a target video, wherein N is an integer greater than 0; and performing text recognition processing on the first N video frames to obtain recognition text information. Wherein the text information includes the identification text information. Alternatively, N equals 1; correspondingly, the first N video frames are the first video frame of the target video, and the identification text information includes the identification text corresponding to the first video frame. Optionally, the at least one text corpus unit may include a text corpus unit in the identification text information.
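For illustration only, the text recognition processing on the first N video frames could be sketched as follows; pytesseract is used here as an assumed OCR backend standing in for the character recognition process, and is not named in the patent.

```python
import cv2
import pytesseract           # assumed OCR backend; the patent only specifies that text recognition is applied
from PIL import Image


def recognize_text_in_first_frames(frames, n: int = 1) -> str:
    """Run OCR on the first N sampled video frames and join the recognized text."""
    texts = []
    for frame in frames[:n]:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)                 # OpenCV frames are BGR
        texts.append(pytesseract.image_to_string(Image.fromarray(rgb)).strip())
    return " ".join(t for t in texts if t)
```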
Step 220, determining target characteristic information corresponding to the media information and text characteristic information corresponding to the text information.
When the text information includes at least one text corpus unit, the text feature information includes a text feature sequence corresponding to the at least one text corpus unit. The text feature sequence may include the word embedding vectors corresponding to the at least one text corpus unit.
In case the media information comprises at least one image corresponding to the target multimedia object, the target feature information comprises a sequence of visual features corresponding to the at least one image. The visual feature sequence may include visual feature vectors corresponding to the respective at least one image.
Correspondingly, as shown in fig. 3, the implementation process of the step 220 may include the following steps (221 to 224), and fig. 3 shows a second flowchart of a title generation method provided in an embodiment of the present application.
Step 221, performing visual feature extraction processing on at least one image to obtain a visual feature vector corresponding to the at least one image.
In a possible implementation manner, cross-modal feature extraction processing is performed on the at least one image to obtain the visual feature vector, and the visual feature vector can be used for characterizing the feature information of the at least one image on the text modality. Optionally, the at least one image is input into a CLIP model, and the visual feature vector corresponding to each of the at least one image is output. Optionally, the at least one image is input into an EfficientNet model, and the visual feature vectors corresponding to the at least one image are output.
In another possible implementation, the at least one image is subjected to image feature extraction processing to obtain the visual feature vector, and the visual feature vector can be used for representing image feature information corresponding to the at least one image.
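For illustration only, the first alternative above (cross-modal visual feature extraction with a CLIP model) could be sketched as follows using the open-source CLIP package; the checkpoint name and the batching are assumptions.

```python
import clip                  # open-source CLIP package, one possible backbone (assumption)
import cv2
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)      # assumed checkpoint


def extract_visual_features(frames):
    """Map each sampled video frame to a cross-modal visual feature vector."""
    images = [Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)) for f in frames]
    batch = torch.stack([preprocess(img) for img in images]).to(device)
    with torch.no_grad():
        visual_features = model.encode_image(batch)            # shape: (num_frames, 512) for ViT-B/32
    return visual_features
```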
Step 222, obtaining target characteristic information based on the visual characteristic vector.
In an exemplary embodiment, based on the visual feature vector, a visual feature sequence is obtained, and the target feature information includes the visual feature sequence. The visual feature sequence is used for characterizing feature information of the target multimedia object on the video modality.
In one possible implementation, the visual feature vectors corresponding to the at least one image are arranged in sequence to obtain the visual feature sequence.
In another possible implementation, a position feature vector and a paragraph feature vector corresponding to the visual feature vector are determined. The position feature vector is used for representing the position information of the visual feature vector in the visual feature sequence, and the paragraph feature vector is used for representing the paragraph position information of the visual feature sequence when the visual feature sequence is input into the cross-modal information processing model. The visual feature vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused visual feature vector, and the visual feature sequence is obtained accordingly.
The target feature information may further include feature information of a modality other than the visual modality in the target modality. For example, the target feature information may further include feature information of an audio modality, such as an audio feature sequence. Optionally, audio feature extraction processing is performed on at least one audio frame corresponding to the target multimedia object to obtain an audio feature vector corresponding to the at least one audio frame, and the audio feature vectors are then arranged in sequence to obtain the audio feature sequence. Alternatively, cross-modal feature extraction processing is performed on the at least one audio frame to obtain an audio feature vector corresponding to the at least one audio frame, where the audio feature vector can be used to represent the feature information corresponding to the audio frame on the text modality. Alternatively, a position feature vector and a paragraph feature vector corresponding to the audio feature vector are determined, where the position feature vector represents the position information of the audio feature vector in the audio feature sequence, and the paragraph feature vector represents the paragraph position information of the audio feature sequence when it is input into the cross-modal information processing model; the audio feature vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused audio feature vector, and the audio feature sequence is obtained accordingly.
Step 223, performing word embedding processing on at least one text corpus unit to obtain a word embedding vector corresponding to at least one text corpus unit.
Optionally, word embedding processing is performed on each text corpus unit in the text information, so as to obtain a word embedding vector corresponding to each text corpus unit. The embodiment of the present application does not limit the manner of word embedding processing.
Step 224, based on the word embedding vector, obtaining text feature information.
Optionally, a text feature sequence is obtained based on the word embedding vector. The text feature information includes a sequence of text features.
In a possible implementation manner, words corresponding to each text corpus unit are embedded into vectors and are arranged in sequence, so as to obtain the text feature sequence.
In another possible implementation, the position feature vector and the paragraph feature vector corresponding to the word embedding vector are determined. The position feature vector is used for representing the position information of the word embedding vector in the text feature sequence, and the paragraph feature vector is used for representing the paragraph position information of the text feature sequence when the text feature sequence is input into the cross-modal information processing model. The word embedding vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused word embedding vector, and the text feature sequence is obtained accordingly.
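For illustration only, the fusion of a content vector (a visual feature vector or a word embedding vector) with its position feature vector and paragraph feature vector described in steps 222 and 224 could be sketched as an element-wise sum of learned embeddings; the additive fusion, the hidden size, and the class and parameter names are assumptions, since the patent only states that the vectors are fused.

```python
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Fuse content vectors with position and paragraph embeddings (a minimal sketch)."""

    def __init__(self, hidden_size: int = 768, max_positions: int = 512, num_paragraphs: int = 2):
        super().__init__()
        self.position_embedding = nn.Embedding(max_positions, hidden_size)
        self.paragraph_embedding = nn.Embedding(num_paragraphs, hidden_size)  # e.g. 0 = visual paragraph, 1 = text paragraph

    def forward(self, content_vectors: torch.Tensor, paragraph_id: int) -> torch.Tensor:
        # content_vectors: (sequence_length, hidden_size), already projected to the model dimension
        positions = torch.arange(content_vectors.size(0), device=content_vectors.device)
        paragraphs = torch.full_like(positions, paragraph_id)
        return content_vectors + self.position_embedding(positions) + self.paragraph_embedding(paragraphs)


# Hypothetical usage: visual_sequence = FeatureFusion()(visual_vectors, paragraph_id=0)
```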
And step 230, performing cross-modal semantic analysis processing on the target characteristic information and the text characteristic information based on the cross-modal information processing model, and outputting a title text corresponding to the target multimedia object.
The cross-modal information processing model is a machine learning model obtained by training with the feature information of a multimedia sample object corresponding to the target modality and the text modality as sample data.
In a possible embodiment, the cross-modal information processing model is a cross-modal unified language model, and supports text feature information corresponding to an input text modality and target feature information corresponding to a target modality.
Optionally, the cross-modal unified language model corresponds to at least two types of self-attention mask information. The self-attention mask information is used to characterize the selection direction of the context information. The cross-modal unified language model is a language model with shared model parameters: the same set of model parameters can be used with each of the at least two types of self-attention mask information to perform different cross-modal information processing tasks.
The at least two types of self-attention mask information include first self-attention mask information and second self-attention mask information.
Optionally, the first self-attention mask information is used to characterize the selection direction of the context information as a composite direction. The composite direction refers to a combination of the context direction and the above direction. The context direction instructs the model to extract the semantic feature data of the current corpus unit according to the context information (both preceding and following) corresponding to the current corpus unit; that is, the context selection is bidirectional. The above direction instructs the model to extract the semantic feature data of the current corpus unit according to only the preceding information corresponding to the current corpus unit; that is, the context selection is unidirectional. The model can determine whether the context selection direction is the context direction or the above direction according to the position interval corresponding to the current corpus unit.
In one possible implementation, the feature corpus units in the target feature information and the text feature information correspond to a first position interval, and the text units in the title text correspond to a second position interval. If the position interval corresponding to the current corpus unit is the first position interval, that is, the current corpus unit is a feature corpus unit in the target feature information or the text feature information, the semantic feature data corresponding to the current corpus unit is determined according to the context information corresponding to the current corpus unit (i.e., the semantic feature data corresponding to the feature corpus units in the first position interval). If the position interval corresponding to the current corpus unit is the second position interval, that is, the current corpus unit is a text unit of the title text, the semantic feature data corresponding to the current corpus unit is determined according to the above information corresponding to the current corpus unit (i.e., the semantic feature data corresponding to the feature corpus units in the first position interval and the already predicted text units), so as to output the character or word corresponding to the current corpus unit.
The second self-attention mask information is used for representing that the selection direction of the context information is a context direction.
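For illustration only, the two types of self-attention mask information could be sketched as boolean attention matrices: under the first (composite-direction) mask, corpus units in the first position interval attend bidirectionally to that interval, while title text units attend to the first interval and to earlier title units only; under the second mask, context selection is fully bidirectional. The boolean convention (True means attention is allowed) is an assumption.

```python
import torch


def build_self_attention_masks(num_feature_units: int, num_title_units: int):
    """Return (first_mask, second_mask); True means the query position may attend to the key position."""
    total = num_feature_units + num_title_units

    # Second self-attention mask information: fully bidirectional context selection.
    second_mask = torch.ones(total, total, dtype=torch.bool)

    # First self-attention mask information: composite direction.
    first_mask = torch.zeros(total, total, dtype=torch.bool)
    # All positions may attend to the first position interval (the feature corpus units).
    first_mask[:, :num_feature_units] = True
    # Title text units may additionally attend to earlier title units, but not to later ones.
    title = torch.tril(torch.ones(num_title_units, num_title_units, dtype=torch.bool))
    first_mask[num_feature_units:, num_feature_units:] = title
    return first_mask, second_mask


# Hypothetical usage: first_mask, second_mask = build_self_attention_masks(20, 10)
```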
For the training process of the cross-modal information processing model, the following description of the embodiment of the model training method can be found, and will not be described in detail here.
In one example, FIG. 4 illustrates a model structure diagram of a cross-modal information processing model. In FIG. 4, paragraph 1 (S1) and paragraph 2 (S2) are input into the cross-modal information processing model 40. Paragraph 1 is a feature vector sequence composed of the visual feature vectors corresponding to at least one image of a certain multimedia object, and paragraph 2 is a feature vector sequence composed of the word embedding vectors corresponding to the text corpus units ("CLS", "I", "love", "motherland") in the title information ("I love my motherland") of the multimedia object. During model training, some or all of the feature vectors in the feature vector sequences are input so that the model learns cross-modal semantic information and performs cross-modal semantic feature alignment. The cross-modal information processing model 40 may determine different self-attention mask information according to preset rules. For example, in the pre-training process, the second self-attention matrix 41 (representing the second self-attention mask information) is selected to perform bidirectional cross-modal semantic analysis processing on paragraphs 1 and 2 to obtain the visual feature hidden vector corresponding to the at least one image and the text feature hidden vector corresponding to the title information; the cosine distance between the visual feature hidden vector and the text feature hidden vector is then determined and the corresponding loss information is computed, and when the loss information is smaller than a preset loss threshold, pre-training of the cross-modal information processing model is completed. For another example, in the formal training process, some of the corpus units in paragraphs 1 and 2 are input into the cross-modal information processing model 40, and the model may select the first self-attention matrix 42 (representing the first self-attention mask information) and perform cross-modal semantic analysis processing in the composite direction on the input features. For example, if the input text corpus units are "CLS" and "I", then, in the process of extracting the semantic feature information corresponding to each visual feature vector in paragraph 1, the cross-modal information processing model 40 performs bidirectional semantic analysis processing within paragraph 1 to obtain the semantic feature information corresponding to each visual feature vector; in the process of extracting the semantic feature information corresponding to each word embedding vector in paragraph 2, the model performs forward semantic analysis processing based on the preceding information of the current corpus unit to obtain the semantic feature information corresponding to "CLS", "I", and "love", predicts the word after "love" based on the extracted semantic feature information, and compares the semantic feature information corresponding to the predicted word with the semantic feature information corresponding to "motherland" to determine the loss information. When the loss information is smaller than a preset loss threshold, formal training of the cross-modal information processing model is completed.
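For illustration only, the pre-training objective described above for FIG. 4 (a loss based on the cosine distance between the visual feature hidden vector and the text feature hidden vector) could be computed as follows; pooling the visual hidden states by averaging and taking the CLS position for the text side are assumptions.

```python
import torch
import torch.nn.functional as F


def cross_modal_matching_loss(visual_hidden: torch.Tensor, text_hidden: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss between pooled visual and text hidden vectors (smaller = better aligned)."""
    visual_vector = visual_hidden.mean(dim=0)   # pool the hidden states of the visual positions (assumption)
    text_vector = text_hidden[0]                # hidden state at the CLS position of the title (assumption)
    cosine_similarity = F.cosine_similarity(visual_vector, text_vector, dim=0)
    return 1.0 - cosine_similarity              # cosine distance used as the pre-training loss
```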
In an exemplary embodiment, as shown in FIG. 3, the implementation of step 230 may include the following steps (231-234).
Step 231, inputting the target feature information and the text feature information into the cross-modal information processing model.
Optionally, the target feature information includes at least one first feature corpus unit, and the text feature information includes at least one second feature corpus unit. A first feature corpus unit refers to a feature corpus unit corresponding to the target modality, and a second feature corpus unit refers to a feature corpus unit corresponding to the text modality.
Optionally, the at least one first feature corpus unit includes the visual feature vector, or the fused visual feature vector, corresponding to the at least one image, and each visual feature vector may serve as an individual feature corpus unit.
Optionally, the at least one first feature corpus unit further includes the audio feature vector, or the fused audio feature vector, corresponding to the at least one audio frame, and each audio feature vector may serve as an individual feature corpus unit.
Optionally, the at least one second feature corpus unit includes the word embedding vector, or the fused word embedding vector, corresponding to the at least one text corpus unit, and each word embedding vector may serve as an individual feature corpus unit.
Correspondingly, the at least one first feature corpus unit and the at least one second feature corpus unit are input into the cross-modal information processing model.
Step 232, determining first self-attention mask information corresponding to the cross-modal information processing model.
The first self-attention mask information is used for representing that the selection direction of the contextual information corresponding to the cross-modal information processing model is a composite direction.
Optionally, a first self-attention matrix corresponding to the cross-modal information processing model is determined, and the first self-attention matrix is used for representing the first self-attention mask information.
Step 233, based on the first self-attention mask information, determining the at least one first feature corpus unit and the at least one second feature corpus unit as the context corpus units corresponding to each of the at least one first feature corpus unit and the at least one second feature corpus unit.
In a possible implementation manner, the position interval corresponding to the first feature corpus unit in the target feature information and the second feature corpus unit in the text feature information is a first position interval, and the position interval corresponding to the text unit in the title text is a second position interval. The first self-attention mask information is used for indicating that the selection direction of the context information corresponding to the feature corpus unit in the first position interval is a context direction, and indicating that the selection direction of the context information corresponding to the text unit in the second position interval is an above direction.
Accordingly, for each first feature corpus unit or second feature corpus unit, the corresponding context corpus units include the corpus units in the first position interval, that is, the at least one first feature corpus unit and the at least one second feature corpus unit.
Step 234, performing cross-modal semantic analysis processing on the context corpus units based on the cross-modal information processing model, and outputting the title text.
In an exemplary embodiment, as shown in fig. 5, the implementation process of the step 234 may include the following steps (2341 to 2345), and fig. 5 shows a flowchart three of the title generation method provided in an embodiment of the present application.
2341, performing cross-modal semantic analysis processing on the context corpus units based on the cross-modal information processing model to obtain first semantic feature data corresponding to at least one first feature corpus unit and second semantic feature data corresponding to at least one second feature corpus unit.
In the process of traversing each feature corpus unit, the cross-modal information processing model determines the position interval of the current corpus unit. If the position interval corresponding to the current corpus unit is the first position interval, that is, the current corpus unit is a feature corpus unit in the target feature information or the text feature information, the context information corresponding to the current corpus unit is determined; the context information includes the semantic feature data corresponding to the feature corpus units in the first position interval, that is, the semantic feature data of the context corpus units. The semantic feature data corresponding to the current corpus unit is then determined according to the semantic feature data of the context corpus units.
After traversing each feature corpus unit through the cross-modal information processing model, first semantic feature data corresponding to the at least one first feature corpus unit and second semantic feature data corresponding to the at least one second feature corpus unit can be obtained.
Step 2342, determining semantic feature data corresponding to the 1 st text unit in the title text based on the first semantic feature data and the second semantic feature data.
After the cross-modal information processing model traverses each feature corpus unit, the cross-modal information processing model predicts each text unit in the title text according to the semantic feature data which is extracted currently.
In the process of predicting the title text, the cross-modal information processing model determines the position interval of the current text unit. Since the position interval corresponding to the current text unit is the second position interval, the above information corresponding to the current text unit is acquired according to the indication of the first self-attention mask information. If the current text unit is the 1st text unit, the above information includes the semantic feature data corresponding to each feature corpus unit, i.e., the first semantic feature data and the second semantic feature data, and the semantic feature data corresponding to the 1st text unit is determined according to this above information.
Step 2343, according to the first self-attention mask information, determining the first semantic feature data, the second semantic feature data, and semantic feature data corresponding to a text unit before the ith text unit as context information corresponding to the ith text unit.
i is an integer greater than 1.
Step 2344, semantic feature data corresponding to the ith text unit is determined based on the context information.
If the text unit is not the 1 st text unit, the context information includes semantic feature data corresponding to each of the feature corpus units and semantic feature data corresponding to a text unit preceding the i-th text unit, so that the semantic feature data corresponding to the i-th text unit is determined according to the first semantic feature data, the second semantic feature data and the semantic feature data corresponding to the text unit preceding the i-th text unit.
Step 2345, outputting the title text according to the semantic feature data corresponding to each text unit.
In one possible embodiment, the semantic feature data corresponding to each text unit is input into a fully connected layer and a normalization (softmax) layer, and probability distribution information of each text unit over a target dictionary is output. The character or word corresponding to each text unit is then determined according to the probability distribution information, and the title text can be generated and output based on the character or word corresponding to each text unit.
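For illustration only, this output step could be sketched as follows: the semantic feature data of each title text unit is projected onto the target dictionary by a fully connected layer followed by softmax, and a character or word is selected from the resulting probability distribution. The greedy argmax selection and the class name are assumptions; the patent only requires that a character or word be determined from the probability distribution.

```python
import torch
import torch.nn as nn


class TitleOutputHead(nn.Module):
    """Map the semantic feature data of each title text unit to a word in the target dictionary."""

    def __init__(self, hidden_size: int, vocabulary: list):
        super().__init__()
        self.projection = nn.Linear(hidden_size, len(vocabulary))  # fully connected layer over the target dictionary
        self.vocabulary = vocabulary

    def forward(self, semantic_features: torch.Tensor) -> list:
        # semantic_features: (num_title_units, hidden_size)
        probabilities = torch.softmax(self.projection(semantic_features), dim=-1)
        indices = probabilities.argmax(dim=-1)                     # greedy pick per text unit (assumption)
        return [self.vocabulary[i] for i in indices]


# Hypothetical usage: title_text = "".join(TitleOutputHead(768, vocab)(features))
```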
In summary, according to the technical solution provided by the embodiment of the application, the cross-modal information processing model is trained based on the feature information of the multimedia sample object corresponding to the target modality and the text modality, so that the trained cross-modal information processing model can perform cross-modal semantic analysis processing. For a target multimedia object for which a title text needs to be generated, the feature information corresponding to the media information of the target modality in the multimedia object and the feature information corresponding to the text information in the multimedia object can be determined respectively; cross-modal semantic analysis processing is then performed on the feature information corresponding to the target modality and the text modality based on the cross-modal information processing model, and the title text of the target multimedia object is output automatically, which improves title generation efficiency and title accuracy.
Please refer to fig. 6, which illustrates a fourth flowchart of a title generation method according to an embodiment of the present application. The method can be applied to a computer device which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method can include the following steps (601-606).
Step 601, acquiring a target multimedia object.
The target multimedia object comprises media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality.
Optionally, the target multimedia object in this embodiment may be a multimedia object without a title, for example, a video without a title.
In an exemplary embodiment, the text information includes at least one text corpus unit, and the media information includes at least one image corresponding to the target multimedia object. Optionally, the text information includes content text information corresponding to a target multimedia object, and the at least one text corpus unit is a text corpus unit in the content text information.
Optionally, the content text information includes video text information, audio text information, and the like. The video text information includes identification text information, subtitle text information, voice-over text information, transcribed text information, and the like corresponding to the video. The identification text information refers to text information obtained by performing text recognition on video frames in the video, and the transcribed text information refers to text information obtained by performing speech recognition on the audio in the video. The audio text information includes transcribed text information, voice-over text information, and the like corresponding to the audio. The embodiment of the present application does not limit the text information.
Step 602, performing visual feature extraction processing on at least one image to obtain a visual feature vector corresponding to the at least one image.
In a possible implementation manner, the at least one image is subjected to a cross-modality feature extraction process to obtain the visual feature vector, and the visual feature vector can be used for characterizing feature information of the at least one image on a text modality. Optionally, at least one image is input into the CLIP model, and a visual feature vector corresponding to each of the at least one image is output.
In another possible implementation, the at least one image is subjected to image feature extraction processing to obtain the visual feature vector, and the visual feature vector can be used for representing image feature information corresponding to the at least one image.
Step 603, based on the visual feature vector, a visual feature sequence is obtained.
The target feature information includes a sequence of visual features corresponding to at least one image.
In one possible implementation, the visual feature vectors corresponding to the at least one image are arranged in sequence to obtain the visual feature sequence.
In another possible implementation, a position feature vector and a paragraph feature vector corresponding to the visual feature vector are determined. The position feature vector is used for representing the position information of the visual feature vector in the visual feature sequence, and the paragraph feature vector is used for representing the paragraph position information of the visual feature sequence when the visual feature sequence is input into the cross-modal information processing model. The visual feature vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused visual feature vector, and the visual feature sequence is obtained accordingly.
Step 604, performing word embedding processing on at least one text corpus unit to obtain a word embedding vector corresponding to the at least one text corpus unit.
Optionally, word embedding processing is performed on each text corpus unit in the text information, so as to obtain a word embedding vector corresponding to each text corpus unit. The embodiment of the present application does not limit the manner of word embedding processing.
Step 605, based on the word embedding vector, a text feature sequence is obtained.
The text characteristic information comprises a text characteristic sequence corresponding to at least one text corpus unit.
In a possible implementation manner, words corresponding to each text corpus unit are embedded into vectors and are arranged in sequence, so as to obtain the text feature sequence.
In another possible implementation manner, the position feature vector and the paragraph feature vector corresponding to the word embedding vector are determined. The position feature vector is used for representing the position information of the word embedding vector in the text feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the text feature sequence when it is input into the cross-modal information processing model. The word embedding vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused word embedding vector, so as to obtain the text feature sequence.
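Under the same assumptions (summation fusion, illustrative sizes), a corresponding sketch for the text feature sequence may look as follows.

```python
import torch
import torch.nn as nn

class TextSequenceBuilder(nn.Module):
    """Hypothetical counterpart for the text modality: each word embedding
    vector is fused with its position and paragraph (segment) feature vectors."""
    def __init__(self, vocab_size=30000, hidden_dim=768, max_positions=512,
                 num_segments=3, text_segment_id=1):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.position_emb = nn.Embedding(max_positions, hidden_dim)
        self.segment_emb = nn.Embedding(num_segments, hidden_dim)
        self.text_segment_id = text_segment_id

    def forward(self, token_ids):                          # (num_tokens,)
        n = token_ids.size(0)
        positions = torch.arange(n)
        segments = torch.full((n,), self.text_segment_id, dtype=torch.long)
        return (self.word_emb(token_ids)
                + self.position_emb(positions)
                + self.segment_emb(segments))
```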
And 606, performing cross-modal semantic analysis processing on the visual characteristic sequence and the text characteristic sequence based on the cross-modal information processing model, and outputting a title text.
In one example, as shown in fig. 7(a) and 7(b), fig. 7(a) exemplarily shows a first schematic diagram of a video cover, and fig. 7(b) exemplarily shows a second schematic diagram of a video cover. For the video cover shown in fig. 7(a), the cross-modal information processing model provided by the embodiment of the present application can output the title text "Person A is really beautiful!". For the video cover shown in fig. 7(b), the cross-modal information processing model provided in the embodiment of the present application can output the title text "Isn't today's dancing very nice?".
To sum up, according to the technical solution provided by the embodiment of the present application, for a target multimedia object that needs to generate a title text, feature information corresponding to media information of a target modality in the multimedia object and feature information corresponding to text information in the multimedia object may be respectively determined, and then cross-modality semantic analysis processing is performed on the feature information corresponding to the target modality and the text modality based on the cross-modality information processing model, so as to automatically output the title text of the target multimedia object, thereby improving title generation efficiency and title accuracy.
Please refer to fig. 8, which illustrates a fifth flowchart of a title generation method according to an embodiment of the present application. The method can be applied to a computer device which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method can include the following steps (810-830).
Step 810, obtaining the target multimedia object.
The target multimedia object comprises media information corresponding to a target mode and text information corresponding to a text mode, and the target mode refers to at least one information mode different from the text mode.
Optionally, the title length corresponding to the target multimedia object in this embodiment is greater than the upper limit threshold of the title length or less than the lower limit threshold of the title length.
In an exemplary embodiment, the text information includes original title information corresponding to the target multimedia object.
In an exemplary embodiment, the media information includes at least one image corresponding to the target multimedia object.
In step 820, a visual feature sequence corresponding to at least one image and an original header feature sequence corresponding to the original header information are determined.
And under the condition that the text information comprises original title information corresponding to the target multimedia object, the text characteristic information comprises an original title characteristic sequence corresponding to the original title information.
In case the media information comprises at least one image corresponding to the target multimedia object, the target feature information comprises a sequence of visual features corresponding to the at least one image.
There are many implementations of the determination process of the visual feature sequence, and in particular, the determination process in the above embodiment can be referred to.
In a possible implementation manner, word embedding processing is performed on the text corpus unit in the original header information to obtain a word embedding vector corresponding to the text corpus unit in the original header information.
Optionally, the words corresponding to each text corpus unit in the original header information are embedded into vectors and arranged in sequence to obtain the original header feature sequence.
Optionally, the position feature vector and the paragraph feature vector corresponding to each text corpus unit in the original header information are determined. The position feature vector is used for representing the position information of the word embedding vector in the original title feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the original title feature sequence when it is input into the cross-modal information processing model. The word embedding vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused word embedding vector, so as to obtain the original title feature sequence.
In an exemplary embodiment, the target multimedia object includes a target video, the text information further includes video text information corresponding to the target video, and the text feature information further includes a video text feature sequence corresponding to the video text information. Accordingly, as shown in fig. 9, the step 820 may be alternatively implemented by the following step 821, and fig. 9 shows a flowchart six of a title generation method provided in an embodiment of the present application.
In step 821, a visual feature sequence corresponding to at least one image, a video text feature sequence corresponding to the video text information, and an original title feature sequence corresponding to the original title information are determined.
The determination process for the above-mentioned visual feature sequence and original title feature sequence can be various, and the above description can be referred to.
In a possible implementation manner, word embedding processing is performed on a text corpus unit in video text information to obtain a word embedding vector corresponding to the text corpus unit in the video text information.
Optionally, the words corresponding to each text corpus unit in the video text information are embedded into vectors and arranged in sequence to obtain the video text feature sequence.
Optionally, a position feature vector and a paragraph feature vector corresponding to each text corpus unit in the video text information are determined. The position feature vector is used for representing the position information of the word embedding vector in the video text feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the video text feature sequence when it is input into the cross-modal information processing model. The word embedding vector is fused with the corresponding position feature vector and paragraph feature vector to obtain a fused word embedding vector, so as to obtain the video text feature sequence.
And 830, performing cross-modal semantic analysis processing on the visual feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting a title text.
The visual feature vector in the visual feature sequence is the first feature corpus unit mentioned in the above embodiment, the word embedding vector in the original header feature sequence is the second feature corpus unit mentioned in the above embodiment, and the process of performing the cross-modal semantic analysis processing by the cross-modal information processing model has been described in the above embodiment, which is not described herein again.
In the case that the target feature information includes a visual feature sequence and the text feature information includes a video text feature sequence and an original title feature sequence, as shown in fig. 9, the implementation process of step 830 includes the following step 831.
And 831, performing cross-modal semantic analysis processing on the visual feature sequence, the video text feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting a video title text corresponding to the target video.
The visual feature vector in the visual feature sequence is the first feature corpus unit mentioned in the above embodiment, the word embedding vector in the video text feature sequence and the original title feature sequence is the second feature corpus unit mentioned in the above embodiment, and the process of performing the cross-modal semantic analysis processing by the cross-modal information processing model has been described in the above embodiment, and is not described herein again.
In one example, as shown in fig. 10(a) and 10(b), fig. 10(a) exemplarily shows a first schematic diagram of a video, and fig. 10(b) exemplarily shows a second schematic diagram of a video. The original title of the video 101 shown in fig. 10(a) is "after team F hits a parked team B, star C crouches at the field and claps a stool, passionately encouraging teammates", and the title text rewritten by the cross-modal information processing model provided in the embodiment of the present application is "star C passionately encourages teammates". The original title of the video 102 shown in fig. 10(b) is "an accident during driving", and the title text rewritten by the cross-modal information processing model provided in the embodiment of the present application is "an accident during driving, what would you do".
In summary, according to the technical scheme provided by the embodiment of the application, for a target multimedia object whose title text needs to be rewritten, feature information corresponding to media information of a target modality in the multimedia object and feature information corresponding to text information in the multimedia object can be respectively determined, and then cross-modality semantic analysis processing is performed on the feature information corresponding to the target modality and the text modality based on the cross-modality information processing model, so that the title text of the target multimedia object is automatically output, and the title rewriting efficiency and the title accuracy are improved.
Please refer to fig. 11, which illustrates a seventh flowchart of a title generation method according to an embodiment of the present application. The method can be applied to a computer device which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method may include the following steps (1110-1150).
Step 1110, obtain the target multimedia object.
The target multimedia object comprises media information corresponding to a target mode and text information corresponding to a text mode, and the target mode refers to at least one information mode different from the text mode.
Step 1120, determining target characteristic information corresponding to the media information and text characteristic information corresponding to the text information.
At step 1130, a header length threshold is obtained.
The title length threshold is used for representing the upper limit of the length range of the title text. Alternatively, the title length threshold may be preset by the system, or may be set by the user. The setting manner and the value range of the header length threshold are not limited in the embodiments of the present application.
In step 1140, the header length characteristic information corresponding to the header length threshold is determined.
Optionally, the header length feature vector is determined according to the header length threshold.
Step 1150, based on the cross-modal information processing model, performing cross-modal semantic analysis processing on the header length feature information, the target feature information and the text feature information, and outputting a header text.
The length of the title text is less than or equal to the title length threshold.
In an exemplary embodiment, the target feature information includes a visual feature sequence, and the text feature information includes a text feature sequence, and the text feature sequence includes at least one of a content text feature sequence, an identification text feature sequence, an original title feature sequence, and a video text feature sequence.
Optionally, the title length feature vector is fused with each visual feature vector in the visual feature sequence (or fused a second time with the already-fused visual feature vector in the visual feature sequence) to obtain a visual feature fusion vector, and the visual feature fusion vector may serve as the first feature corpus unit; the title length feature vector is also fused with each word embedding vector in the text feature sequence (or fused a second time with the already-fused word embedding vector in the text feature sequence) to obtain a word embedding fusion vector, and the word embedding fusion vector may serve as the second feature corpus unit.
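By way of illustration, the fusion of the title length feature vector may be sketched as follows; mapping the length threshold to a learned embedding and fusing by addition are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TitleLengthFusion(nn.Module):
    """Hypothetical sketch: the title length threshold is mapped to a length
    feature vector, which is then fused (here: added) to every visual feature
    vector and every word embedding vector before they enter the model."""
    def __init__(self, max_length=64, hidden_dim=768):
        super().__init__()
        self.length_emb = nn.Embedding(max_length + 1, hidden_dim)

    def forward(self, visual_seq, text_seq, length_threshold):
        # visual_seq: (n_frames, hidden_dim), text_seq: (n_tokens, hidden_dim)
        length_vec = self.length_emb(torch.tensor(length_threshold))
        visual_fused = visual_seq + length_vec   # visual feature fusion vectors
        text_fused = text_seq + length_vec       # word embedding fusion vectors
        return visual_fused, text_fused
```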
The process of performing cross-modal semantic analysis processing on the cross-modal information processing model has been described in the above embodiments, and is not described herein again.
In summary, according to the technical scheme provided by the embodiment of the application, by obtaining the title length threshold, the length feature information corresponding to the title length threshold is fused with the feature information of the target multimedia object corresponding to the target modality and the text information, and then the cross-modality semantic analysis processing is performed on the feature information corresponding to the fused target modality and the text modality based on the cross-modality information processing model, so that the title text with the length within the title length threshold range can be automatically output, and the title generation efficiency and the title accuracy are improved.
Referring to fig. 12, a first flowchart of a model training method provided in an embodiment of the present application is shown. The method can be applied to a computer device which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method can include the following steps (1210-1260).
In step 1210, a first multimedia sample object is obtained.
Optionally, the first multimedia sample object is a multimedia object having media information of at least two modalities, including but not limited to a video object, an audio object, a graphic object, and the like. The first multimedia sample objects include untitled multimedia objects and titled multimedia objects. Optionally, the title length corresponding to the first multimedia sample object is greater than the upper limit threshold of the title length or less than the lower limit threshold of the title length.
Optionally, the first multimedia sample object includes first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality. In some possible scenarios, the target modalities include, but are not limited to, visual modalities, audio modalities, and the like.
In an exemplary embodiment, the text information includes at least one text corpus unit. Optionally, the text information includes content text information and title text information corresponding to the first multimedia sample object. Correspondingly, the text information comprises a text corpus unit in the content text information and a text corpus unit in the title text information. The text corpus unit is a character or a word.
Alternatively, the content text information includes video text information, audio text information, and the like. The video text information includes identification text information, subtitle text information, voice-over text information, transcription text information, and the like corresponding to the video. The identification text information refers to text information obtained by performing text recognition on video frames in the video, and the transcription text information refers to text information obtained by performing speech recognition on the audio in the video. The audio text information includes transcription text information, voice-over text information, and the like corresponding to the audio. The embodiment of the present application does not limit the text information.
In an exemplary embodiment, the media information includes at least one image corresponding to the first multimedia sample object. In some application scenarios, the first multimedia sample object includes at least one image. For example, if the first multimedia sample object is a video, the at least one image may be a video frame in the video. For another example, if the first multimedia sample object is a text object, the at least one image may be an image in the text object or a video frame of a video included in the text object.
In one possible implementation, the first multimedia sample object is a target sample video, and frame extraction processing is performed on the target sample video to obtain the at least one image. Optionally, the frame extraction frequency is 1 FPS (frame per second).
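A possible frame extraction sketch at 1 FPS, assuming OpenCV as the tooling, is given below.

```python
# A sketch of 1 FPS frame extraction; OpenCV is an assumed tooling choice.
import cv2

def extract_frames(video_path, target_fps=1.0):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)  # keep roughly one frame per second
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```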
Optionally, the first media information may further include at least one audio frame corresponding to the first multimedia sample object.
Step 1220, determining target feature information corresponding to the first media information and text feature information corresponding to the text information.
In an exemplary embodiment, the process of determining the target feature information corresponding to the first media information is as follows:
and performing visual feature extraction processing on at least one image to obtain a visual feature vector corresponding to the at least one image. In a possible implementation manner, the at least one image is subjected to a cross-modality feature extraction process to obtain the visual feature vector, and the visual feature vector can be used for characterizing feature information of the at least one image on a text modality. Optionally, at least one image is input into the CLIP model, and a visual feature vector corresponding to each of the at least one image is output.
In another possible implementation, the at least one image is subjected to image feature extraction processing to obtain the visual feature vector, and the visual feature vector can be used for representing image feature information corresponding to the at least one image.
And obtaining target characteristic information based on the visual characteristic vector. In an exemplary embodiment, based on the visual feature vector, a visual feature sequence is obtained, and the target feature information includes the visual feature sequence. The visual feature sequence is used for characterizing feature information of the target multimedia object on the video modality.
In one possible implementation, the visual feature vectors corresponding to the at least one image are arranged in sequence to obtain the visual feature sequence.
In another possible implementation, a position feature vector and a paragraph feature vector corresponding to the visual feature vector are determined. The position feature vector is used for representing the corresponding position information of the visual feature vector in the visual feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the visual feature sequence input cross-modal information processing model. And performing fusion processing on the visual feature vector and the corresponding position feature vector and paragraph feature vector to obtain a fused visual feature vector, and further obtaining the visual feature sequence.
The target feature information may further include feature information of a modality other than the visual modality in the target modality. For example, the target feature information may further include feature information of an audio modality, such as an audio feature sequence. Optionally, the audio feature extraction processing is performed on at least one audio frame corresponding to the first multimedia sample object to obtain an audio feature vector corresponding to the at least one audio frame, and then the audio feature vectors corresponding to the at least one audio frame are arranged in sequence to obtain the audio feature sequence. Or, performing cross-modal feature extraction processing on the at least one audio frame to obtain an audio feature vector corresponding to the at least one audio frame, where the audio feature vector can be used to represent feature information corresponding to the audio frame in a text mode. Or determining a position feature vector and a paragraph feature vector corresponding to the audio feature vector. The position feature vector is used for representing the corresponding position information of the audio feature vector in the audio feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the audio feature sequence input cross-modal information processing model. And carrying out fusion processing on the audio feature vector and the corresponding position feature vector and paragraph feature vector to obtain a fused audio feature vector, and further obtain the audio feature sequence.
In an exemplary embodiment, the process of determining text characteristic information corresponding to text information is as follows:
and performing word embedding processing on at least one text corpus unit to obtain a word embedding vector corresponding to at least one text corpus unit. Optionally, word embedding processing is performed on each text corpus unit in the text information, so as to obtain a word embedding vector corresponding to each text corpus unit. The embodiment of the present application does not limit the manner of word embedding processing.
And obtaining text characteristic information based on the word embedding vector. Optionally, a text feature sequence is obtained based on the word embedding vector. The text feature information includes a sequence of text features.
In a possible implementation manner, words corresponding to each text corpus unit are embedded into vectors and are arranged in sequence, so as to obtain the text feature sequence.
In another possible implementation, the position feature vector and the paragraph feature vector corresponding to the word embedding vector are determined. The position feature vector is used for representing the corresponding position information of the word embedding vector in the text feature sequence, and the paragraph feature vector is used for representing the text segment position information of the text feature sequence input cross-modal information processing model. And fusing the word embedding vector with the corresponding position characteristic vector and paragraph characteristic vector to obtain a fused word embedding vector, and further obtain the text characteristic sequence.
Step 1230, a cross-modal information processing model to be trained is obtained.
In an exemplary embodiment, the cross-modal information processing model to be trained is a pre-trained machine learning model. Correspondingly, as shown in fig. 13, the implementation process of the step 1230 may include the following steps (1231 to 1236), and fig. 13 shows a flowchart ii of a model training method provided in an embodiment of the present application.
Step 1231, a second multimedia sample object is acquired.
Optionally, the second multimedia sample object includes second media information corresponding to the target modality and title information corresponding to the text modality.
Optionally, the second media information includes at least one image corresponding to the second multimedia sample object.
Step 1232, determining media characteristic information corresponding to the second media information and title characteristic information corresponding to the title information.
In a possible implementation manner, the visual feature extraction processing is performed on at least one image to obtain a visual feature vector corresponding to the at least one image. And obtaining media characteristic information based on the visual characteristic vector.
Optionally, the at least one image is subjected to cross-modality feature extraction processing to obtain the visual feature vector, and the visual feature vector can be used for characterizing feature information of the at least one image on a text modality. Optionally, at least one image is input into the CLIP model, and a visual feature vector corresponding to each of the at least one image is output.
Optionally, the at least one image is subjected to image feature extraction processing to obtain the visual feature vector, and the visual feature vector may be used to represent image feature information corresponding to the at least one image.
In an exemplary embodiment, based on the visual feature vector, a visual feature sequence is derived, and the media feature information includes the visual feature sequence. The visual characteristic sequence is used for characterizing the characteristic information of the second multimedia sample object on the visual modality.
Optionally, the visual feature vectors corresponding to the at least one image are arranged in sequence to obtain the visual feature sequence.
Optionally, a position feature vector and a paragraph feature vector corresponding to the visual feature vector are determined. The position feature vector is used for representing the corresponding position information of the visual feature vector in the visual feature sequence, and the paragraph feature vector is used for representing the text paragraph position information of the visual feature sequence input cross-modal information processing model. And fusing the visual feature vector and the corresponding position feature vector and paragraph feature vector to obtain a fused visual feature vector, and further obtain the visual feature sequence.
Step 1233, the initial cross-modal information processing model and the second self-attention mask information are obtained.
The second self-attention mask information is used for representing that the selection direction of the context information corresponding to the initial cross-modal information processing model is the context direction, that is, the context of each position is selected from both the preceding and the following positions.
Step 1234, pre-training the initial cross-modal information processing model based on the second self-attention mask information, the media feature information, and the title feature information, and outputting media semantic feature information corresponding to the media feature information and title semantic feature information corresponding to the title feature information.
And determining the selection direction of the context information corresponding to the cross-modal information processing model as a context direction under the instruction of the second self-attention mask information.
And under the condition that the selection direction of the contextual information is the contextual direction, determining the media characteristic information and the title characteristic information as the contextual information corresponding to each characteristic corpus unit in the media characteristic information or the title characteristic information.
And determining a hidden vector corresponding to each feature corpus unit based on the context information, and finally obtaining a visual feature hidden vector corresponding to each visual feature vector in the media feature information and a text feature hidden vector corresponding to each word embedding vector in the header feature information. Optionally, the word embedding vector in the header feature information includes a word embedding vector corresponding to the start position identifier.
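The self-attention mask information may be pictured as a mask matrix over the joint input sequence. The sketch below, which assumes a UniLM-style single-stream Transformer, builds the fully bidirectional mask used in this pre-training stage (the "context direction") and, for comparison, a composite-direction mask in which the input feature positions attend bidirectionally while the title positions attend only to the input and to earlier title positions.

```python
import torch

def bidirectional_mask(seq_len):
    # Context direction: every position may attend to every other position.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def composite_mask(src_len, tgt_len):
    # Composite direction (sketch): the source part (visual and input text
    # feature corpus units) is fully bidirectional, while the target part
    # (title being generated) attends to the whole source and, causally,
    # to earlier title positions only.
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :src_len] = True                             # everyone sees the source
    causal = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
    mask[src_len:, src_len:] = causal                    # title sees earlier title units
    return mask
```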
Optionally, the target visual feature hidden vector is determined based on the visual feature hidden vector corresponding to each visual feature vector in the media feature information. Optionally, the visual feature hidden vectors corresponding to the visual feature vectors are averaged to obtain the target visual feature hidden vector. The target visual feature hidden vector may be determined by the following formula (1):

$$H_v = \frac{1}{n}\sum_{i=1}^{n} H_{v_i} \qquad (1)$$

wherein $H_{v_i}$ denotes the ith visual feature hidden vector, $n$ denotes the total number of visual feature hidden vectors, and $H_v$ denotes the target visual feature hidden vector.
Optionally, the text feature hidden vector corresponding to the word embedding vector corresponding to the start position identifier in the header feature information is determined as a target text feature hidden vector.
Optionally, the media semantic feature information includes the target visual feature hidden vector, and the header semantic feature information includes the target text feature hidden vector.
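A minimal sketch of this pooling step, assuming the start position identifier sits at index 0 of the title sequence, is given below.

```python
def pool_semantic_features(visual_hidden, text_hidden):
    # visual_hidden: (n_frames, hidden_dim) hidden vectors of the visual positions
    # text_hidden:   (n_tokens, hidden_dim) hidden vectors of the title positions,
    #                with the start position identifier assumed at index 0
    target_visual = visual_hidden.mean(dim=0)   # formula (1): average over frames
    target_text = text_hidden[0]                # hidden vector of the start identifier
    return target_visual, target_text
```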
Step 1235, determining second model loss information based on the media semantic feature information and the title semantic feature information.
The second model loss information is used for representing the semantic alignment degree between the media semantic feature information and the title semantic feature information.
In one possible implementation, a cosine similarity between the target visual feature hidden vector and the target text feature hidden vector is determined, and a symmetric cross entropy used for representing the second model loss information is determined based on the cosine similarity.
Alternatively, the symmetric cross entropy can be determined by the following equation (2).
$$L = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp\!\big(\cos(H_v^{i}, H_t^{i})\big)}{\sum_{j\in\Omega}\exp\!\big(\cos(H_v^{i}, H_t^{j})\big)} + \log\frac{\exp\!\big(\cos(H_t^{i}, H_v^{i})\big)}{\sum_{j\in\Omega}\exp\!\big(\cos(H_t^{i}, H_v^{j})\big)}\right] \qquad (2)$$

wherein $L$ denotes the symmetric cross entropy, $B$ denotes the training batch size, $H_v^{i}$ denotes the target visual feature hidden vector corresponding to the ith sample, $H_t^{i}$ denotes the target text feature hidden vector corresponding to the ith sample, $H_t^{j}$ and $H_v^{j}$ denote the target text feature hidden vector and the target visual feature hidden vector corresponding to the jth sample, and $\Omega$ denotes the sample set.
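A batched sketch of equation (2) is given below; implementing the cosine similarity as a dot product of L2-normalised vectors and the optional temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(visual_vecs, text_vecs, temperature=1.0):
    # visual_vecs, text_vecs: (B, hidden_dim) target visual / text feature
    # hidden vectors of one training batch; the i-th pair is the positive pair.
    v = F.normalize(visual_vecs, dim=-1)
    t = F.normalize(text_vecs, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) cosine similarities
    labels = torch.arange(v.size(0))
    loss_v2t = F.cross_entropy(logits, labels)      # visual -> text direction
    loss_t2v = F.cross_entropy(logits.T, labels)    # text -> visual direction
    return 0.5 * (loss_v2t + loss_t2v)
```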
In another possible implementation, the semantic feature data corresponding to the start position identifier is input into a fully connected layer and a normalization layer to obtain probability distribution data, and binary classification is performed according to the probability distribution data to obtain a classification result. The classification result is used for representing the semantic alignment degree between the media semantic feature information and the title semantic feature information. Because the second self-attention mask information indicates that the selection direction of the context information is the context direction, the semantic feature data corresponding to the start position identifier can represent both the media semantic feature information and the title semantic feature information. The classification result includes an alignment identifier and a non-alignment identifier, and the second model loss information can be determined according to the respective numbers of alignment identifiers and non-alignment identifiers.
Alternatively, the second model loss information may also be determined by the following equation (3).
$$L = -\frac{1}{B}\sum_{i=1}^{B}\Big[y_i\log p_{i0} + (1-y_i)\log p_{i1}\Big] \qquad (3)$$

wherein $L$ denotes the second model loss information; $B$ denotes the training batch size; $y_i$ denotes the sample label corresponding to the ith sample pair, which indicates whether the target feature information and the text feature information in the sample pair belong to the same multimedia object (a sample label of 0 means they do not belong to the same multimedia object, and a sample label of 1 means they do); $p_{i0}$ denotes the probability distribution data that the target feature information and the text feature information in the ith sample pair are aligned; and $p_{i1}$ denotes the probability distribution data that the target feature information and the text feature information in the ith sample pair are not aligned.
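A sketch of this classification-based loss is given below; the two-logit fully connected head and the use of a standard cross-entropy (which applies the softmax normalisation internally) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def alignment_classification_loss(start_hidden, labels, classifier):
    # start_hidden: (B, hidden_dim) semantic feature data of the start position
    #               identifier for each sample pair in the batch
    # labels:       (B,) sample labels; 1 if the target and text feature
    #               information belong to the same multimedia object, else 0
    # classifier:   a fully connected layer mapping hidden_dim -> 2 logits
    logits = classifier(start_hidden)
    # softmax normalisation + cross entropy over the two classes, cf. equation (3)
    return F.cross_entropy(logits, labels)
```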
In some possible application scenarios, the task of determining whether visual features and text features are aligned is mostly used in video retrieval, where the models are basically dual-stream models, while the models used for text generation are single-modality single-stream models. A dual-stream model has two data processing paths that process visual features and text features separately. The VUniLM provided by the embodiment of the present application is a single-stream model, but its single data processing path can process feature data streams of at least two modalities, so that it can perform a cross-modal text generation task. Optionally, in the embodiment of the present application, the task of determining whether the visual features and the text features are aligned is used in the pre-training stage of the VUniLM model. In the pre-training stage, visual feature vectors and word embedding vectors are used as input, and the pre-training target is to align the semantic feature information of the two modalities processed by the model in the same feature space, so that the model can learn cross-modal semantic knowledge. This effectively improves the accuracy of the cross-modal semantic analysis processing of the VUniLM model, reduces the data processing amount of the subsequent formal training, and enables the pre-trained model to be applicable to both the title generation task and the title rewriting task, thereby improving model utilization efficiency.
And step 1236, obtaining the cross-modal information processing model to be trained under the condition that the second model loss information meets the second loss condition.
And under the condition that the symmetric cross entropy is less than or equal to a second threshold (representing that the second model loss information meets a second loss condition), obtaining the cross-modal information processing model to be trained.
Step 1240, performing model training on the cross-modal information processing model to be trained based on the target characteristic information and the text characteristic information, and outputting a caption text corresponding to the first multimedia sample object.
In an exemplary embodiment, the text information includes content text information and preset title information corresponding to the first multimedia sample object, the text feature information includes a content text feature sequence corresponding to the content text information and a preset title feature sequence corresponding to the preset title information, the media information includes at least one image corresponding to the first multimedia sample object, and the target feature information includes a visual feature sequence corresponding to the at least one image; accordingly, as shown in fig. 13, the implementation of step 1240 may include the following step 1241.
And 1241, performing model training on the cross-modal information processing model to be trained based on the visual feature sequence, the content text feature sequence and the preset title feature sequence, and outputting a title text.
In an exemplary embodiment, the text information includes original title information and rewritten title information corresponding to the first multimedia sample object, the text feature information includes an original title feature sequence corresponding to the original title information and a rewritten title feature sequence corresponding to the rewritten title information, the media information includes at least one image corresponding to the first multimedia sample object, and the target feature information includes a visual feature sequence corresponding to the at least one image; accordingly, as shown in fig. 14, the implementation process of the step 1240 may include the following step 124a, and fig. 14 shows a flowchart three of a model training method provided in an embodiment of the present application.
And step 124a, performing model training on the cross-modal information processing model to be trained based on the visual feature sequence, the original title feature sequence and the rewritten title feature sequence, and outputting a title text.
In step 1250, first model loss information is determined based on the header text and the text information.
The first model loss information is used to characterize a semantic match between the caption text and the first multimedia sample object.
In one possible embodiment, first probability distribution information corresponding to the title text and second probability distribution information corresponding to the text information are determined,
and determining cross entropy based on the first probability distribution information and the second probability distribution information, wherein the cross entropy is used for representing the first model loss information.
Alternatively, the above cross entropy may be determined by the following formula (4).
$$L = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{\exp(f_i)}{\sum_{k\in V}\exp(f_k)} \qquad (4)$$

wherein $n$ denotes the length of the predicted title text, $V$ denotes the dictionary, $f_i$ denotes the predicted probability distribution component corresponding to the ith word in the title text, and $f_k$ denotes the predicted probability distribution component corresponding to the kth word in the dictionary.
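A minimal sketch of equation (4), assuming the reference title taken from the text information is given as word indices, is shown below.

```python
import torch.nn.functional as F

def title_generation_loss(logits, target_ids):
    # logits:     (n, |V|) predicted scores over the dictionary V for each of
    #             the n words of the title text (first probability distribution)
    # target_ids: (n,) word indices taken from the reference text information
    #             (second probability distribution, as one-hot targets)
    # Softmax + cross entropy averaged over the n predicted words.
    return F.cross_entropy(logits, target_ids)
```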
Accordingly, after the step 1241, the step 1250 may be implemented as follows, including the step 1251.
Step 1251, determine the first model loss information based on the header text and the preset header information.
Accordingly, after the step 124a, the step 1250 may be implemented as the following step 125 a.
Step 125a, determining first model loss information based on the header text and the rewritten header information.
And 1260, obtaining the trained cross-modal information processing model under the condition that the first model loss information meets the first loss condition.
Optionally, when the cross entropy is less than or equal to the second threshold (representing that the first model loss information meets the first loss condition), the trained cross-modal information processing model is obtained.
In the embodiment of the present application, the model effect is tested by using two indexes, namely BLEU (Bilingual Evaluation Understudy) and accuracy. The BLEU score can be calculated by a machine, while the accuracy is given by annotators; when reviewing the results, the annotators do not strictly use the original title as the ground truth, because one video may express more than one topic. The results are shown in Tables 1 and 2 below, from which it can be seen that the introduction of pre-training improves both BLEU and accuracy.
TABLE 1 Title generation effect indexes (BLEU and accuracy; the table content is provided as an image in the original publication)
TABLE 2 Title rewriting effect indexes (BLEU and accuracy; the table content is provided as an image in the original publication)
In Tables 1 and 2, VUniLM is the above-mentioned cross-modal information processing model.
In summary, according to the technical scheme provided by the embodiment of the application, the cross-modal information processing model is pre-trained based on the feature information of the multimedia sample object corresponding to the target modality and the text modality, so that the cross-modal information processing model can align the feature information of different modalities, and then the cross-modal information processing model is formally trained based on the feature information of the multimedia sample object corresponding to the target modality and the text modality, so that the trained cross-modal information processing model can perform cross-modal semantic analysis processing, title generation or title rewriting is realized, and the model accuracy is improved. For a target multimedia object needing to generate or rewrite a title text, feature information corresponding to media information of a target mode in the multimedia object and feature information corresponding to text information in the multimedia object can be respectively determined, then cross-mode semantic analysis processing can be carried out on the feature information corresponding to the target mode and the text mode based on the trained cross-mode information processing model, the title text of the target multimedia object is automatically output, and title generation efficiency and title accuracy are improved.
Referring to fig. 15, a fourth flowchart of a model training method according to an embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the terminal 10 or the server 20 in the application program running environment shown in fig. 1. The method may include the following steps (1510-1570).
At step 1510, a first multimedia sample object is obtained.
The first multimedia sample object comprises first media information corresponding to a target mode and text information corresponding to a text mode, wherein the target mode refers to at least one information mode different from the text mode.
Step 1520, determining the target characteristic information corresponding to the first media information and the text characteristic information corresponding to the text information.
In one possible embodiment, the text information includes original title text.
In another possible embodiment, the text information includes a rewritten caption text.
In step 1530, the title length feature information corresponding to the text information is determined.
In one possible implementation, at least one title length feature vector corresponding to an original title text is determined, and a title length corresponding to the at least one title length feature vector is greater than or equal to a title length corresponding to the original title text.
In another possible implementation, at least one title length feature vector corresponding to the rewritten title text is determined, and the title length corresponding to the at least one title length feature vector is greater than or equal to the title length corresponding to the rewritten title text.
Step 1540, obtain the cross-modal information processing model to be trained.
And step 1550, performing model training on the cross-modal information processing model to be trained based on the header length characteristic information, the target characteristic information and the text characteristic information, and outputting a header text.
In an exemplary embodiment, the target feature information includes a visual feature sequence, and the text feature information includes a text feature sequence, and the text feature sequence includes at least one of a content text feature sequence, a preset title feature sequence, an original title feature sequence, and a rewritten title feature sequence.
Optionally, each title length feature vector is fused with the visual feature vectors in the visual feature sequence (or fused a second time with the already-fused visual feature vectors in the visual feature sequence) to obtain visual feature fusion vectors, which may serve as the first feature corpus units; each title length feature vector is also fused with the word embedding vectors in the text feature sequence (or fused a second time with the already-fused word embedding vectors in the text feature sequence) to obtain word embedding fusion vectors, which may serve as the second feature corpus units. This enriches the training samples: by fusing a plurality of title length vectors into the visual feature sequence and the text feature sequence, the model can learn the range information of the title length and generate a title within the specified number of words, which finally ensures that the length of the output title is less than or equal to the title length threshold.
Step 1560, based on the header text and the text information, first model loss information is determined.
The first model loss information is used to characterize a semantic match between the caption text and the first multimedia sample object.
And 1570, obtaining the trained cross-modal information processing model when the first model loss information meets the first loss condition and the length of the title text is less than or equal to the title length threshold corresponding to the title length characteristic information.
In summary, according to the technical scheme provided by the embodiment of the application, different header length characteristic information is constructed, and the header length characteristic information is introduced into the model training process, so that the trained cross-modal information processing model can control the length of the output header text within a set range, and the header generation quality is improved.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 16, a block diagram of a title generation apparatus according to an embodiment of the present application is shown. The device has the function of realizing the title generation method, and the function can be realized by hardware or by hardware executing corresponding software. The device can be a computer device and can also be arranged in the computer device. The apparatus 1600 may include: an object acquisition module 1610, a feature determination module 1620, and a title output module 1630.
An object obtaining module 1610 is configured to obtain a target multimedia object, where the target multimedia object includes media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality.
The characteristic determining module 1620 is configured to determine target characteristic information corresponding to the media information and text characteristic information corresponding to the text information.
A title output module 1630, configured to perform cross-modal semantic analysis processing on the target feature information and the text feature information based on the cross-modal information processing model, and output a title text corresponding to the target multimedia object.
The cross-modal information processing model is a machine learning model obtained by training multimedia sample objects in the target mode and the characteristic information corresponding to the text mode as sample data.
In an exemplary embodiment, the text information includes at least one text corpus unit, the text feature information includes a text feature sequence corresponding to the at least one text corpus unit, the media information includes at least one image corresponding to the target multimedia object, and the target feature information includes a visual feature sequence corresponding to the at least one image;
the title output module 1630 is further configured to:
and performing cross-modal semantic analysis processing on the visual feature sequence and the text feature sequence based on the cross-modal information processing model, and outputting the title text.
In an exemplary embodiment, the text information includes original title information corresponding to the target multimedia object, the text feature information includes an original title feature sequence corresponding to the original title information, the media information includes at least one image corresponding to the target multimedia object, and the target feature information includes a visual feature sequence corresponding to the at least one image;
the title output module 1630 is further configured to:
and performing cross-modal semantic analysis processing on the visual feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting the title text.
In an exemplary embodiment, the target multimedia object includes a target video, the text information further includes video text information corresponding to the target video, and the text feature information further includes a video text feature sequence corresponding to the video text information;
the title output module 1630 is specifically configured to:
and performing cross-modal semantic analysis processing on the visual feature sequence, the video text feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting a video title text corresponding to the target video.
In an exemplary embodiment, the title output module 1630 includes: the device comprises a characteristic information input unit, a mask information determining unit, a context corpus determining unit and a title text output unit.
A feature information input unit, configured to input the target feature information and the text feature information into the cross-modal information processing model, where the target feature information includes at least one first feature corpus unit, and the text feature information includes at least one second feature corpus unit.
The mask information determining unit is used for determining first self-attention mask information corresponding to the cross-modal information processing model, and the first self-attention mask information is used for representing that the selection direction of the context information corresponding to the cross-modal information processing model is a composite direction.
A context corpus determining unit, configured to determine, based on the first self-attention mask information, the at least one first feature corpus unit and the at least one second feature corpus unit as context corpus units corresponding to the at least one first feature corpus unit or the at least one second feature corpus unit.
And the title text output unit is used for performing cross-modal semantic analysis processing on the context corpus unit based on the cross-modal information processing model and outputting the title text.
In an exemplary embodiment, the caption text output unit includes: a semantic feature data determining subunit, a context information determining subunit and a title text output subunit.
And the semantic feature data determining subunit is configured to perform cross-modal semantic analysis processing on the context corpus unit based on the cross-modal information processing model to obtain first semantic feature data corresponding to the at least one first feature corpus unit and second semantic feature data corresponding to the at least one second feature corpus unit.
The semantic feature data determining subunit is further configured to determine semantic feature data corresponding to the 1 st text unit in the title text based on the first semantic feature data and the second semantic feature data.
A contextual information determination subunit, configured to determine, according to the first self-attention mask information, the first semantic feature data, the second semantic feature data, and semantic feature data corresponding to a text unit before an ith text unit as contextual information corresponding to the ith text unit, where i is an integer greater than 1.
The semantic feature data determining subunit is further configured to determine semantic feature data corresponding to the ith text unit based on the context information.
And the title text output subunit is configured to output the title text after the semantic feature data corresponding to each text unit is determined.
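By way of illustration, the text-unit-by-text-unit generation performed by these subunits may be sketched as a greedy decoding loop; the model interface below is hypothetical and not part of the embodiment.

```python
import torch

def generate_title(model, fused_input_seq, bos_id, eos_id, max_len=32):
    """Hypothetical greedy decoding sketch: the ith text unit is predicted from
    the target/text feature corpus units plus the semantic feature data of the
    text units generated before it (composite-direction self-attention mask)."""
    generated = [bos_id]
    for _ in range(max_len):
        # `model` is assumed to take the fused cross-modal input sequence and
        # the title units generated so far, apply the composite self-attention
        # mask internally, and return scores over the dictionary for the next unit.
        scores = model(fused_input_seq, torch.tensor(generated))
        next_id = int(scores.argmax(dim=-1))
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated[1:]  # drop the start position identifier
```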
In an exemplary embodiment, the target multimedia object comprises a target video, and the apparatus 1600 further comprises: the device comprises a video frame acquisition module and a text recognition module.
And the video frame acquisition module is used for acquiring the first N video frames in the target video, wherein N is an integer greater than 0.
And the text recognition module is used for performing text recognition processing on the first N video frames to obtain recognition text information, and the text information comprises the recognition text information.
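A sketch of such text recognition processing on the first N video frames, assuming pytesseract as the OCR engine and an illustrative language setting, is given below.

```python
import pytesseract
from PIL import Image

def recognize_text(frames, n=5, lang="chi_sim"):
    # frames: RGB numpy arrays of the video frames (convert from BGR first if
    # they come from OpenCV); only the first N frames are processed.
    texts = []
    for frame in frames[:n]:
        text = pytesseract.image_to_string(Image.fromarray(frame), lang=lang)
        if text.strip():
            texts.append(text.strip())
    return "\n".join(texts)  # recognition text information
```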
In an exemplary embodiment, the text information includes at least one text corpus unit, the media information includes at least one image corresponding to the target multimedia object, and the feature determination module 1620 includes: the device comprises a visual characteristic determining unit, a target characteristic determining unit, a word embedding characteristic determining unit and a text characteristic determining unit.
And the visual feature determining unit is used for performing visual feature extraction processing on the at least one image to obtain a visual feature vector corresponding to the at least one image.
The target feature determining unit is used for obtaining the target feature information based on the visual feature vector;
and the word embedding characteristic determining unit is used for carrying out word embedding processing on the at least one text corpus unit to obtain a word embedding vector corresponding to the at least one text corpus unit.
And the text characteristic determining unit is used for obtaining the text characteristic information based on the word embedding vector.
In an exemplary embodiment, the apparatus 1600 further comprises: a title length obtaining module and a length characteristic determining module.
And the title length acquisition module is used for acquiring the title length threshold.
And the length characteristic determining module is used for determining the title length characteristic information corresponding to the title length threshold.
The title output module 1630 is further configured to:
based on the cross-modal information processing model, performing cross-modal semantic analysis processing on the title length feature information, the target feature information and the text feature information, and outputting the title text, wherein the length of the title text is less than or equal to the title length threshold.
In summary, according to the technical scheme provided by the embodiment of the application, the cross-modal information processing model is trained based on the feature information of the multimedia sample object corresponding to the target modality and the text modality, so that the trained cross-modal information processing model can perform cross-modal semantic analysis processing. For a target multimedia object needing to generate a title text, feature information corresponding to media information of a target mode in the multimedia object and feature information corresponding to text information in the multimedia object can be respectively determined, then cross-mode semantic analysis processing is carried out on the feature information corresponding to the target mode and the text mode based on the cross-mode information processing model, the title text of the target multimedia object is automatically output, and title generation efficiency and title accuracy are improved.
Referring to fig. 17, a block diagram of a model training apparatus provided in an embodiment of the present application is shown. The apparatus has the function of implementing the above model training method, and the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may be a computer device, or may be disposed in a computer device. The apparatus 1700 may include: a sample object acquisition module 1710, a feature determination module 1720, a model acquisition module 1730, a model training module 1740, a loss information determination module 1750, and a model determination module 1760.
A sample object acquisition module 1710, configured to acquire a first multimedia sample object, where the first multimedia sample object includes first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
a feature determination module 1720, configured to determine target feature information corresponding to the first media information and text feature information corresponding to the text information;
a model acquisition module 1730, configured to acquire a cross-modal information processing model to be trained;
a model training module 1740, configured to perform model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and output a title text corresponding to the first multimedia sample object;
a loss information determination module 1750, configured to determine first model loss information based on the title text and the text information, where the first model loss information is used to characterize a semantic matching degree between the title text and the first multimedia sample object; and
a model determination module 1760, configured to obtain a trained cross-modal information processing model when the first model loss information meets a first loss condition.
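A minimal sketch of one fine-tuning step carried out by the model training and loss information determination modules is given below, assuming teacher forcing and a token-level cross-entropy loss as the concrete form of the first model loss information; the model's call signature is hypothetical.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, target_features, text_features, title_token_ids):
    """One fine-tuning step of the cross-modal model (interfaces assumed).

    `title_token_ids` is a LongTensor holding the supervision title drawn from the
    text information; the cross-entropy below stands in for the first model loss
    information, i.e. how well the generated title matches that supervision.
    """
    model.train()
    optimizer.zero_grad()
    # Teacher forcing: predict token t from tokens < t plus the cross-modal inputs.
    logits = model(target_features, text_features, title_token_ids[:-1])  # (len-1, vocab)
    loss = nn.functional.cross_entropy(logits, title_token_ids[1:])
    loss.backward()
    optimizer.step()
    return loss.item()
```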
In an exemplary embodiment, the text information includes content text information and preset title information corresponding to the first multimedia sample object, the text feature information includes a content text feature sequence corresponding to the content text information and a preset title feature sequence corresponding to the preset title information, the media information includes at least one image corresponding to the first multimedia sample object, and the target feature information includes a visual feature sequence corresponding to the at least one image.
The model training module 1740 is further configured to perform model training on the cross-modal information processing model to be trained based on the visual feature sequence, the content text feature sequence and the preset title feature sequence, and output the title text.
The loss information determination module 1750 is further configured to determine the first model loss information based on the title text and the preset title information.
In an exemplary embodiment, the text information includes original title information and rewritten title information corresponding to the first multimedia sample object, the text feature information includes an original title feature sequence corresponding to the original title information and a rewritten title feature sequence corresponding to the rewritten title information, the media information includes at least one image corresponding to the first multimedia sample object, and the target feature information includes a visual feature sequence corresponding to the at least one image.
The model training module 1740 is further configured to perform model training on the cross-modal information processing model to be trained based on the visual feature sequence, the original title feature sequence and the rewritten title feature sequence, and output the title text.
The loss information determination module 1750 is further configured to determine the first model loss information based on the title text and the rewritten title information.
In an exemplary embodiment, the model acquisition module 1730 includes a sample object acquisition unit, a feature information determining unit, a mask information acquisition unit, a model pre-training unit, a loss information determining unit and a model determining unit.
The sample object acquisition unit is configured to acquire a second multimedia sample object, where the second multimedia sample object includes second media information corresponding to the target modality and title information corresponding to the text modality.
The feature information determining unit is configured to determine media feature information corresponding to the second media information and title feature information corresponding to the title information.
The mask information acquisition unit is configured to acquire an initial cross-modal information processing model and second self-attention mask information, where the second self-attention mask information indicates that the context information selection direction corresponding to the initial cross-modal information processing model is a context direction.
The model pre-training unit is configured to pre-train the initial cross-modal information processing model based on the second self-attention mask information, the media feature information and the title feature information, and output media semantic feature information corresponding to the media feature information and title semantic feature information corresponding to the title feature information.
The loss information determining unit is configured to determine second model loss information based on the media semantic feature information and the title semantic feature information, where the second model loss information is used to characterize the degree of semantic alignment between the media semantic feature information and the title semantic feature information.
The model determining unit is configured to obtain the cross-modal information processing model to be trained when the second model loss information meets a second loss condition.
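The second model loss information characterizes the degree of semantic alignment between the media semantics and the title semantics. A symmetric contrastive objective over pooled features is one common way to realize such a loss and is shown below purely as an assumed example, not as the loss disclosed by the embodiment.

```python
import torch
import torch.nn.functional as F


def alignment_loss(media_semantic: torch.Tensor, title_semantic: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of (media, title) pairs.

    media_semantic, title_semantic: (batch, dim) pooled semantic features produced
    by the bidirectionally masked pre-training pass. The contrastive form is an
    assumed instance of a loss measuring the degree of semantic alignment.
    """
    media = F.normalize(media_semantic, dim=-1)
    title = F.normalize(title_semantic, dim=-1)
    logits = media @ title.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(media.size(0), device=media.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```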
In an exemplary embodiment, the apparatus 1700 further comprises a length feature determining module.
The length feature determining module is configured to determine title length feature information corresponding to the text information.
The model training module 1740 is further configured to perform model training on the cross-modal information processing model to be trained based on the title length feature information, the target feature information and the text feature information, and output the title text.
The model determination module 1760 is further configured to obtain the trained cross-modal information processing model when the first model loss information meets the first loss condition and the length of the title text is less than or equal to the title length threshold corresponding to the title length feature information.
In summary, according to the technical solution provided by the embodiments of the application, the cross-modal information processing model is trained on feature information of multimedia sample objects corresponding to the target modality and the text modality, so that the trained model can perform cross-modal semantic analysis processing. For a target multimedia object whose title text needs to be generated, the feature information corresponding to the media information of the target modality and the feature information corresponding to the text information in the object are determined separately, cross-modal semantic analysis processing is then performed on them by the cross-modal information processing model, and the title text of the target multimedia object is output automatically, which improves both title generation efficiency and title accuracy.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the above functional modules is only given by way of example; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept, and their specific implementation processes are described in detail in the method embodiments and are not repeated here.
Referring to fig. 18, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for executing the above-described title generation method or the above-described model training method. Specifically:
The computer device 1800 includes a Central Processing Unit (CPU) 1801, a system memory 1804 including a Random Access Memory (RAM) 1802 and a Read-Only Memory (ROM) 1803, and a system bus 1805 that couples the system memory 1804 to the central processing unit 1801. The computer device 1800 also includes a basic input/output (I/O) system 1806, which facilitates the transfer of information between components within the computer, and a mass storage device 1807 for storing an operating system 1813, application programs 1814, and other program modules 1815.
The basic input/output system 1806 includes a display 1808 for displaying information and an input device 1809, such as a mouse or keyboard, for a user to input information. The display 1808 and the input device 1809 are both coupled to the central processing unit 1801 through an input/output controller 1810 coupled to the system bus 1805. The basic input/output system 1806 may also include the input/output controller 1810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1807 is connected to the central processing unit 1801 through a mass storage controller (not shown) connected to the system bus 1805. The mass storage device 1807 and its associated computer-readable media provide non-volatile storage for the computer device 1800. That is, the mass storage device 1807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1804 and mass storage device 1807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1800 may also operate by being connected, through a network such as the Internet, to remote computers on the network. That is, the computer device 1800 may be connected to the network 1812 through the network interface unit 1811 coupled to the system bus 1805, or the network interface unit 1811 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores a computer program that is configured to be executed by one or more processors to implement the above-described title generation method or the above-described model training method.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which when executed by a processor, implements the above-described title generation method, or the above-described model training method.
Optionally, the computer-readable storage medium may include: ROM (Read Only Memory), RAM (Random Access Memory), SSD (Solid State drive), or optical disc. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the title generation method or the model training method.
It should be understood that "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that both A and B exist, or that B exists alone. The character "/" generally indicates an "or" relationship between the associated objects. In addition, the step numbers described herein only show one possible execution order of the steps by way of example; in some other embodiments, the steps may be executed out of the numbered order, for example, two steps with different numbers may be executed simultaneously or in an order opposite to that shown in the figures, which is not limited in the embodiments of the present application.
In addition, the specific implementations of the present application involve data related to user information. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
The above description is only exemplary of the application and should not be taken as limiting the application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the application should be included in the protection scope of the application.

Claims (15)

1. A title generation method, comprising:
acquiring a target multimedia object, wherein the target multimedia object comprises media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
determining target feature information corresponding to the media information and text feature information corresponding to the text information;
performing cross-modal semantic analysis processing on the target feature information and the text feature information based on a cross-modal information processing model, and outputting a title text corresponding to the target multimedia object;
wherein the cross-modal information processing model is a machine learning model obtained through training with feature information of multimedia sample objects corresponding to the target modality and the text modality as sample data.
2. The method according to claim 1, wherein the text information includes at least one text corpus unit, the text feature information includes a text feature sequence corresponding to the at least one text corpus unit, the media information includes at least one image corresponding to the target multimedia object, and the target feature information includes a visual feature sequence corresponding to the at least one image;
the cross-modal semantic analysis processing is performed on the target feature information and the text feature information based on the cross-modal information processing model, and a title text corresponding to the target multimedia object is output, including:
and performing cross-modal semantic analysis processing on the visual feature sequence and the text feature sequence based on the cross-modal information processing model, and outputting the title text.
3. The method of claim 1, wherein the text information comprises original header information corresponding to the target multimedia object, the text feature information comprises an original header feature sequence corresponding to the original header information, the media information comprises at least one image corresponding to the target multimedia object, and the target feature information comprises a visual feature sequence corresponding to the at least one image;
the cross-modal semantic analysis processing is performed on the target feature information and the text feature information based on the cross-modal information processing model, and a title text corresponding to the target multimedia object is output, including:
and performing cross-modal semantic analysis processing on the visual feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting the title text.
4. The method of claim 3, wherein the target multimedia object comprises a target video, the text information further comprises video text information corresponding to the target video, and the text feature information further comprises a video text feature sequence corresponding to the video text information;
the cross-modal semantic analysis processing is performed on the target feature information and the text feature information based on the cross-modal information processing model, and a title text corresponding to the target multimedia object is output, including:
and performing cross-modal semantic analysis processing on the visual feature sequence, the video text feature sequence and the original title feature sequence based on the cross-modal information processing model, and outputting a video title text corresponding to the target video.
5. The method according to any one of claims 1 to 4, wherein the performing cross-modal semantic analysis processing on the target feature information and the text feature information based on the cross-modal information processing model to output a title text corresponding to the target multimedia object comprises:
inputting the target feature information and the text feature information into the cross-modal information processing model, wherein the target feature information comprises at least one first feature corpus unit, and the text feature information comprises at least one second feature corpus unit;
determining first self-attention mask information corresponding to the cross-modal information processing model, wherein the first self-attention mask information is used for representing that a context information selection direction corresponding to the cross-modal information processing model is a composite direction;
determining the at least one first feature corpus unit and the at least one second feature corpus unit as context corpus units corresponding to the at least one first feature corpus unit or the at least one second feature corpus unit based on the first self-attention mask information;
and performing cross-modal semantic analysis processing on the context corpus unit based on the cross-modal information processing model, and outputting the title text.
6. The method according to claim 5, wherein said cross-modal semantic analysis processing the context corpus unit based on the cross-modal information processing model to output the title text comprises:
performing cross-modal semantic analysis processing on the context corpus unit based on the cross-modal information processing model to obtain first semantic feature data corresponding to the at least one first feature corpus unit and second semantic feature data corresponding to the at least one second feature corpus unit;
determining semantic feature data corresponding to the 1st text unit in the title text based on the first semantic feature data and the second semantic feature data;
according to the first self-attention mask information, determining the first semantic feature data, the second semantic feature data and semantic feature data corresponding to a text unit before an ith text unit as context information corresponding to the ith text unit, wherein i is an integer greater than 1;
determining semantic feature data corresponding to the ith text unit based on the contextual information;
and outputting the title text according to the semantic feature data corresponding to each text unit.
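Claims 5 and 6 describe a composite context-selection direction: the feature corpus units of both modalities serve as bidirectional context for one another, while each title text unit attends only to the inputs and to the text units generated before it. A minimal sketch of such a self-attention mask is given below; the block sizes and the boolean convention are illustrative assumptions, not the claimed implementation.

```python
import torch


def composite_self_attention_mask(num_input_units: int, num_title_units: int) -> torch.Tensor:
    """Boolean mask where True means "may attend".

    - Feature corpus units (both modalities) attend to all feature corpus units.
    - Each title text unit attends to all feature corpus units and to the title
      text units before it, matching the per-unit context described in claim 6.
    """
    total = num_input_units + num_title_units
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:num_input_units, :num_input_units] = True          # bidirectional input block
    mask[num_input_units:, :num_input_units] = True          # title units see the whole input
    causal = torch.tril(torch.ones(num_title_units, num_title_units, dtype=torch.bool))
    mask[num_input_units:, num_input_units:] = causal        # causal block over the title
    return mask


# With 3 feature corpus units and 2 title units: rows 0-2 attend to positions 0-2 only,
# row 3 attends to positions 0-3, and row 4 attends to positions 0-4.
```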
7. The method of any of claims 1 to 3, wherein the target multimedia object comprises a target video, the method further comprising:
acquiring the first N video frames in the target video, wherein N is an integer greater than 0;
and performing text recognition processing on the first N video frames to obtain recognition text information, wherein the text information comprises the recognition text information.
8. The method of claim 1, further comprising:
acquiring a title length threshold;
determining title length feature information corresponding to the title length threshold;
the cross-modal semantic analysis processing is performed on the target feature information and the text feature information based on the cross-modal information processing model, and a title text corresponding to the target multimedia object is output, including:
based on the cross-modal information processing model, performing cross-modal semantic analysis processing on the title length feature information, the target feature information and the text feature information, and outputting the title text, wherein the length of the title text is less than or equal to the title length threshold.
9. A method of model training, the method comprising:
acquiring a first multimedia sample object, wherein the first multimedia sample object comprises first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
determining target feature information corresponding to the first media information and text feature information corresponding to the text information;
acquiring a cross-modal information processing model to be trained;
performing model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and outputting a title text corresponding to the first multimedia sample object;
determining first model loss information based on the title text and the text information, the first model loss information being used to characterize a semantic matching degree between the title text and the first multimedia sample object;
and under the condition that the first model loss information meets a first loss condition, obtaining a trained cross-modal information processing model.
10. The method of claim 9, wherein the text information comprises content text information and preset title information corresponding to the first multimedia sample object, the text feature information comprises a content text feature sequence corresponding to the content text information and a preset title feature sequence corresponding to the preset title information, the media information comprises at least one image corresponding to the first multimedia sample object, and the target feature information comprises a visual feature sequence corresponding to the at least one image;
performing model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and outputting a title text corresponding to the first multimedia sample object, including:
performing model training on the cross-modal information processing model to be trained based on the visual feature sequence, the content text feature sequence and the preset title feature sequence, and outputting the title text;
the determining first model loss information based on the title text and the text information comprises:
determining the first model loss information based on the title text and the preset title information.
11. The method of claim 9, wherein the text information comprises original title information and rewritten title information corresponding to the first multimedia sample object, the text feature information comprises an original title feature sequence corresponding to the original title information and a rewritten title feature sequence corresponding to the rewritten title information, the media information comprises at least one image corresponding to the first multimedia sample object, and the target feature information comprises a visual feature sequence corresponding to the at least one image;
performing model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and outputting a title text corresponding to the first multimedia sample object, including:
performing model training on the cross-modal information processing model to be trained based on the visual feature sequence, the original title feature sequence and the rewritten title feature sequence, and outputting the title text;
the determining first model loss information based on the title text and the text information comprises:
determining the first model loss information based on the title text and the rewritten title information.
12. The method according to any one of claims 9 to 11, wherein the obtaining of the cross-modal information processing model to be trained comprises:
acquiring a second multimedia sample object, wherein the second multimedia sample object comprises second media information corresponding to the target modality and title information corresponding to the text modality;
determining media feature information corresponding to the second media information and title feature information corresponding to the title information;
acquiring an initial cross-modal information processing model and second self-attention mask information, wherein the second self-attention mask information is used for representing that the selection direction of context information corresponding to the initial cross-modal information processing model is a context direction;
pre-training the initial cross-modal information processing model based on the second self-attention mask information, the media feature information and the title feature information, and outputting media semantic feature information corresponding to the media feature information and title semantic feature information corresponding to the title feature information;
determining second model loss information based on the media semantic feature information and the title semantic feature information, wherein the second model loss information is used for representing the semantic alignment degree between the media semantic feature information and the title semantic feature information;
and under the condition that the second model loss information accords with a second loss condition, obtaining the cross-modal information processing model to be trained.
13. The method according to any one of claims 9 to 11, further comprising:
determining title length feature information corresponding to the text information;
performing model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and outputting a title text corresponding to the first multimedia sample object, including:
performing model training on the cross-modal information processing model to be trained based on the title length feature information, the target feature information and the text feature information, and outputting the title text;
the obtaining a trained cross-modal information processing model under the condition that the first model loss information meets a first loss condition comprises:
obtaining the trained cross-modal information processing model under the condition that the first model loss information meets the first loss condition and the length of the title text is less than or equal to a title length threshold corresponding to the title length feature information.
14. A title generation apparatus, characterized in that the apparatus comprises:
an object acquisition module, configured to acquire a target multimedia object, wherein the target multimedia object comprises media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
a feature determination module, configured to determine target feature information corresponding to the media information and text feature information corresponding to the text information;
a title output module, configured to perform cross-modal semantic analysis processing on the target feature information and the text feature information based on a cross-modal information processing model, and output a title text corresponding to the target multimedia object;
wherein the cross-modal information processing model is a machine learning model obtained through training with feature information of multimedia sample objects corresponding to the target modality and the text modality as sample data.
15. A model training apparatus, the apparatus comprising:
a sample object acquisition module, configured to acquire a first multimedia sample object, wherein the first multimedia sample object comprises first media information corresponding to a target modality and text information corresponding to a text modality, and the target modality refers to at least one information modality different from the text modality;
a feature determination module, configured to determine target feature information corresponding to the first media information and text feature information corresponding to the text information;
a model acquisition module, configured to acquire a cross-modal information processing model to be trained;
a title output module, configured to perform model training on the cross-modal information processing model to be trained based on the target feature information and the text feature information, and output a title text corresponding to the first multimedia sample object;
a loss information determination module, configured to determine first model loss information based on the title text and the text information, wherein the first model loss information is used to characterize a semantic matching degree between the title text and the first multimedia sample object; and
a model determination module, configured to obtain a trained cross-modal information processing model under the condition that the first model loss information meets a first loss condition.
CN202210271572.8A 2022-03-18 2022-03-18 Title generation method, model training method and device Pending CN114611498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210271572.8A CN114611498A (en) 2022-03-18 2022-03-18 Title generation method, model training method and device

Publications (1)

Publication Number Publication Date
CN114611498A true CN114611498A (en) 2022-06-10

Family

ID=81864922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210271572.8A Pending CN114611498A (en) 2022-03-18 2022-03-18 Title generation method, model training method and device

Country Status (1)

Country Link
CN (1) CN114611498A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929094A (en) * 2019-11-20 2020-03-27 北京香侬慧语科技有限责任公司 Video title processing method and device
US20210303921A1 (en) * 2020-03-30 2021-09-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Cross-modality processing method and apparatus, and computer storage medium
CN114168777A (en) * 2020-09-10 2022-03-11 阿里巴巴集团控股有限公司 Image data processing method and device, storage medium and processor
CN113392639A (en) * 2020-09-30 2021-09-14 腾讯科技(深圳)有限公司 Title generation method and device based on artificial intelligence and server
CN112800254A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Multi-modal video title generation method and device, storage medium and storage equipment
CN113139575A (en) * 2021-03-18 2021-07-20 杭州电子科技大学 Image title generation method based on conditional embedding pre-training language model
CN113591902A (en) * 2021-06-11 2021-11-02 中国科学院自动化研究所 Cross-modal understanding and generating method and device based on multi-modal pre-training model
CN113378552A (en) * 2021-07-06 2021-09-10 焦点科技股份有限公司 Commodity title generation method based on multi-mode GPT2 model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DONG et al.: "Unified Language Model Pre-training for Natural Language Understanding and Generation", arXiv:1905.03197v3, pages 1-14 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438654A (en) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and device, storage medium and electronic equipment
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN117094367A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117609550A (en) * 2024-01-17 2024-02-27 腾讯科技(深圳)有限公司 Video title generation method and training method of video title generation model
CN117609550B (en) * 2024-01-17 2024-05-28 腾讯科技(深圳)有限公司 Video title generation method and training method of video title generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination