CN117473359A - Training method and related device of abstract generation model - Google Patents

Training method and related device of abstract generation model

Info

Publication number
CN117473359A
Authority
CN
China
Prior art keywords
sample
vector
abstract
generation model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311178879.4A
Other languages
Chinese (zh)
Inventor
梁云龙
孟凡东
徐金安
陈钰枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Beijing Jiaotong University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd and Beijing Jiaotong University
Priority to CN202311178879.4A
Publication of CN117473359A
Legal status: Pending

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 - Pattern recognition
            • G06F 18/20 - Analysing
              • G06F 18/22 - Matching criteria, e.g. proximity measures
              • G06F 18/24 - Classification techniques
              • G06F 18/25 - Fusion techniques
          • G06F 40/00 - Handling natural language data
            • G06F 40/20 - Natural language analysis
              • G06F 40/279 - Recognition of textual entities
                • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
                • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 - Computing arrangements based on biological models
            • G06N 3/02 - Neural networks
              • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a training method and a related device for an abstract generation model. The method comprises the following steps: the initial generation model comprises an encoder, a fusion device and a decoder; a first sample text, a first sample image and a first sample abstract in a first batch of samples are input into the encoder, which encodes them and outputs a first text vector, a first image vector, a first object vector and a first abstract word segmentation vector. The first text vector and the first image vector are input into the fusion device, which performs cross-modal fusion and outputs a first fusion vector; the first fusion vector and a first representation vector of the first sample abstract are input into the decoder, which decodes them and outputs a first probability density. Model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors. The method improves the abstract effect of the abstract generation model.

Description

Training method and related device of abstract generation model
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method for a summary generation model and a related device.
Background
Currently, users can browse multi-modal content such as text and images on content browsing platforms. To help users quickly and conveniently grasp the main information of such multi-modal content, an abstract corresponding to the text and the image can be generated through an abstract generation model.
In the related art, the abstract generation model is trained as follows: a training sample formed by a sample text, a sample image and a sample abstract is input into an initial generation model to output a probability density corresponding to the sample abstract, and the initial generation model is trained by maximizing this probability density to obtain the abstract generation model.
However, this training method trains the initial generation model only by maximizing the probability density; it learns only the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, and learns no other effective association relationships, so the abstract effect and the abstract quality of the abstract generation model are poor.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a training method and a related device for an abstract generation model, which enable the abstract generation model to generate more relevant multilingual abstracts corresponding to texts and images, so as to improve the generation accuracy of the abstract generation model and further improve the abstract effect and the abstract quality of the abstract generation model.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a training method of a summary generation model, where the method includes:
encoding a first sample text, a first sample image corresponding to the first sample text and a first sample abstract in a first batch of samples through an encoder in an initial generation model to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image and a first abstract word segmentation vector of the first sample abstract;
performing cross-modal fusion on the first text vector and the first image vector through a fusion device in the initial generation model to obtain a first fusion vector;
decoding the first fusion vector and the first representation vector of the first sample abstract through a decoder in the initial generation model to obtain a first probability density corresponding to the first sample abstract;
training model parameters of the initial generation model to obtain the abstract generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors; the plurality of second abstract word segmentation vectors are obtained by encoding, through the encoder in the initial generation model, a plurality of second sample abstracts corresponding to a plurality of second sample texts in the first batch of samples that are different from the first sample text.
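As an illustrative sketch only (the aspect above does not prescribe a concrete loss; the InfoNCE-style contrastive form, the cosine similarity, the temperature and the loss weighting below are assumptions made here), the two training signals could be combined as follows:

```python
import torch
import torch.nn.functional as F

def first_aspect_loss(log_p_summary, object_vec, pos_summary_vec, neg_summary_vecs,
                      temperature=0.1, contrastive_weight=1.0):
    """Hypothetical combination of the two training signals of the first aspect.

    log_p_summary:    log-probability of the first sample abstract (the first probability density).
    object_vec:       first object vector of the first sample image, shape (d,).
    pos_summary_vec:  first abstract word segmentation vector, shape (d,).
    neg_summary_vecs: second abstract word segmentation vectors from the other samples, shape (n, d).
    """
    # Generation term: maximizing the first probability density = minimizing its negative log.
    gen_loss = -log_p_summary

    # Cosine similarities between the image object vector and the positive / negative abstract vectors.
    pos_sim = F.cosine_similarity(object_vec, pos_summary_vec, dim=0) / temperature
    neg_sim = F.cosine_similarity(object_vec.unsqueeze(0), neg_summary_vecs, dim=1) / temperature

    # InfoNCE-style term: pull the matching image/abstract pair together, push the others apart.
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim], dim=0)
    contrastive_loss = -F.log_softmax(logits, dim=0)[0]

    return gen_loss + contrastive_weight * contrastive_loss
```

In this reading, the second sample abstracts of the other samples in the first batch act as negatives, which is consistent with the later statement that the image-abstract association can be learned through contrastive learning without constructing additional abstract images.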
In another aspect, an embodiment of the present application provides a training method of an abstract generation model, the method comprising:
encoding a third sample text and a third sample image corresponding to the third sample text through an encoder in an initial generation model to obtain a third text vector of the third sample text and a third image vector of the third sample image;
performing cross-modal fusion on the third text vector and the third image vector through a fusion device in the initial generation model to obtain a third fusion vector;
decoding, by the decoder in the initial generation model, the third fusion vector and a third representation vector of a third sample digest corresponding to the third sample text, and a fourth representation vector of a fourth sample digest corresponding to the third sample text, to obtain a third decoding vector and a third probability density corresponding to the third sample digest, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages;
and training model parameters of the initial generation model according to maximizing the third probability density, maximizing the fourth probability density and maximizing the third similarity between the third decoding vector and the fourth decoding vector, so as to obtain the abstract generation model.
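Expressed as a loss to be minimized (the weighting $\lambda$ and the choice of similarity function are assumptions made here for illustration; the aspect above only states that the three quantities are maximized), this objective can be read as:

$$\mathcal{L} = -\log p\!\left(S_3 \mid X_3, V_3\right) - \log p\!\left(S_4 \mid X_3, V_3\right) - \lambda\,\mathrm{sim}\!\left(D_3, D_4\right)$$

where $X_3$ and $V_3$ denote the third sample text and the third sample image, $S_3$ and $S_4$ denote the third and fourth sample abstracts (in the same language as and in a different language from $X_3$, respectively), $D_3$ and $D_4$ denote the third and fourth decoding vectors, and $\mathrm{sim}(\cdot,\cdot)$ is a similarity measure such as cosine similarity; minimizing $\mathcal{L}$ maximizes the third probability density, the fourth probability density and the third similarity.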
In another aspect, an embodiment of the present application provides a training device for an abstract generation model, where the device includes: a first coding unit, a first fusion unit, a first decoding unit and a first training unit;
the first coding unit is configured to code, by using an encoder in an initial generation model, a first sample text in a first batch of samples, a first sample image corresponding to the first sample text, and a first sample abstract, so as to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image, and a first abstract word segmentation vector of the first sample abstract;
the first fusion unit is used for performing cross-modal fusion on the first text vector and the first image vector through the fusion device in the initial generation model to obtain a first fusion vector;
The first decoding unit is configured to decode, by using a decoder in the initial generation model, the first fusion vector and a first representation vector of the first sample digest, to obtain a first probability density corresponding to the first sample digest;
the first training unit is configured to train model parameters of the initial generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors, so as to obtain the abstract generation model; the plurality of second abstract word vectors are obtained by encoding a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples through an encoder in the initial generation model.
In another aspect, an embodiment of the present application provides a training device for an abstract generation model, where the device includes: a second coding unit, a second fusion unit, a second decoding unit and a second training unit;
the second coding unit is configured to code a third sample text and a third sample image corresponding to the third sample text through an encoder in an initial generation model, so as to obtain a third text vector of the third sample text and a third image vector of the third sample image;
The second fusion unit is configured to perform cross-modal fusion on the third text vector and the third image vector through the fusion device in the initial generation model, so as to obtain a third fusion vector;
the second decoding unit is configured to decode, by using a decoder in the initial generation model, the third fusion vector and a third representation vector of a third sample digest corresponding to the third sample text, and a fourth representation vector of a fourth sample digest corresponding to the third sample text, to obtain a third decoding vector and a third probability density corresponding to the third sample digest, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages;
the second training unit is configured to train model parameters of the initial generation model according to maximizing the third probability density, maximizing the fourth probability density, and maximizing a third similarity between the third decoding vector and the fourth decoding vector, so as to obtain the abstract generation model.
In another aspect, embodiments of the present application provide a computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of the preceding aspects according to instructions in the computer program.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, which when run on a computer device, causes the computer device to perform the method of any one of the preceding aspects.
In another aspect, embodiments of the present application provide a computer program product comprising a computer program which, when run on a computer device, causes the computer device to perform the method of any of the preceding aspects.
According to the technical scheme, firstly, a first sample text in a first batch of samples and a first sample image corresponding to the first sample text are input into an encoder in an initial generation model to be encoded, and a first text vector of the first sample text and a first image vector of the first sample image are output; and outputting a first object vector of the first sample image, inputting a first sample abstract corresponding to the first sample text into an encoder in an initial generation model for encoding, and outputting a first abstract word segmentation vector of the first sample abstract. According to the method, on the basis of respectively encoding a first sample text and a first sample image to obtain a first text vector and a first image vector, the association relationship between the first sample image and a first sample abstract is considered, a first object vector of a first sample object in the first sample image is further obtained, and the first sample abstract is further encoded to obtain a first abstract word segmentation vector of a first abstract word in the first sample abstract.
Then, inputting the first text vector and the first image vector into a fusion device in an initial generation model to perform cross-modal fusion, and outputting a first fusion vector; and inputting the first fusion vector and a first representation vector of the first sample abstract into a decoder in an initial generation model for decoding, and outputting a first probability density corresponding to the first sample abstract. Based on a first text vector and a first image vector, the method takes the association relation between the multi-mode content formed by the first sample text and the first sample image and the first sample abstract into consideration, fuses the first text vector and the first image vector to obtain a first fused vector, and decodes the first fused vector into a first probability density corresponding to the first sample abstract by combining a first representation vector of the first sample abstract.
Finally, a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples are input into the encoder in the initial generation model for encoding, and a plurality of second abstract word segmentation vectors corresponding to the plurality of second sample abstracts are output; model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the first probability density, maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors. In this way, the initial generation model is trained along a direction that learns the association relationship between the multi-modal content formed by the first sample text and the first sample image and the first sample abstract, pulls the first sample image closer to the first sample abstract, and pushes the first sample image away from the plurality of second sample abstracts, so as to obtain the abstract generation model.
Based on this, the training method not only learns the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, but also effectively learns the association relationship between the sample image and the sample abstract through contrastive learning, without having to construct abstract images, so that the abstract generation model can effectively capture images that are more relevant to the abstract, generate more relevant abstracts corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model.
In addition, as can be seen from the above further technical solution, first, a third sample text and a third sample image corresponding to the third sample text are input into the encoder in the initial generation model for encoding, and a third text vector of the third sample text and a third image vector of the third sample image are output. In this method, the association relationships between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract in the same language and the fourth sample abstract in a different language are considered, and the third sample text and the third sample image are respectively encoded to obtain the third text vector and the third image vector.
Then, the third text vector and the third image vector are input into the fusion device in the initial generation model for cross-modal fusion, and a third fusion vector is output; the third fusion vector, a third representation vector of a third sample abstract corresponding to the third sample text, and a fourth representation vector of a fourth sample abstract corresponding to the third sample text are input into the decoder in the initial generation model for decoding, and a third decoding vector and a third probability density corresponding to the third sample abstract and a fourth decoding vector and a fourth probability density corresponding to the fourth sample abstract are output; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages. According to the method, the third fusion vector is obtained by fusing the third text vector and the third image vector; combined with the third representation vector of the third sample abstract, it is decoded into the third probability density corresponding to the third sample abstract, and combined with the fourth representation vector of the fourth sample abstract, it is decoded into the fourth probability density corresponding to the fourth sample abstract. On the basis of considering the association relationship between the third sample abstract and the fourth sample abstract of different languages corresponding to the third sample text, the third decoding vector corresponding to the third sample abstract and the fourth decoding vector corresponding to the fourth sample abstract are further obtained.
And finally, training model parameters of the initial generation model to obtain a summary generation model by maximizing the third probability density, maximizing the fourth probability density and maximizing the third similarity between the third decoding vector and the fourth decoding vector. According to the method, the initial generation model can be trained according to the training direction of the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract in the same language, the association relationship between the multi-modal content formed by the third sample text and the third sample image and the fourth sample abstract in different languages and the association relationship between the third sample abstract and the fourth sample abstract.
Based on this, the training method not only learns the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstracts in the same language and in different languages, but also learns, through mutual distillation, the association relationship between the sample abstracts of different languages corresponding to the same sample text, so that the abstract generation model can effectively capture the shared information of abstracts in different languages corresponding to the same text, generate more relevant multilingual abstracts corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
Fig. 1 is a schematic diagram of a system architecture of a training method of a summary generation model according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of a summary generation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a first image embedding vector of a first sample image according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of obtaining a first probability density by an initial generation model of a first sample text, a first sample image and a first sample abstract according to an embodiment of the present application;
fig. 5 is a schematic diagram of a first object vector and a first abstract word segmentation vector obtained by an initial generation model of a first sample image and a first sample abstract according to an embodiment of the application;
FIG. 6 is a flowchart of another training method of the abstract generating model according to the embodiment of the application;
FIG. 7 is a schematic diagram of a first loss and a second loss provided in an embodiment of the present application;
FIG. 8 is a block diagram of a training device for a summary generation model according to an embodiment of the present application;
FIG. 9 is a block diagram of another training device for a summary generation model according to an embodiment of the present application;
fig. 10 is a block diagram of a server according to an embodiment of the present application;
fig. 11 is a block diagram of a terminal according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
At present, in order to facilitate a user to quickly and conveniently understand the main information of multi-modal content such as text and images, abstracts corresponding to the text and the images can be generated through an abstract generation model. The abstract generation model is trained as follows: a training sample formed by a sample text, a sample image and a sample abstract is input into an initial generation model to output a probability density corresponding to the sample abstract, and the initial generation model is trained by maximizing this probability density to obtain the abstract generation model.
However, according to the training method, the initial generation model is trained only by maximizing the probability density, only the association relation between the multi-mode content formed by the sample text and the sample image and the sample abstract is learned, and other effective association relations are not learned, so that the abstract effect and the abstract quality of the abstract generation model are poor.
The embodiment of the application provides a training method of an abstract generation model, which not only learns the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, but also effectively learns the association relationship between the sample image and the sample abstract through contrastive learning, without having to construct abstract images, so that the abstract generation model can effectively capture images that are more relevant to the abstract, generate more relevant abstracts corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model.
The embodiment of the application provides another training method of an abstract generation model, which not only learns the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstracts in the same language and in different languages, but also effectively learns, through mutual distillation learning, the association relationship between the sample abstracts of different languages corresponding to the same sample text, so that the abstract generation model can effectively capture the shared information of abstracts in different languages corresponding to the same text, generate more relevant multilingual abstracts corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model.
Next, a system architecture of a training method of the digest generation model will be described. Referring to fig. 1, fig. 1 is a schematic system architecture diagram of a training method of a summary generation model according to an embodiment of the present application, where the system architecture includes a server 100, and the server 100 is used for training the summary generation model.
The training method comprises the following steps:
the server 100 encodes a first sample text in the first batch of samples, a first sample image corresponding to the first sample text, and a first sample digest by using an encoder in the initial generation model, so as to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image, and a first digest word segmentation vector of the first sample digest. As an example, the first sample text is sample text 1, the first sample image is sample image 1, and the first sample digest is sample digest 1; the server 100 inputs the sample text 1 and the sample image 1 into an encoder in the initial generation model to encode, and outputs a first text vector of the sample text 1 as the text vector 1 and a first image vector of the sample image 1 as the image vector 1; and the first object vector of the sample image 1 is output as the object vector 1, the sample abstract 1 is input into an encoder in an initial generation model for encoding, and the first abstract word vector of the sample abstract 1 is output as the abstract word vector 1.
The server 100 performs cross-modal fusion on the first text vector and the first image vector through a fusion device in the initial generation model to obtain a first fusion vector. As an example, based on the above example, the server 100 inputs the text vector 1 and the image vector 1 into the fusion device in the initial generation model to perform cross-modal fusion, and outputs the first fusion vector as the fusion vector 1.
The server 100 decodes the first fusion vector and the first representation vector of the first sample digest by a decoder in the initial generation model to obtain a first probability density corresponding to the first sample digest. As an example, the first representative vector is representative vector 1; based on the above example, the server 100 inputs the fusion vector 1 and the representation vector 1 of the sample digest 1 into the decoder in the initial generation model to decode, and outputs the first probability density corresponding to the sample digest 1 as the probability density 1.
The server 100 trains model parameters of the initial generation model according to maximizing the first probability density, maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors, so as to obtain the abstract generation model; the plurality of second abstract word segmentation vectors are obtained by encoding, through the encoder in the initial generation model, a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples. As an example, the second sample text is sample text 2 and the second sample abstract is sample abstract 2; on the basis of the above example, the server 100 inputs a plurality of sample abstracts 2 corresponding to a plurality of sample texts 2 different from sample text 1 in the first batch of samples into the encoder in the initial generation model for encoding, and outputs the plurality of second abstract word segmentation vectors corresponding to the plurality of sample abstracts 2 as a plurality of abstract word vectors 2; the server 100 then trains model parameters of the initial generation model to obtain the abstract generation model by maximizing probability density 1, maximizing the first similarity between object vector 1 and abstract word vector 1, and minimizing the second similarities between object vector 1 and the plurality of abstract word vectors 2.
That is, the training method not only learns the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, but also can learn the association relationship between the sample image and the sample abstract effectively through contrast learning without constructing the abstract image, so that the abstract generating model can capture the image more relevant to the abstract effectively, generate the abstract which corresponds to the text and the image and is more relevant, and improve the generating accuracy of the abstract generating model, thereby improving the abstract effect and the abstract quality of the abstract generating model.
Another training method is as follows:
the server 100 encodes the third sample text and the third sample image corresponding to the third sample text by using an encoder in the initial generation model, and obtains a third text vector of the third sample text and a third image vector of the third sample image. As an example, the third sample text is sample text 3 and the third sample image is sample image 3; the server 100 inputs the 3-sample text and the sample image 3 into the encoder in the initial generation model to encode, outputs a third text vector of the sample text 3 as the text vector 3 and a third image vector of the sample image 3 as the image vector 3.
The server 100 performs cross-modal fusion on the third text vector and the third image vector through the fusion device in the initial generation model, and obtains a third fusion vector. As an example, based on the above example, the server 100 inputs the text vector 3 and the image vector 3 into the fusion device in the initial generation model to perform cross-modal fusion, and outputs a third fusion vector as the fusion vector 3.
The server 100 decodes the third fused vector and the third expression vector of the third sample digest corresponding to the third sample text and the fourth expression vector of the fourth sample digest corresponding to the third sample text through a decoder in the initial generation model to obtain a third decoded vector and a third probability density corresponding to the third sample digest and a fourth decoded vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages. As an example, the third sample digest is sample digest 3, the third representative vector is representative vector 3, the fourth sample digest is sample digest 4, and the fourth representative vector is representative vector 4; based on the above example, the server 100 inputs the fusion vector 3 and the representation vector 3 of the sample digest 3, and the representation vector 4 of the sample digest 4 into the decoder in the initial generation model to be decoded, outputs the third decoded vector corresponding to the sample digest 3 as the decoded vector 3, the third probability density corresponding to the sample digest 3 as the probability density 3, the fourth decoded vector corresponding to the sample digest 4 as the decoded vector 4, and the fourth probability density corresponding to the sample digest 4 as the probability density 4.
The server 100 trains model parameters of the initial generation model according to the maximized third probability density, the maximized fourth probability density, and the maximized third similarity between the third decoding vector and the fourth decoding vector, and obtains a summary generation model. As an example, based on the above example, the server 100 trains model parameters of the initial generation model to obtain the digest generation model by maximizing the probability density 3, maximizing the probability density 4, and maximizing the third similarity between the decoding vector 3 and the decoding vector 4.
That is, the training method not only learns the association relationship between the multi-modal content and the sample abstract formed by the sample text and the sample image in the same language and in different languages, but also learns the association relationship between the sample abstract in different languages corresponding to the same sample text through mutual distillation, so that the abstract generation model can effectively capture the sharing information of the abstract in different languages corresponding to the same text, generate a more relevant multi-language abstract corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model.
It should be noted that, in the embodiment of the present application, training the initial generation model to obtain the abstract generation model involves artificial intelligence. Artificial intelligence is a theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions. In the embodiments of the present application, natural language processing techniques, computer vision techniques, and machine learning/deep learning are mainly involved.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It is studying various theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Thus, the research in this field will involve natural language, i.e. language that people use daily, so it has a close relationship with the research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic questions and answers, knowledge graph techniques, and the like.
Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs image processing so that the processed images are more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision technology: pre-trained models in the vision field such as Swin Transformer, ViT, V-MoE and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional space techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like.
Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
In the embodiment of the present application, the computer device may be a server or a terminal, and the method provided in the embodiment of the present application may be executed by the terminal or the server alone or in combination with the terminal and the server. The embodiment corresponding to fig. 1 is mainly described by taking a method provided by the embodiment of the application executed by a server as an example.
In addition, when the method provided in the embodiment of the present application is separately executed by the terminal, the execution method is similar to the embodiment corresponding to fig. 1, and mainly the server is replaced by the terminal. In addition, when the method provided in the embodiments of the present application is performed by the terminal and the server in cooperation, the steps that need to be embodied on the front-end interface may be performed by the terminal, and some steps that need to be calculated in the background and do not need to be embodied on the front-end interface may be performed by the server.
The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, a vehicle-mounted terminal, an aircraft, or the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service, but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. For example, the terminal and the server may be connected by a network, which may be a wired or wireless network.
In addition, embodiments of the present application may be applied to a variety of scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, audio-visual, assisted driving, and the like.
Next, the training method of the abstract generation model provided in the embodiment of the present application will be described in detail with reference to the accompanying drawings, taking execution of the method by a server as an example. Referring to fig. 2, fig. 2 is a flowchart of a training method of an abstract generation model according to an embodiment of the application, where the method includes:
s201: and encoding the first sample text in the first batch of samples, the first sample image corresponding to the first sample text and the first sample abstract through an encoder in the initial generation model to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image and a first abstract word segmentation vector of the first sample abstract.
In the related art, the abstract generation model is trained as follows: a training sample formed by a sample text, a sample image and a sample abstract is input into an initial generation model to output a probability density corresponding to the sample abstract, and the initial generation model is trained by maximizing this probability density to obtain the abstract generation model. However, this training method trains the initial generation model only by maximizing the probability density; it learns only the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, and learns no other effective association relationships, so the abstract effect and the abstract quality of the abstract generation model are poor.
Therefore, in the embodiment of the present application, in order to solve the above problem, in addition to the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, the association relationship between the sample image and the sample abstract is further learned, so that the abstract generation model obtained through training can capture images that are more relevant to the abstract and generate more relevant abstracts corresponding to the text and the image, improving the generation accuracy of the abstract generation model and thereby improving the abstract effect and the abstract quality of the abstract generation model.
On the basis that the initial generation model comprises an encoder, a fusion device and a decoder, in order to learn the association relationship between the multi-modal content formed by sample texts and sample images and the sample abstracts, for each sample text in the first batch of samples and the sample image and sample abstract corresponding to that sample text, the first sample text, the first sample image corresponding to the first sample text and the first sample abstract are input into the encoder in the initial generation model for encoding, and a first text vector of the first sample text and a first image vector of the first sample image are output.
In order to further learn the association relationship between the sample image and the sample abstract, when the first sample image is input into the encoder in the initial generation model to be encoded, a first object vector of a first sample object in the first sample image needs to be output; in addition, the first sample abstract is input into an encoder in the initial generation model to be encoded, and a first abstract word segmentation vector of a first abstract word in the first sample abstract is output.
The step S201 is to encode the first sample text and the first sample image respectively to obtain a first text vector and a first image vector, and provide single-mode representation data for the subsequent learning of the association relationship between the multi-mode content formed by the first sample text and the first sample image and the first sample abstract; considering the association relation between the first sample image and the first sample abstract, further obtaining a first object vector of a first sample object in the first sample image, and further encoding the first sample abstract to obtain a first abstract word segmentation vector of a first abstract word in the first sample abstract; and providing single-mode representation data for effectively learning the association relationship between the first sample image and the first sample abstract through contrast learning without constructing abstract images in the follow-up process.
The encoder in the initial generation model comprises a text encoder and an image encoder; the text encoder is provided with a text embedding layer in front, the text encoder comprises an L-layer attention layer and a feedforward neural network layer, and L is a positive integer; the obtaining of the first text vector of the first sample text comprises: embedding the first text sample through a text embedding layer corresponding to a text encoder in the initial generation model to obtain a first text embedding vector of the first sample text; performing attention calculation on the first text embedded vector through an L-layer attention layer of a text encoder in the initial generation model to obtain a first text calculation vector; and carrying out vector transformation on the first text calculation vector through a feedforward neural network layer of the text encoder in the initial generation model to obtain a first text vector.
The image encoder is provided with an image embedding layer in front, and comprises an H-layer attention layer and a feedforward neural network layer, wherein H is a positive integer; the step of obtaining the first image vector of the first sample image comprises: embedding the first sample image through an image embedding layer corresponding to an image encoder in the initial generation model to obtain a first image embedding vector of the first sample image; performing attention calculation on the first image embedded vector through an H-layer attention layer of an image encoder in the initial generation model to obtain a first image calculation vector; and carrying out vector transformation on the first image calculation vector through a feedforward neural network layer of the image encoder in the initial generation model to obtain a first image vector.
Specifically, when the first sample text corresponds to a plurality of first sample images, in order for the first image embedding vectors of the plurality of first sample images to represent the order information among the plurality of first sample images and the order information among the first sample objects in each first sample image, the first image embedding vector needs to include an object content embedding vector of the first sample object in the first sample image, a detection frame position embedding vector of the first sample object in the first sample image, an image identification embedding vector of the first sample image to which the first sample object belongs, and an object identification embedding vector of the first sample object.
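As a minimal code sketch of the image branch described above (the use of a standard Transformer encoder layer, the linear projection of detection-frame coordinates, the learned identifier embeddings and all dimensions are assumptions made here; the embodiment does not fix a concrete implementation):

```python
import torch
import torch.nn as nn

class ImageEncoderSketch(nn.Module):
    """Hypothetical image encoder: four summed embeddings, then H attention/feed-forward layers."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_images=10, max_objects=50):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)                     # detection frame (RoI box) position embedding
        self.image_id_emb = nn.Embedding(max_images, d_model)     # which first sample image the object belongs to
        self.object_id_emb = nn.Embedding(max_objects, d_model)   # order of the object inside its image
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)      # H attention + feed-forward layers

    def forward(self, object_feats, boxes, image_ids, object_ids):
        # First image embedding vector = object content + box position + image id + object id embeddings.
        v = (object_feats
             + self.box_proj(boxes)
             + self.image_id_emb(image_ids)
             + self.object_id_emb(object_ids))
        # Attention calculation followed by the feed-forward transformation yields the first image vector.
        return self.layers(v)
```

The text branch would be analogous, with the text embedding layer (content plus label embeddings) replacing the four summed embeddings.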
Furthermore, the step of obtaining the first object vector and the first abstract word vector is described in detail with reference to the following description:
as an example of S201 described above, the obtaining formulas of the first text vector and the first image vector are as follows:
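The formula figures from the original filing are not reproduced in this text. Based on the symbol descriptions that follow, they can plausibly be reconstructed as below, with the intermediate symbols $H_T^l$, $H_V^h$ and the outputs $Z_T$, $Z_V$ named here for readability (the original notation may differ):

$$H_T^0 = X + E_{pe}, \qquad H_T^{l} = \mathrm{MHA}\!\left(H_T^{l-1}\right)\ (l = 1, \dots, L), \qquad Z_T = \mathrm{FFN}\!\left(H_T^{L}\right)$$

$$v_{ij} = o_{ij} + E^{box}_{ij} + E^{img}_{i} + E^{obj}_{j}, \qquad H_V^0 = V, \qquad H_V^{h} = \mathrm{MHA}\!\left(H_V^{h-1}\right)\ (h = 1, \dots, H), \qquad Z_V = \mathrm{FFN}\!\left(H_V^{H}\right)$$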
where $X$ denotes the text content embedding vector of a sample text, $E_{pe}$ denotes the text label embedding vector of that sample text, indicating its language and the positions of its text segments, and $H_T^0$ denotes the text embedding vector of that sample text; $H_T^{l-1}$ denotes the text calculation vector output by the $(l-1)$-th attention layer, $\mathrm{MHA}(\cdot)$ denotes the multi-head attention function in the $l$-th attention layer, and $H_T^{l}$ denotes the text calculation vector output by the $l$-th attention layer; $\mathrm{FFN}(\cdot)$ is the function of the feed-forward neural network in the feed-forward neural network layer, and $Z_T$ denotes the text vector of the sample text.
$o_{ij}$ denotes the object content embedding vector of the $j$-th sample object in the $i$-th sample image corresponding to the sample text, where $i$ and $j$ are positive integers with $i \le m$, $j \le n$, $m \ge 2$ and $n \ge 2$; $E^{box}_{ij}$ denotes the detection frame position embedding vector of the $j$-th sample object in the $i$-th sample image, $E^{img}_{i}$ denotes the image identification embedding vector of the $i$-th sample image to which the $j$-th sample object belongs, $E^{obj}_{j}$ denotes the object identification embedding vector of the $j$-th sample object, $v_{ij}$ denotes the object embedding vector of the $j$-th sample object in the $i$-th sample image, and $V$ denotes the image embedding vector of the sample image corresponding to the sample text; $H_V^{h-1}$ denotes the image calculation vector output by the $(h-1)$-th attention layer, $\mathrm{MHA}(\cdot)$ denotes the multi-head attention function in the $h$-th attention layer, and $H_V^{h}$ denotes the image calculation vector output by the $h$-th attention layer; $\mathrm{FFN}(\cdot)$ is the function of the feed-forward neural network in the feed-forward neural network layer, and $Z_V$ denotes the image vector of the sample image corresponding to the sample text.
Referring to fig. 3, fig. 3 is a schematic diagram of a first image embedding vector of a first sample image according to an embodiment of the present application. Object Embeddings represents the object content embedding vector of a first sample object in the first sample image, RoI Box Embeddings represents the detection frame position embedding vector of the first sample object in the first sample image, Image ID Embeddings represents the image identification embedding vector of the first sample image, Object ID Embeddings represents the object identification embedding vector of the first sample object, and Object Embeddings + RoI Box Embeddings + Image ID Embeddings + Object ID Embeddings represents the first image embedding vector.
S202: and performing cross-modal fusion on the first text vector and the first image vector through a fusion device in the initial generation model to obtain a first fusion vector.
S203: and decoding the first fusion vector and the first representation vector of the first sample abstract through a decoder in the initial generation model to obtain a first probability density corresponding to the first sample abstract.
In the embodiment of the present application, in order to learn the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, after executing S201 to obtain the first text vector of the first sample text and the first image vector of the first sample image, the first text vector and the first image vector need to be input into the fusion device in the initial generation model to perform cross-modal fusion, and the first fusion vector is output; and inputting the first fusion vector and a first representation vector of the first sample abstract into a decoder in an initial generation model for decoding, and outputting a first probability density corresponding to the first sample abstract.
Based on the first text vector and the first image vector, the S202-S203 take into account the association relationship between the multimodal content formed by the first sample text and the first sample image and the first sample abstract, fuse the first text vector and the first image vector to obtain a first fused vector, and decode the first fused vector into a first probability density corresponding to the first sample abstract in combination with a first representation vector of the first sample abstract; and providing multi-modal representation data and multi-modal generation data for the association relationship between the multi-modal content formed by the first sample text and the first sample image and the first sample abstract to be learned later.
The fusion device comprises an attention layer, an activation layer and a fusion layer; s202 may include, for example: performing cross-modal calculation on the first text vector and the first image vector through an attention layer of a fusion device in the initial generation model to obtain a first cross-modal vector; performing activation calculation on the first text vector and the first cross-modal vector through an activation layer of a fusion device in the initial generation model to obtain a first activation vector; and fusing the first text vector, the first cross-modal vector and the first activation vector through a fusion layer of a fusion device in the initial generation model to obtain a first fusion vector.
The decoder comprises an L-layer attention layer, a cross attention layer and a feedforward neural network layer, and a normalization layer is arranged behind the decoder; s203 may include, for example: performing attention calculation on the first representation vector through an L-layer attention layer of a decoder in the initial generation model to obtain a first representation calculation vector of a first sample abstract; performing cross calculation on the first fusion vector and the first representation calculation vector through a cross attention layer of a decoder in the initial generation model to obtain a first cross vector; vector transformation is carried out on the first cross vector through a feedforward neural network layer of a decoder in the initial generation model, so as to obtain a first decoding vector; and normalizing the first decoding vector by a normalization layer corresponding to the decoder in the initial generation model to obtain a first probability density.
As an example of S202 above, the formula for obtaining the first fusion vector based on S201 above is as follows:
M = CMHA(Q, K, V)

wherein, H^T represents the text vector of one sample text, H^V represents the image vector of the sample image corresponding to that sample text, W_q, W_k, W_v, W_g, W_z, b_g and b_z represent model parameters of the initial generation model, CMHA(·) represents the cross-modal attention function of the attention layer, M represents the cross-modal vector of the text vector and the image vector, Concat(·) represents the fusion function, sigmoid(·) represents the activation function, G represents the activation vector, ⊗ represents the cross product operation, and Z^{T+V} represents the fusion vector of the text vector and the image vector.
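To make the flow of S202 concrete, the following is a minimal PyTorch-style sketch of such a gated cross-modal fusion device. It is only an illustration under assumptions: the class and parameter names (CrossModalFusion, d_model, n_heads) are invented here, and the gating and fusion operations are one plausible reading of the symbol definitions above rather than the exact published formulas.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical fusion device with an attention layer, an activation layer and a fusion layer."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Attention layer: text features attend to image features (cross-modal attention, CMHA).
        self.cmha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Activation layer: produces a sigmoid gate G from the text vector and the cross-modal vector.
        self.gate = nn.Linear(2 * d_model, d_model)
        # Fusion layer: combines the text vector with the gated cross-modal vector into Z^{T+V}.
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, h_text: torch.Tensor, h_image: torch.Tensor) -> torch.Tensor:
        # h_text: (batch, T, d_model) first text vector; h_image: (batch, V, d_model) first image vector.
        m, _ = self.cmha(query=h_text, key=h_image, value=h_image)      # cross-modal vector M
        g = torch.sigmoid(self.gate(torch.cat([h_text, m], dim=-1)))    # activation vector G
        z = self.fuse(torch.cat([h_text, g * m], dim=-1))               # fusion vector Z^{T+V}
        return z

# Usage sketch: 2 samples, 10 text tokens, 5 image regions, hidden size 512.
fusion = CrossModalFusion(d_model=512)
z = fusion(torch.randn(2, 10, 512), torch.randn(2, 5, 512))             # shape (2, 10, 512)
```

In this reading, the activation vector G acts as a gate that decides, per dimension, how much of the cross-modal vector M is mixed into the text representation before the final fusion projection.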
As an example of the above S203, the first probability density obtaining formula based on the above S201 example is as follows:
wherein, S^{l−1} represents the layer-(l−1) representation calculation vector output by the (l−1)-th attention layer of the decoder, MHA(·) represents the multi-head attention function in the L-layer attention layer, S^{l} represents the layer-l representation calculation vector output by the l-th attention layer of the decoder, Z^{T+V} represents the above fusion vector, MHCA(·) represents the cross-attention function in the cross attention layer, C represents the cross vector, FFN(·) represents the function of the feedforward neural network in the feedforward neural network layer, h_dec represents the decoding vector corresponding to the sample abstract corresponding to the one sample text, W_o and b_o represent model parameters of the initial generation model, softmax(·) represents the normalization function of the normalization layer, p(y_t | X, V, y_<t) represents the probability density of the sample abstract corresponding to the one sample text, the sample abstract comprises t first abstract word segments, t is a positive integer, y_<t represents the word segmentation embedding vectors of the 1st to (t−1)-th abstract word segments in the sample abstract, and y_t represents the word segmentation embedding vector of the t-th abstract word segment in the sample abstract.
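Similarly, the decoding flow of S203 can be sketched as follows. This is a hedged illustration, not the patent's implementation: the module names, layer count, hidden size and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class SummaryDecoder(nn.Module):
    """Hypothetical decoder: L self-attention layers, a cross attention layer, a feedforward
    neural network layer, followed by a softmax normalization layer."""

    def __init__(self, d_model: int, vocab_size: int, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.out = nn.Linear(d_model, vocab_size)   # W_o and b_o before the normalization layer

    def forward(self, y_repr: torch.Tensor, z_fused: torch.Tensor) -> torch.Tensor:
        # y_repr: (batch, t, d_model) representation vector of the sample abstract.
        # z_fused: (batch, T, d_model) fusion vector Z^{T+V} produced by the fusion device.
        s = y_repr
        for attn in self.self_attn:                                   # L attention layers
            s, _ = attn(s, s, s)
        c, _ = self.cross_attn(query=s, key=z_fused, value=z_fused)   # cross vector
        d = self.ffn(c)                                               # decoding vector
        return torch.softmax(self.out(d), dim=-1)                     # probability density p(y_t | X, V, y_<t)

# Usage sketch: probability density over a 32000-word vocabulary for each abstract position.
decoder = SummaryDecoder(d_model=512, vocab_size=32000)
p = decoder(torch.randn(2, 12, 512), torch.randn(2, 10, 512))         # shape (2, 12, 32000)
```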
For the description of S201-S203, refer to fig. 4, fig. 4 is a schematic diagram of obtaining a first probability density by initially generating a model from a first sample text, a first sample image and a first sample abstract according to an embodiment of the present application; the initial generation model comprises a text encoder, an image encoder, a fusion device, a decoder and a normalization layer; firstly, a first text sample passes through a text encoder to obtain a first text vector, and a first sample image passes through an image encoder to obtain a first image vector; and then, the first text vector and the first image vector pass through a fusion device to obtain a first fusion vector, and finally, the first fusion vector and a first representation vector of the first sample abstract pass through a decoder and a normalization layer to obtain a first probability density corresponding to the first sample abstract.
Referring to fig. 5, fig. 5 is a schematic diagram of a first object vector and a first abstract word segmentation vector obtained by an initial generation model from a first sample image and a first sample abstract according to an embodiment of the application; on the basis of fig. 4, the first sample image passes through the image encoder to obtain the first object vector, and the first sample abstract passes through the text encoder to obtain the first abstract word segmentation vector.
S204: according to the maximized first probability density, the first similarity between the first object vector and the first abstract word segmentation vector is maximized, the second similarities between the first object vector and the second abstract word segmentation vectors are minimized, and model parameters of an initial generation model are trained to obtain an abstract generation model; the plurality of second abstract word vectors are obtained by encoding a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples through an encoder in the initial generation model.
In the embodiment of the application, in order to learn the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstract, the association relationship between the sample image and the sample abstract is effectively learned through contrast learning; after executing S202-S203 to obtain the first probability density corresponding to the first sample abstract, a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples also need to be input into the encoder in the initial generation model for encoding, and a plurality of second abstract word segmentation vectors corresponding to the plurality of second sample abstracts are output; the training direction of the initial generation model is: drawing closer the association relationship between the multi-modal content formed by the first sample text and the first sample image and the first sample abstract and the association relationship between the first sample image and the first sample abstract, and pushing apart the association relationship between the first sample image and the plurality of second sample abstracts corresponding to the plurality of second sample texts different from the first sample text. Based on this, model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the first probability density, maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors.
The step S204 is capable of training the initial generation model to obtain a summary generation model according to the training direction of the association relationship between the multi-mode content formed by the first sample text and the first sample image and the first sample summary and the association relationship between the first sample image and the first sample summary, and the association relationship between the first sample image and the plurality of second sample summaries; the abstract generation model can effectively capture the first sample image which is more relevant to the first sample abstract, and generate a more relevant abstract corresponding to the first sample text and the first sample image, so that the generation accuracy of the abstract generation model is improved, and the abstract effect and the abstract quality of the abstract generation model are improved.
As an example of the above S204, the model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing p(y_t | X, V, y_<t), maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors.
According to the technical scheme, firstly, a first sample text in a first batch of samples and a first sample image corresponding to the first sample text are input into an encoder in an initial generation model to be encoded, and a first text vector of the first sample text and a first image vector of the first sample image are output; and outputting a first object vector of the first sample image, inputting a first sample abstract corresponding to the first sample text into an encoder in an initial generation model for encoding, and outputting a first abstract word segmentation vector of the first sample abstract. According to the method, on the basis of respectively encoding a first sample text and a first sample image to obtain a first text vector and a first image vector, the association relationship between the first sample image and a first sample abstract is considered, a first object vector of a first sample object in the first sample image is further obtained, and the first sample abstract is further encoded to obtain a first abstract word segmentation vector of a first abstract word in the first sample abstract.
Then, inputting the first text vector and the first image vector into a fusion device in an initial generation model to perform cross-modal fusion, and outputting a first fusion vector; and inputting the first fusion vector and a first representation vector of the first sample abstract into a decoder in an initial generation model for decoding, and outputting a first probability density corresponding to the first sample abstract. Based on a first text vector and a first image vector, the method takes the association relation between the multi-mode content formed by the first sample text and the first sample image and the first sample abstract into consideration, fuses the first text vector and the first image vector to obtain a first fused vector, and decodes the first fused vector into a first probability density corresponding to the first sample abstract by combining a first representation vector of the first sample abstract.
Finally, a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples are input into the encoder in the initial generation model for encoding, and a plurality of second abstract word segmentation vectors corresponding to the plurality of second sample abstracts are output; model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the first probability density, maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors. In this way, the initial generation model can be trained to obtain the abstract generation model according to the training direction of drawing closer the association relationship between the multi-modal content formed by the first sample text and the first sample image and the first sample abstract and the association relationship between the first sample image and the first sample abstract, while pushing apart the association relationship between the first sample image and the plurality of second sample abstracts.
Based on the method, the training method not only learns the association relation between the multi-mode content formed by the sample text and the sample image and the sample abstract, but also can effectively learn the association relation between the sample image and the sample abstract through contrast learning without constructing the abstract image, so that the abstract generating model can effectively capture the image more relevant to the abstract, generate the abstract which corresponds to the text and the image and is more relevant, and improve the generating accuracy of the abstract generating model, thereby improving the abstract effect and the abstract quality of the abstract generating model.
In the above-described embodiment, in the specific implementation of S201, it is considered that the first object vector of the first sample image actually refers to the average code vector of the object code vectors of the first sample objects in the first sample image; it is necessary to encode the first sample object in the first sample image before averaging to obtain the first object vector. Based on the above, first, detecting first sample objects in first sample images to obtain a first number of first sample objects, wherein the first number refers to the product of the maximum image number of the sample images corresponding to each sample text and the maximum object number of the sample objects in each sample image; that is, when the number of objects in the first sample image is the maximum number of objects, the first number is equal to the number of objects in the first sample image, and when the number of objects in the first sample image is smaller than the maximum number of objects, the first number is larger than the number of objects in the first sample image. Then, the first number of first sample objects are input into an encoder in an initial generation model for encoding, and a first number of object encoding vectors are obtained. And finally, representing whether the first sample object is empty or not through the object coefficient corresponding to the first sample object, and combining the first number of object coefficients corresponding to the first sample object on the basis of the first number and the first number of object coding vectors, and calculating the average value to obtain the first object vector. Thus, the present application provides a possible implementation, the step of obtaining the first object vector comprises the following S1-S3 (not shown in the figures):
S1: detecting first sample objects in the first sample image to obtain a first number of first sample objects; the first number is greater than or equal to the number of objects in the first sample image.
S2: the first number of object-coded vectors are obtained by initially generating an encoder in the model, encoding a first number of first sample objects.
S3: average value calculation is carried out on a first number of object coefficients corresponding to a first number of first sample objects and a first number of object coding vectors to obtain a first object vector; the object coefficients are used to indicate whether the first sample object is empty.
The S1-S3 more uniformly represents the first object vector of the first sample image through the average value coding vector of the object coding vector of the first sample object in the first sample image; and providing object representation data of the first sample object in the first sample image for effectively learning the association relation between the first sample image and the first sample abstract through contrast learning without constructing an abstract image.
As an example of the above S1-S3, on the basis of the above S201 example, the formula for obtaining the first object vector is as follows:
h_vis = (1 / (m × n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} δ_ij · o_ij

wherein, o_ij represents the object coding vector of the j-th sample object in the i-th sample image corresponding to the one sample text, δ_ij represents the object coefficient corresponding to the j-th sample object in the i-th sample image, m represents the maximum number of images of the sample images corresponding to each sample text, n represents the maximum number of objects of the sample objects in each sample image, m × n represents the first number, and h_vis represents the object vector of the sample image corresponding to the one sample text.
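The mean-value calculation of S1-S3 can be sketched as a masked average over the padded object slots; the helper name and the choice of dividing by the first number m × n (rather than by the count of non-empty objects) are assumptions consistent with the formula above.

```python
import torch

def masked_mean(vectors: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Mean-value calculation of S1-S3: `vectors` holds the m*n object coding vectors
    (padded up to the first number) and `coeffs` holds the object coefficients
    (1.0 for a real object, 0.0 for an empty slot)."""
    # vectors: (batch, m * n, d); coeffs: (batch, m * n)
    weighted = vectors * coeffs.unsqueeze(-1)        # zero out the empty slots
    return weighted.sum(dim=1) / vectors.size(1)     # divide by the first number m * n

# Usage sketch: 2 samples, first number m * n = 6, hidden size 512.
object_vectors = torch.randn(2, 6, 512)
object_coeffs = torch.tensor([[1., 1., 1., 0., 0., 0.],
                              [1., 1., 1., 1., 1., 0.]])
h_vis = masked_mean(object_vectors, object_coeffs)   # first object vector, shape (2, 512)
```

The same helper can compute the first abstract word segmentation vector of S4-S6 below, with the word segmentation coefficients as the mask and the second number N in place of m × n.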
In the above embodiment, in the specific implementation of S201, considering that the first abstract word segmentation vector of the first sample abstract actually refers to the mean-value coding vector of the abstract word segmentation coding vectors of the first abstract word segments in the first sample abstract, the first abstract word segments in the first sample abstract need to be encoded and then averaged to obtain the first abstract word segmentation vector. Based on this, first, word segmentation is carried out on the first sample abstract to obtain a second number of first abstract word segments, wherein the second number refers to the maximum number of abstract word segments in each sample abstract; that is, when the number of abstract word segments in the first sample abstract is the maximum number, the second number is equal to the number of abstract word segments in the first sample abstract, and when the number of abstract word segments in the first sample abstract is smaller than the maximum number, the second number is larger than the number of abstract word segments in the first sample abstract. Then, the second number of first abstract word segments are input into the encoder in the initial generation model for encoding to obtain a second number of abstract word segmentation coding vectors. Finally, whether a first abstract word segment is empty is represented by the word segmentation coefficient corresponding to that first abstract word segment, and on the basis of the second number and the second number of abstract word segmentation coding vectors, the average value is calculated in combination with the second number of word segmentation coefficients corresponding to the second number of first abstract word segments to obtain the first abstract word segmentation vector. Thus, the present application provides a possible implementation manner, in which the step of obtaining the first abstract word segmentation vector includes the following S4-S6 (not shown in the figure):
S4: word segmentation is carried out on the first sample abstract, and a second number of first abstract word segmentation is obtained; the second number is greater than or equal to the number of digest segmentations in the first sample digest.
S5: and encoding the second number of first abstract-word segments by using an encoder in the initial generation model to obtain a second number of abstract-word coding vectors.
S6: average value calculation is carried out on a second number of word segmentation coefficients corresponding to a second number of first abstract word segments and a second number of abstract word segment coding vectors to obtain first abstract word segment vectors; the word segmentation coefficient is used for indicating whether the first abstract word segmentation is empty.
The S4-S6 more uniformly represents the first abstract word segmentation vector of the first sample abstract through the average value code vector of the abstract word segmentation code vectors of the first abstract word in the first sample abstract; and providing abstract word representation data of the first abstract word in the first sample abstract for effectively learning the association relation between the first sample image and the first sample abstract through contrast learning without constructing an abstract image.
As an example of the above S4-S6, on the basis of the above S201 example, the formula for obtaining the first abstract word segmentation vector is as follows:
h_sum = (1 / N) · Σ_{k=1}^{N} δ_k · s_k

wherein, s_k represents the abstract word segmentation coding vector of the k-th abstract word segment in the one sample abstract, k is a positive integer and k = 1, 2, …, N, δ_k represents the word segmentation coefficient of the k-th abstract word segment, N represents the second number, and h_sum represents the abstract word segmentation vector of the foregoing one sample abstract.
In the above embodiment, in the implementation of S204, considering maximizing the first probability density, maximizing the first similarity and minimizing the plurality of second similarities, it is necessary to construct loss functions to calculate losses so as to train the model parameters of the initial generation model to obtain the abstract generation model. Maximizing the first probability density, which draws closer the association relationship between the multi-modal content formed by the first sample text and the first sample image and the first sample abstract, represents one training direction; maximizing the first similarity and minimizing the plurality of second similarities, which draws closer the association relationship between the first sample image and the first sample abstract and pushes apart the association relationships between the first sample image and the plurality of second sample abstracts, represents another training direction. One loss function, namely the generation loss function, is constructed for maximizing the first probability density, and another loss function, namely the contrast loss function, is constructed for maximizing the first similarity and minimizing the plurality of second similarities.
Based on this, the first probability density is substituted into the generation loss function to calculate the generation loss, and the first similarity and the plurality of second similarities are substituted into the contrast loss function to calculate the contrast loss; then the model parameters of the initial generation model are trained jointly through the generation loss and the contrast loss to obtain the abstract generation model. Thus, the present application provides one possible implementation, in which the generation loss function is used to maximize the first probability density and the contrast loss function is used to maximize the first similarity and minimize the plurality of second similarities, and S204 includes the following S2041-S2043 (not shown in the figure):
S2041: and carrying out loss calculation according to the first probability density and the generation loss function to obtain the generation loss.
S2042: and carrying out loss calculation according to the first similarity, the plurality of second similarities and the contrast loss function to obtain contrast loss.
S2043: and training model parameters of the initial generation model according to the generation loss and the comparison loss to obtain the abstract generation model.
In the above S2041-S2043, considering the different training directions, the first probability density is maximized through the generation loss function, and the first similarity is maximized and the plurality of second similarities are minimized through the contrast loss function, so that the initial generation model is trained more accurately, and a more accurate abstract generation model is obtained.
As an example of the above S2041 to S2043, on the basis of the above S204 example, the generation loss and the contrast loss are as follows:
L = L_generate + L_contrastive

wherein, p(y_t | X, V, y_<t) represents the first probability density, log represents the logarithm, N represents the second number, L_generate represents the generation loss, h_vis represents the first object vector of the first sample image, h_sum represents the first abstract word segmentation vector of the first sample abstract, h_sum_j represents the second abstract word segmentation vector of the j-th second sample abstract, sim(·) represents the similarity function, τ represents the temperature coefficient, e represents the natural constant, B represents the number of second sample abstracts in the first batch of samples, L_contrastive represents the contrast loss, and L represents the total loss of the generation loss and the contrast loss.
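The exact sub-formulas for L_generate and L_contrastive are only implied by the symbol definitions above; the sketch below shows one standard reading consistent with them (a token-level negative log-likelihood for the generation loss and an InfoNCE-style contrast term with temperature τ over the B second sample abstracts). The function names and the exact forms are assumptions, not the patent's verbatim equations.

```python
import torch
import torch.nn.functional as F

def generation_loss(probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Assumed L_generate: negative log of the probability density at the gold abstract tokens.
    probs: (batch, N, vocab) output of the normalization layer; targets: (batch, N) token ids."""
    log_p = torch.log(probs.clamp_min(1e-9))
    return -log_p.gather(-1, targets.unsqueeze(-1)).mean()

def contrast_loss(h_vis: torch.Tensor, h_sum: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Assumed L_contrastive (InfoNCE style): maximize sim(first object vector, own abstract
    word segmentation vector) and minimize the similarities to the other (second) abstract
    vectors in the batch. h_vis, h_sum: (batch, d)."""
    h_vis = F.normalize(h_vis, dim=-1)
    h_sum = F.normalize(h_sum, dim=-1)
    logits = h_vis @ h_sum.t() / tau                            # sim(., .) / tau for every pair
    labels = torch.arange(h_vis.size(0), device=h_vis.device)   # diagonal pairs are positives
    return F.cross_entropy(logits, labels)

# Joint objective of S2043 (before the adjustment coefficient of S7-S8 is introduced):
# L = generation_loss(...) + contrast_loss(...)
```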
In the specific implementation of S2043, taking into account that the influence degrees of the generation loss and the comparison loss on the model training are different, model parameters of an initial generation model are trained mainly through the generation loss so that the abstract generation model learns the association relationship between the first sample text and the multi-mode content formed by the first sample image and the first sample abstract, and model parameters of the initial generation model are trained through the comparison loss in an auxiliary manner so that the abstract generation model learns the association relationship between the first sample image and the first sample abstract; the adjustment coefficient corresponding to the contrast loss is also required to be obtained so as to adjust the influence degree of the contrast loss on model training. On the basis of the generation loss and the contrast loss, the model parameters of the initial generation model are further accurately trained by combining the adjusting coefficients corresponding to the contrast loss, and the abstract generation model is obtained. Thus, the present application provides one possible implementation, S2043 includes the following S7-S8 (not shown):
S7: and obtaining an adjusting coefficient corresponding to the contrast loss.
S8: and training model parameters of the initial generation model according to the generation loss, the comparison loss and the adjustment coefficient to obtain the abstract generation model.
And S7-S8, taking the fact that the influence degree of the generation loss and the comparison loss on model training is different into consideration, and further accurately training an initial generation model through the generation loss, the comparison loss and the adjustment coefficient to obtain a further accurate abstract generation model.
As an example of the above S7 to S8, the calculation formula of the total loss L of the generated loss and the comparative loss on the basis of the above S2041 to S2043 example is as follows:
L = L_generate + β·L_contrastive

wherein, L_generate represents the generation loss, L_contrastive represents the contrast loss, and β is the adjustment coefficient corresponding to the contrast loss.
In addition, in the above embodiment, S201 to S204 are applicable not only to the first sample text and the first sample digest in the same language but also to the first sample text and the first sample digest in different languages; thus, the present application provides a possible implementation manner, where the first sample text and the first sample abstract belong to the same language or different languages.
Next, another training method of the abstract generation model provided in the embodiment of the present application will be described in detail with reference to the accompanying drawings, taking the case where the method is executed by a server as an example.
Referring to fig. 6, fig. 6 is a flowchart of another training method of a abstract generating model according to an embodiment of the application, where the method includes:
S601: and encoding the third sample text and a third sample image corresponding to the third sample text through an encoder in the initial generation model to obtain a third text vector of the third sample text and a third image vector of the third sample image.
In the embodiment of the application, on the basis that an initial generation model comprises an encoder, a fusion device and a decoder, in order to learn the association relationship between multi-modal content and a sample abstract formed by sample texts and sample images in the same language, aiming at each sample text and sample image and sample abstract corresponding to each sample text in the same language and different languages, namely, a third sample text, a third sample image, a third sample abstract and a fourth sample abstract corresponding to the third sample text, wherein the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages; first, the third sample text and the third sample image are input into an encoder in the initial generation model to be encoded, and a third text vector of the third sample text and a third image vector of the third sample image are output.
The above S601, in consideration of the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract and the fourth sample abstract in the same language and in different languages, encodes the third sample text and the third sample image respectively to obtain the third text vector and the third image vector; this provides single-modal representation data for subsequently learning the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstracts in the same language and in different languages.
The encoder in the initial generation model comprises a text encoder and an image encoder; the text encoder is provided with a text embedding layer in front, the text encoder comprises an L-layer attention layer and a feedforward neural network layer, and L is a positive integer; the obtaining step of the third text vector of the third sample text comprises: embedding a third sample text through a text embedding layer corresponding to a text encoder in the initial generation model to obtain a third text embedding vector of the third sample text; performing attention calculation on a third text embedded vector through an L-layer attention layer of a text encoder in the initial generation model to obtain a third text calculation vector; and carrying out vector transformation on the third text calculation vector through a feedforward neural network layer of the text encoder in the initial generation model to obtain a third text vector.
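A minimal sketch of such a text encoder follows, assuming the stacked self-attention and feed-forward layers can be approximated with a standard Transformer encoder; the vocabulary size, hidden size and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical text encoder of S601: a text embedding layer followed by stacked
# self-attention and feed-forward layers, approximated here with a standard
# Transformer encoder. Vocabulary size, hidden size and L = 6 are assumptions.
d_model, num_layers = 512, 6
text_embedding = nn.Embedding(32000, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

tokens = torch.randint(0, 32000, (2, 16))            # a batch of 2 sample texts, 16 tokens each
h_text = text_encoder(text_embedding(tokens))        # third text vector, shape (2, 16, 512)
```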
The image encoder is provided with an image embedding layer in front, and comprises an H-layer attention layer and a feedforward neural network layer, wherein H is a positive integer; the step of obtaining a third image vector of the third sample image comprises: embedding a third sample image through an image embedding layer corresponding to an image encoder in the initial generation model to obtain a third image embedding vector of the third sample image; performing attention calculation on a third image embedded vector through an H-layer attention layer of an image encoder in the initial generation model to obtain a third image calculation vector; and carrying out vector transformation on the third image calculation vector through a feedforward neural network layer of the image encoder in the initial generation model to obtain a third image vector.
Specifically, when the third sample text corresponds to the plurality of third sample images, in order for the third image embedding vectors of the plurality of third sample images to be capable of representing sequence information between the plurality of third sample images and sequence information between the third sample objects in each of the third sample images, the third image embedding vectors need to include an object content embedding vector of the third sample object in the third sample image, a detection frame position embedding vector of the third sample object in the third sample image, an image identification embedding vector of the third sample image to which the third sample object belongs, and an object identification embedding vector of the third sample object.
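The composition of the third image embedding vector can be sketched as follows; summing the four components, the 2048-dimensional detector feature size and the module name are assumptions, since the text above only states which embeddings the vector includes.

```python
import torch
import torch.nn as nn

class ImageObjectEmbedding(nn.Module):
    """Hypothetical image embedding layer combining the four components named above.
    Summing them, and the 2048-dim detector feature size, are assumptions."""

    def __init__(self, d_model: int, max_images: int, max_objects: int):
        super().__init__()
        self.content = nn.Linear(2048, d_model)              # object content embedding
        self.box = nn.Linear(4, d_model)                     # detection frame position embedding
        self.image_id = nn.Embedding(max_images, d_model)    # image identification embedding
        self.object_id = nn.Embedding(max_objects, d_model)  # object identification embedding

    def forward(self, feats, boxes, image_ids, object_ids):
        # feats: (batch, m*n, 2048); boxes: (batch, m*n, 4); image_ids, object_ids: (batch, m*n)
        return (self.content(feats) + self.box(boxes)
                + self.image_id(image_ids) + self.object_id(object_ids))

# Usage sketch: 2 samples, m * n = 6 object slots, hidden size 512.
embed = ImageObjectEmbedding(d_model=512, max_images=3, max_objects=10)
x = embed(torch.randn(2, 6, 2048), torch.rand(2, 6, 4),
          torch.randint(0, 3, (2, 6)), torch.randint(0, 10, (2, 6)))   # shape (2, 6, 512)
```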
As an example of the above S601, the third text vector and the third image vector are obtained through the same process as the text vector H^T and the image vector H^V described above.
S602: and performing cross-modal fusion on the third text vector and the third image vector through a fusion device in the initial generation model to obtain a third fusion vector.
S603: decoding a third fused vector, a third representing vector of a third sample abstract corresponding to a third sample text and a fourth representing vector of a fourth sample abstract corresponding to the third sample text by a decoder in an initial generation model to obtain a third decoding vector and a third probability density corresponding to the third sample abstract and a fourth decoding vector and a fourth probability density corresponding to the fourth sample abstract; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages.
Because in the related art, the training method of the abstract generation model refers to: and inputting a training sample formed by the sample text, the sample image and the sample abstract into an initial generation model to output probability density corresponding to the sample abstract, and training the initial generation model through maximizing the probability density to obtain the abstract generation model. However, according to the training method, the initial generation model is trained only by maximizing the probability density, only the association relation between the multi-mode content formed by the sample text and the sample image and the sample abstract is learned, and other effective association relations are not learned, so that the abstract effect and the abstract quality of the abstract generation model are poor.
Therefore, in the embodiment of the present application, in order to solve the above-mentioned problem, on the basis of learning the association relationship between the sample abstracts and the multi-modal content formed by the sample text and the sample image in the same language and in different languages, the association relationship between the sample abstracts in different languages corresponding to the same sample text is further learned; the abstract generation model obtained through training can effectively capture the shared information of the sample abstracts of different languages corresponding to the same sample text, and generate a more relevant multilingual abstract corresponding to the text and the image, so that the generation accuracy of the abstract generation model is improved, and the abstract effect and the abstract quality of the abstract generation model are improved.
Therefore, in the embodiment of the present application, in order to learn the association relationship between the multi-modal content and the sample abstract formed by the sample text and the sample image in the same language and in different languages, after executing S601 to obtain the third text vector of the third sample text and the third image vector of the third sample image, the third text vector and the third image vector need to be input into the fusion device in the initial generation model to perform cross-modal fusion, and the third fusion vector is output; and inputting the third fusion vector and a third expression vector of a third sample abstract corresponding to the third sample text and a fourth expression vector of a fourth sample abstract corresponding to the third sample text into a decoder in an initial generation model for decoding, and outputting a third probability density corresponding to the third sample abstract and a fourth probability density corresponding to the fourth sample abstract.
In order to further learn the association relationship between the sample summaries of different languages corresponding to the same sample text, when the third fusion vector, the third expression vector of the third sample summary corresponding to the third sample text, and the fourth expression vector of the fourth sample summary corresponding to the third sample text are input to the decoder in the initial generation model to be decoded, the third decoding vector corresponding to the third sample summary and the fourth decoding vector corresponding to the fourth sample summary need to be output.
The above S602-S603, based on the third text vector and the third image vector, take into account the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract and the fourth sample abstract in the same language and in different languages, fuse the third text vector and the third image vector to obtain the third fusion vector, decode the third fusion vector, in combination with the third representation vector of the third sample abstract, into the third probability density corresponding to the third sample abstract, and decode the third fusion vector, in combination with the fourth representation vector of the fourth sample abstract, into the fourth probability density corresponding to the fourth sample abstract, thereby providing multi-modal representation data and multi-modal generation data for subsequently learning this association relationship. In addition, in consideration of the association relationship between the third sample abstract and the fourth sample abstract of different languages corresponding to the third sample text, the third decoding vector corresponding to the third sample abstract and the fourth decoding vector corresponding to the fourth sample abstract are further obtained, providing multi-modal representation data for subsequently learning the association relationship between the third sample abstract and the fourth sample abstract of different languages corresponding to the third sample text.
The fusion device comprises an attention layer, an activation layer and a fusion layer; s602 may include, for example: performing cross-modal calculation on a third text vector and a third image vector through an attention layer of a fusion device in the initial generation model to obtain a third cross-modal vector; performing activation calculation on the third text vector and the third cross-modal vector through an activation layer of the fusion device in the initial generation model to obtain a third activation vector; and fusing the third text vector, the third cross-modal vector and the third activation vector through a fusion layer of a fusion device in the initial generation model to obtain a third fusion vector.
The decoder comprises an L-layer attention layer, a cross attention layer and a feedforward neural network layer, and a normalization layer is arranged behind the decoder; S603 may include, for example: performing attention calculation on the third representation vector through the L-layer attention layer of the decoder in the initial generation model to obtain a third representation calculation vector of the third sample abstract; performing cross calculation on the third fusion vector and the third representation calculation vector through the cross attention layer of the decoder in the initial generation model to obtain a third cross vector; performing vector transformation on the third cross vector through the feedforward neural network layer of the decoder in the initial generation model to obtain a third decoding vector; and normalizing the third decoding vector through the normalization layer corresponding to the decoder in the initial generation model to obtain the third probability density. Similarly, attention calculation is performed on the fourth representation vector through the L-layer attention layer of the decoder in the initial generation model to obtain a fourth representation calculation vector of the fourth sample abstract; cross calculation is performed on the third fusion vector and the fourth representation calculation vector through the cross attention layer of the decoder in the initial generation model to obtain a fourth cross vector; vector transformation is performed on the fourth cross vector through the feedforward neural network layer of the decoder in the initial generation model to obtain a fourth decoding vector; and the fourth decoding vector is normalized through the normalization layer corresponding to the decoder in the initial generation model to obtain the fourth probability density.
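The following sketch illustrates the point of S603 that the same decoder and the same third fusion vector are reused for both the third (same-language) and the fourth (different-language) sample abstracts, yielding a decoding vector and a probability density for each; the use of torch's TransformerDecoder and all sizes here are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of S603: the same decoder and the same third fusion vector are reused for
# the third (same-language) and the fourth (different-language) sample abstracts, producing
# a decoding vector and a probability density for each.
d_model, vocab_size = 512, 32000
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
out_proj = nn.Linear(d_model, vocab_size)              # W_o, b_o before the normalization layer

z_fused = torch.randn(2, 10, d_model)                  # third fusion vector from the fusion device
y3_repr = torch.randn(2, 12, d_model)                  # third representation vector (same language)
y4_repr = torch.randn(2, 15, d_model)                  # fourth representation vector (different language)

dec3 = decoder(tgt=y3_repr, memory=z_fused)            # third decoding vector
dec4 = decoder(tgt=y4_repr, memory=z_fused)            # fourth decoding vector
p3 = torch.softmax(out_proj(dec3), dim=-1)             # third probability density
p4 = torch.softmax(out_proj(dec4), dim=-1)             # fourth probability density
```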
As an example of the above S603, on the basis of the above S602 example, the third probability density and the fourth probability density are obtained through the same process as the above p(y_t | X, V, y_<t), and the third decoding vector and the fourth decoding vector are obtained through the same process as the above decoding vector h_dec.
S604: and training model parameters of the initial generation model according to the maximized third probability density, the maximized fourth probability density and the maximized third similarity between the third decoding vector and the fourth decoding vector to obtain the abstract generation model.
In the embodiment of the application, in order to learn the association relationship between the multi-modal content formed by the sample text and the sample image and the sample abstracts in the same language and in different languages, the association relationship between the sample abstracts of different languages corresponding to the same sample text is effectively learned through mutual distillation learning; after executing S602-S603 to obtain the third decoding vector and the third probability density corresponding to the third sample abstract, and the fourth decoding vector and the fourth probability density corresponding to the fourth sample abstract, the training direction of the initial generation model is: drawing closer the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract in the same language, the association relationship between the multi-modal content formed by the third sample text and the third sample image and the fourth sample abstract in different languages, and the association relationship between the third sample abstract and the fourth sample abstract. Based on this, model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the third probability density, maximizing the fourth probability density, and maximizing the third similarity between the third decoding vector and the fourth decoding vector.
The step S604 is capable of training the initial generation model to obtain a summary generation model according to the training direction of the association relationship between the multimodal content formed by the third sample text and the third sample image and the third sample summary in the same language, the association relationship between the multimodal content formed by the third sample text and the third sample image and the fourth sample summary in different languages, and the association relationship between the third sample summary and the fourth sample summary; the abstract generation model can effectively capture the sharing information of the third sample abstract and the fourth sample abstract of different languages corresponding to the same third sample text, and generate a more relevant multi-language abstract corresponding to the third sample text and the third sample image, so that the generation accuracy of the abstract generation model is improved, and the abstract effect and the abstract quality of the abstract generation model are improved.
According to the technical scheme, first, the third sample text and the third sample image corresponding to the third sample text are input into the encoder in the initial generation model for encoding, and the third text vector of the third sample text and the third image vector of the third sample image are output. In this method, in consideration of the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract and the fourth sample abstract in the same language and in different languages, the third sample text and the third sample image are respectively encoded to obtain the third text vector and the third image vector.
Then, the third text vector and the third image vector are input into the fusion device in the initial generation model for cross-modal fusion, and the third fusion vector is output; the third fusion vector, the third representation vector of the third sample abstract corresponding to the third sample text and the fourth representation vector of the fourth sample abstract corresponding to the third sample text are input into the decoder in the initial generation model for decoding, and the third decoding vector and the third probability density corresponding to the third sample abstract and the fourth decoding vector and the fourth probability density corresponding to the fourth sample abstract are output; wherein the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages. In this method, the third fusion vector is obtained by fusing the third text vector and the third image vector, the third fusion vector is decoded, in combination with the third representation vector of the third sample abstract, into the third probability density corresponding to the third sample abstract, and is decoded, in combination with the fourth representation vector of the fourth sample abstract, into the fourth probability density corresponding to the fourth sample abstract; on this basis, in consideration of the association relationship between the third sample abstract and the fourth sample abstract of different languages corresponding to the third sample text, the third decoding vector corresponding to the third sample abstract and the fourth decoding vector corresponding to the fourth sample abstract are further obtained.
And finally, training model parameters of the initial generation model to obtain a summary generation model by maximizing the third probability density, maximizing the fourth probability density and maximizing the third similarity between the third decoding vector and the fourth decoding vector. According to the method, the initial generation model can be trained according to the training direction of the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract in the same language, the association relationship between the multi-modal content formed by the third sample text and the third sample image and the fourth sample abstract in different languages and the association relationship between the third sample abstract and the fourth sample abstract.
Based on the method, the training method not only learns the association relation between the multi-modal content and the sample abstract formed by the sample text and the sample image in the same language and in different languages, but also learns the association relation between the sample abstract in different languages corresponding to the same sample text through mutual distillation, so that the abstract generation model can effectively capture the sharing information of the abstract in different languages corresponding to the same text, generate a more relevant multi-language abstract corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and the abstract quality of the abstract generation model. In addition, the abstract generating model can simultaneously generate abstracts of different languages corresponding to the same text, so that the generating practicability of the abstract generating model is improved.
In the above embodiment, in the specific implementation of S604, in consideration of maximizing the third probability density, maximizing the fourth probability density, and maximizing the third similarity, it is necessary to construct a loss function to calculate the loss so as to train the model parameters of the initial generation model to obtain the abstract generation model; maximizing the third probability density and maximizing the third similarity means that under the condition of pulling the association relationship between the multi-modal content formed by the third sample text and the third sample image and the third sample abstract under the same language, pulling the association relationship between the third sample abstract and the fourth sample abstract; maximizing the fourth probability density and maximizing the third similarity means that under the condition of drawing the association relationship between the multi-modal content formed by the third sample text and the third sample image and the fourth sample abstract under different languages, drawing the association relationship between the third sample abstract and the fourth sample abstract; to achieve mutual distillation learning under different circumstances, one loss function needs to be constructed for maximizing the third probability density and maximizing the third similarity, i.e. the first loss function, and another loss function needs to be constructed for maximizing the fourth probability density and maximizing the third similarity, i.e. the second loss function.
Based on this, the third probability density and the third similarity are substituted into the first loss function to calculate the first loss, and the fourth probability density and the third similarity are substituted into the second loss function to calculate the second loss; then the model parameters of the initial generation model are trained jointly through the first loss and the second loss to obtain the abstract generation model. Thus, the present application provides a possible implementation, where the first loss function is used to maximize the third probability density and maximize the third similarity, and the second loss function is used to maximize the fourth probability density and maximize the third similarity, and S604 may include, for example, the following S6041-S6043 (not shown in the figure):
S6041: and carrying out loss calculation according to the third probability density, the third similarity and the first loss function to obtain the first loss.
S6042: and carrying out loss calculation according to the fourth probability density, the third similarity and the second loss function to obtain a second loss.
S6043: and training model parameters of the initial generation model according to the first loss and the second loss to obtain the abstract generation model.
The above S6041-S6043 take mutual distillation learning under the two conditions of the same language and different languages into consideration, maximize the third probability density and the third similarity through the first loss function, and maximize the fourth probability density and the third similarity through the second loss function, so that the initial generation model is trained more accurately, and a more accurate abstract generation model is obtained.
Referring to fig. 7, fig. 7 is a schematic diagram of a first loss and a second loss according to an embodiment of the present application; substituting the third probability density and the third similarity into the first loss function to calculate and obtain first loss; substituting the fourth probability density and the third similarity into the second loss function to calculate the second loss.
In the above embodiment, in the specific implementation of S6041, maximizing the third probability density draws in that the association between the multimodal content formed by the third sample text and the third sample image and the third sample abstract in the same language represents one training direction, and maximizing the third similarity draws in that the association between the third sample abstract and the fourth sample abstract represents another training direction; constructing the first loss function includes one sub-loss function for maximizing the third probability density, i.e., the first sub-loss function; in addition, considering that the degree of influence of the maximized third similarity on model training is different in different training stages, it is also necessary to determine the first coefficient corresponding to the third similarity in the case of maximizing the third probability density. Based on the first sub-loss function, substituting the third probability density into the first sub-loss function to calculate and obtain the first sub-loss; and combining a first coefficient corresponding to the third similarity on the basis of the first sub-loss and the third similarity, and carrying out weighted calculation to obtain the first loss. Thus, the present application provides one possible implementation, the first loss function comprising a first sub-loss function for maximizing the third probability density, the second loss function comprising a second sub-loss function for maximizing the fourth probability density; s6041 includes the following S9-S10, S6042 includes the following S11-S12 (not shown):
S9: and carrying out loss calculation according to the third probability density and the first sub-loss function to obtain the first sub-loss.
S10: weighting calculation is carried out according to the first sub-loss, the third similarity and a first coefficient corresponding to the third similarity, so as to obtain a first loss; the first coefficient is determined based on the number of trained times and the total number of trained times.
The S9-S10 takes different training directions into consideration, maximizes the third probability density through the first sub-loss function, maximizes the influence degree of the third similarity on model training under the condition of maximizing the third probability density, maximizes the third similarity through the first coefficient corresponding to the third similarity determined by the trained times and the total training times, and calculates the first loss more accurately; and providing a data basis for the subsequent more accurate training of the initial generation model.
As an example of the above S9 to S10, the calculation formula of the first loss is as follows:
wherein, p^{L_C}(y_t | X, V, y_<t) represents the probability density corresponding to the L_C sample abstract corresponding to the L_C sample text, log represents the logarithm, N represents the second number, L_gen^{L_C} represents the first sub-loss under the same language L_C, h_dec^{L_D} represents the decoding vector corresponding to the L_D sample abstract corresponding to the L_C sample text, h_dec^{L_C} represents the decoding vector corresponding to the L_C sample abstract corresponding to the L_C sample text, Dist(·) represents the similarity function, Dist(h_dec^{L_C}, h_dec^{L_D}) represents the third similarity between the decoding vector corresponding to the L_C sample abstract and the decoding vector corresponding to the L_D sample abstract, 1−α represents the first coefficient, and L_1 represents the first loss.
In the above embodiment, in the specific implementation of S6042, maximizing the fourth probability density, which draws closer the association relationship between the multi-modal content formed by the third sample text and the third sample image and the fourth sample abstract in different languages, represents one training direction, and maximizing the third similarity, which draws closer the association relationship between the third sample abstract and the fourth sample abstract, represents another training direction; constructing the second loss function includes constructing another sub-loss function for maximizing the fourth probability density, i.e., the second sub-loss function; in addition, considering that the degree of influence of maximizing the third similarity on model training is different in different training stages, it is also necessary to determine the second coefficient corresponding to the third similarity in the case of maximizing the fourth probability density. Based on this, the fourth probability density is substituted into the second sub-loss function to calculate the second sub-loss; and on the basis of the second sub-loss and the third similarity, weighted calculation is carried out in combination with the second coefficient corresponding to the third similarity to obtain the second loss. Thus, the present application provides one possible implementation, in which the second loss function includes a second sub-loss function for maximizing the fourth probability density; S6042 includes the following S11-S12 (not shown in the figure):
S11: and carrying out loss calculation according to the fourth probability density and the second sub-loss function to obtain the second sub-loss.
S12: weighting calculation is carried out according to the second sub-loss, the third similarity and a second coefficient corresponding to the third similarity, so as to obtain a second loss; the second coefficient is determined from the first coefficient.
The S11-S12 takes different training directions into consideration, maximizes the fourth probability density through the second sub-loss function, maximizes the influence degree of the third similarity on model training under the condition of maximizing the fourth probability density, maximizes the third similarity through the second coefficient corresponding to the third similarity determined by the first coefficient, calculates the second loss more accurately, and provides data basis for the follow-up more accurate training of the initial generation model.
As an example of the above S11 to S12, the calculation formula of the second loss is as follows:
wherein, p^{L_D}(y_t | X, V, y_<t) represents the probability density corresponding to the L_D sample abstract corresponding to the L_C sample text, log represents the logarithm, N represents the second number, L_gen^{L_D} represents the second sub-loss under the different languages L_C and L_D, h_dec^{L_C} represents the decoding vector corresponding to the L_C sample abstract corresponding to the L_C sample text, h_dec^{L_D} represents the decoding vector corresponding to the L_D sample abstract corresponding to the L_C sample text, Dist(·) represents the similarity function, Dist(h_dec^{L_D}, h_dec^{L_C}) represents the third similarity between the decoding vector corresponding to the L_D sample abstract and the decoding vector corresponding to the L_C sample abstract, α represents the second coefficient, and L_2 represents the second loss.
As an example of the above S6043, on the basis of the above S9-S12 example, the calculation formula of the total loss L of the first loss and the second loss is as follows:
L = L_1 + L_2
wherein, L_1 represents the first loss, and L_2 represents the second loss.
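A hedged sketch of how the first loss, the second loss and their total might be computed from the two probability densities, the two decoding vectors and the coefficients α and 1−α follows; the published sub-formulas are only implied by the definitions above, so the negative log-likelihood sub-losses and the mean-squared-error stand-in for Dist(·) are assumptions rather than the patent's exact equations.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_losses(p3, p4, dec3, dec4, y3, y4, alpha):
    """Hedged sketch of S9-S12. p3 / p4: probability densities of the third (same-language)
    and fourth (different-language) sample abstracts, shape (batch, N, vocab); dec3 / dec4:
    the corresponding decoding vectors, shape (batch, N, d); y3 / y4: gold token ids,
    shape (batch, N); alpha: second coefficient (1 - alpha is the first coefficient)."""
    nll3 = -torch.log(p3.clamp_min(1e-9)).gather(-1, y3.unsqueeze(-1)).mean()   # first sub-loss
    nll4 = -torch.log(p4.clamp_min(1e-9)).gather(-1, y4.unsqueeze(-1)).mean()   # second sub-loss
    dist = F.mse_loss(dec3.mean(dim=1), dec4.mean(dim=1))   # stand-in for Dist(.): smaller = more similar
    l1 = nll3 + (1.0 - alpha) * dist                        # first loss (S9-S10)
    l2 = nll4 + alpha * dist                                # second loss (S11-S12)
    return l1 + l2                                          # total loss L = L_1 + L_2
```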
In the above embodiment, in the specific implementation of S9-S12, maximizing the third similarity under the condition of maximizing the third probability density in fact makes the third decoding vector corresponding to the third sample abstract approach the fourth decoding vector corresponding to the fourth sample abstract; maximizing the third similarity under the condition of maximizing the fourth probability density in fact makes the fourth decoding vector corresponding to the fourth sample abstract approach the third decoding vector corresponding to the third sample abstract. Considering that in the earlier training stage, that is, the training stage in which the number of trained times is smaller than a preset multiple of the total number of training times (the preset multiple being smaller than 1), the third decoding vector corresponding to the third sample abstract, which belongs to the same language as the third sample text, is more accurate than the fourth decoding vector corresponding to the fourth sample abstract, which belongs to a different language from the third sample text; in this case, the first coefficient corresponding to the third similarity in the case of maximizing the third probability density needs to be smaller than the second coefficient corresponding to the third similarity in the case of maximizing the fourth probability density.
In the middle training stage, i.e. the training stage in which the trained times are equal to the preset multiple of the total training times, the difference in accuracy between the third decoding vector corresponding to the third sample abstract (same language as the third sample text) and the fourth decoding vector corresponding to the fourth sample abstract (different language from the third sample text) is small; in this case, the first coefficient corresponding to the third similarity under the condition of maximizing the third probability density may be equal to the second coefficient corresponding to the third similarity under the condition of maximizing the fourth probability density.
In the later training stage, i.e. the training stage in which the trained times are greater than the preset multiple of the total training times, the fourth decoding vector corresponding to the fourth sample abstract (different language from the third sample text) is more accurate than the third decoding vector corresponding to the third sample abstract (same language as the third sample text); in this case, the first coefficient corresponding to the third similarity under the condition of maximizing the third probability density needs to be greater than the second coefficient corresponding to the third similarity under the condition of maximizing the fourth probability density. Accordingly, the present application provides a possible implementation in which the determination of the first coefficient and the second coefficient includes the following S13, S14 or S15 (not shown in the figure):
S13: if the trained times are smaller than a preset multiple of the total training times, determining that the first coefficient is smaller than the second coefficient; the preset multiple is smaller than 1;
S14: if the trained times are equal to the preset multiple of the total training times, determining that the first coefficient is equal to the second coefficient;
S15: if the trained times are greater than the preset multiple of the total training times, determining that the first coefficient is greater than the second coefficient.
Through S13, S14 and S15, different first and second coefficients are determined for different training stages, so that the third similarity is maximized more accurately, the first loss and the second loss are in turn calculated more accurately, and a data basis is provided for subsequently training the initial generation model more accurately.
As an example of the above S13, S14 and S15, on the basis of the above examples of S9-S12, the determination formula of the first coefficient and the second coefficient is as follows:
α=max(0.5,1-s/S)
wherein α represents the second coefficient, s represents the trained times, S represents the total training times, the preset multiple is 0.5, max(·) represents the maximum function, and the first coefficient is 1-α.
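For illustration only, a minimal Python sketch of the coefficient schedule given by the formula above; the step counts in the usage loop are purely illustrative.

```python
def second_coefficient(trained_times: int, total_training_times: int) -> float:
    """alpha = max(0.5, 1 - s/S); the corresponding first coefficient is 1 - alpha."""
    return max(0.5, 1.0 - trained_times / total_training_times)

# Illustrative step counts: early in training alpha is close to 1 (the first
# coefficient 1 - alpha is small); from the halfway point onwards this formula
# keeps alpha at 0.5, so both coefficients are equal.
for s in (0, 2500, 5000, 7500, 10000):
    alpha = second_coefficient(s, 10000)
    print(f"s={s}: second coefficient={alpha:.2f}, first coefficient={1 - alpha:.2f}")
```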
It should be noted that the implementations provided in the above aspects may be further combined to provide still further implementations.
Based on the training method of the abstract generating model provided in the corresponding embodiment of fig. 2, the embodiment of the present application further provides a training device of the abstract generating model, referring to fig. 8, fig. 8 is a structural diagram of the training device of the abstract generating model provided in the embodiment of the present application, where the training device 800 of the abstract generating model includes: a first encoding unit 801, a first fusion unit 802, a first decoding unit 803, and a first training unit 804;
a first encoding unit 801, configured to encode, by using an encoder in an initial generation model, a first sample text in a first batch of samples, a first sample image corresponding to the first sample text, and a first sample abstract, to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image, and a first abstract word segmentation vector of the first sample abstract;
a first fusion unit 802, configured to perform cross-modal fusion on the first text vector and the first image vector through a fusion device in an initial generation model, so as to obtain a first fusion vector;
a first decoding unit 803, configured to decode, by using a decoder in the initial generation model, the first fusion vector and a first representation vector of the first sample digest, to obtain a first probability density corresponding to the first sample digest;
The first training unit 804 is configured to train model parameters of the initial generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors, so as to obtain an abstract generation model; the plurality of second abstract word segmentation vectors are obtained by encoding, through the encoder in the initial generation model, a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples.
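For illustration only, the following PyTorch-style sketch outlines the encode-fuse-decode pipeline implemented by the first encoding unit 801, the first fusion unit 802 and the first decoding unit 803. The concrete module choices (a transformer text encoder, a linear projection of detector image features, multi-head cross-attention as the fusion device, a transformer decoder) and all dimensions are assumptions of the sketch; this application does not fix the concrete architectures.

```python
import torch
import torch.nn as nn

class TextImageSummarizer(nn.Module):
    """Minimal sketch of the encode-fuse-decode pipeline described above."""
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.image_proj = nn.Linear(2048, d_model)   # e.g. pooled detector/CNN features
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats, summary_ids):
        # Encode the sample text and the sample image separately.
        text_vec = self.text_encoder(self.embed(text_ids))        # text vectors
        image_vec = self.image_proj(image_feats)                   # image vectors
        # Cross-modal fusion: text queries attend to image keys/values.
        fused, _ = self.fusion(text_vec, image_vec, image_vec)     # fusion vectors
        # Decode with the representation of the sample abstract (a causal mask
        # would be added in a real training loop).
        dec = self.decoder(self.embed(summary_ids), memory=fused)
        return self.lm_head(dec)                                   # token probabilities
```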
In one possible implementation, a generation loss function is used for maximizing the first probability density, and a contrast loss function is used for maximizing the first similarity and minimizing the plurality of second similarities; the first training unit 804 is specifically configured to:
performing loss calculation according to the first probability density and the generation loss function to obtain generation loss;
performing loss calculation according to the first similarity, the plurality of second similarities and the contrast loss function to obtain contrast loss;
and training model parameters of the initial generation model according to the generation loss and the contrast loss to obtain the abstract generation model.
In one possible implementation, the first training unit 804 is specifically configured to:
obtaining an adjustment coefficient corresponding to the contrast loss;
and training model parameters of the initial generation model according to the generation loss, the contrast loss and the adjustment coefficient to obtain the abstract generation model.
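For illustration only, a Python sketch of combining the generation loss with the contrast loss and the adjustment coefficient described above. Casting the contrast loss in an InfoNCE-like form, the cosine similarity, the temperature and the default adjustment coefficient are assumptions of the sketch; the application only requires maximizing the first similarity and minimizing the plurality of second similarities.

```python
import torch
import torch.nn.functional as F

def contrastive_training_loss(log_probs, object_vec, own_summary_vec,
                              other_summary_vecs, temperature=0.1, adjust_coeff=1.0):
    """Sketch of generation loss + adjusted contrast loss.

    object_vec: the first object vector of the sample image, shape [hidden].
    own_summary_vec: the first abstract word segmentation vector, shape [hidden].
    other_summary_vecs: second abstract word segmentation vectors from the other
    samples in the batch, shape [batch-1, hidden].
    """
    generation_loss = -log_probs.mean()   # maximizing the first probability density

    candidates = torch.cat([own_summary_vec.unsqueeze(0), other_summary_vecs], dim=0)
    sims = F.cosine_similarity(object_vec.unsqueeze(0), candidates) / temperature
    # InfoNCE-style objective: the matching abstract (index 0) is the positive,
    # the abstracts of the other samples are the negatives.
    contrast_loss = F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

    return generation_loss + adjust_coeff * contrast_loss
```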
In one possible implementation, the first encoding unit 801 is specifically configured to:
detecting first sample objects in the first sample image to obtain a first number of first sample objects; the first number is greater than or equal to the number of objects in the first sample image;
encoding a first number of first sample objects by an encoder in an initial generation model to obtain a first number of object encoding vectors;
average value calculation is carried out on a first number of object coefficients corresponding to a first number of first sample objects and a first number of object coding vectors to obtain a first object vector; the object coefficients are used to indicate whether the first sample object is empty.
In one possible implementation, the first encoding unit 801 is specifically configured to:
word segmentation is carried out on the first sample abstract, and a second number of first abstract word segmentation is obtained; the second number is greater than or equal to the number of abstract breaking words in the abstract of the first sample;
Encoding the first abstract word of the second number through an encoder in the initial generation model to obtain an abstract word encoding vector of the second number;
average value calculation is carried out on a second number of word segmentation coefficients corresponding to a second number of first abstract word segments and a second number of abstract word segment coding vectors to obtain first abstract word segment vectors; the word segmentation coefficient is used for indicating whether the first abstract word segmentation is empty.
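For illustration only, a short Python sketch of the average value calculation used both for the first object vector (over the first sample objects) and for the first abstract word segmentation vector (over the first abstract word segments). Dividing by the fixed count and encoding "empty" entries with a zero coefficient is an assumed reading of the steps above.

```python
import torch

def masked_mean_vector(encoded, coefficients, fixed_count):
    """Average a fixed-size set of coding vectors, ignoring empty entries.

    encoded: [fixed_count, hidden] coding vectors of the padded objects / word segments.
    coefficients: [fixed_count] with 1.0 for real entries and 0.0 for empty entries,
    as described for the object coefficients / word segmentation coefficients.
    fixed_count: the first number (objects) or the second number (word segments).
    """
    return (coefficients.unsqueeze(-1) * encoded).sum(dim=0) / fixed_count

# Illustrative use: 4 object slots, the last one empty (padding).
vecs = torch.randn(4, 512)
coeffs = torch.tensor([1.0, 1.0, 1.0, 0.0])
first_object_vector = masked_mean_vector(vecs, coeffs, fixed_count=4)
```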
In one possible implementation, the first sample text and the first sample abstract belong to the same language or different languages.
According to the technical scheme, the training device of the abstract generation model comprises: the first coding unit, the first fusion unit, the first decoding unit and the first training unit; the first coding unit inputs a first sample text in a first batch of samples and a first sample image corresponding to the first sample text into an encoder in an initial generation model to be coded, and outputs a first text vector of the first sample text and a first image vector of the first sample image; and outputting a first object vector of the first sample image, inputting a first sample abstract corresponding to the first sample text into an encoder in an initial generation model for encoding, and outputting a first abstract word segmentation vector of the first sample abstract. On the basis of respectively encoding the first sample text and the first sample image to obtain a first text vector and a first image vector, taking the association relationship between the first sample image and the first sample abstract into consideration, further obtaining a first object vector of a first sample object in the first sample image, and further encoding the first sample abstract to obtain a first abstract word segmentation vector of a first abstract word in the first sample abstract.
The first fusion unit inputs the first text vector and the first image vector into a fusion device in an initial generation model to perform cross-modal fusion, and outputs a first fusion vector; the first decoding unit inputs the first fusion vector and the first expression vector of the first sample abstract into a decoder in an initial generation model for decoding, and outputs a first probability density corresponding to the first sample abstract. Based on the first text vector and the first image vector, taking into consideration the association relation between the multimodal content formed by the first sample text and the first sample image and the first sample abstract, fusing the first text vector and the first image vector to obtain a first fused vector, and decoding the first fused vector into a first probability density corresponding to the first sample abstract by combining a first representation vector of the first sample abstract.
The first training unit inputs a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples into the initial generation model for encoding, and outputs a plurality of second abstract word segmentation vectors corresponding to the plurality of second sample abstracts; model parameters of the initial generation model are trained to obtain the abstract generation model by maximizing the first probability density, maximizing the first similarity between the first object vector and the first abstract word segmentation vector, and minimizing the plurality of second similarities between the first object vector and the plurality of second abstract word segmentation vectors. That is, the initial generation model is trained into the abstract generation model along the training directions of drawing in the association between the multimodal content formed by the first sample text and the first sample image and the first sample abstract, drawing in the association between the first sample image and the first sample abstract, and pushing away the association between the first sample image and the plurality of second sample abstracts.
Based on this, the training device not only learns the association between the multimodal content formed by the sample text and the sample image and the sample abstract, but also effectively learns, through contrastive learning and without constructing abstract images, the association between the sample image and the sample abstract, so that the abstract generation model can effectively capture the images more relevant to the abstract, generate an abstract that is more relevant to the corresponding text and image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and abstract quality of the abstract generation model.
Based on the training method of the abstract generating model provided in the corresponding embodiment of fig. 6, the embodiment of the present application further provides another training device of the abstract generating model, referring to fig. 9, fig. 9 is a structural diagram of another training device of the abstract generating model provided in the embodiment of the present application, where the training device 900 of the abstract generating model includes: a second encoding unit 901, a second fusion unit 902, a second decoding unit 903, and a second training unit 904;
a second encoding unit 901, configured to encode, by using an encoder in the initial generation model, a third sample text and a third sample image corresponding to the third sample text, to obtain a third text vector of the third sample text and a third image vector of the third sample image;
The second fusion unit 902 is configured to perform cross-modal fusion on the third text vector and the third image vector through a fusion device in the initial generation model, so as to obtain a third fusion vector;
a second decoding unit 903, configured to decode, by using a decoder in the initial generation model, a third expression vector of a third sample digest corresponding to the third fusion vector and the third sample text, and a fourth expression vector of a fourth sample digest corresponding to the third sample text, to obtain a third decoding vector and a third probability density corresponding to the third sample digest, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages;
the second training unit 904 is configured to train model parameters of the initial generation model according to maximizing a third probability density, maximizing a fourth probability density, and maximizing a third similarity between the third decoding vector and the fourth decoding vector, so as to obtain a summary generation model.
In one possible implementation, the first loss function is used to maximize the third probability density and maximize the third similarity, and the second loss function is used to maximize the fourth probability density and maximize the third similarity; the second training unit 904 is specifically configured to:
Performing loss calculation according to the third probability density, the third similarity and the first loss function to obtain first loss;
performing loss calculation according to the fourth probability density, the third similarity and the second loss function to obtain a second loss;
and training model parameters of the initial generation model according to the first loss and the second loss to obtain the abstract generation model.
In one possible implementation, the first loss function comprises a first sub-loss function for maximizing the third probability density, and the second loss function comprises a second sub-loss function for maximizing the fourth probability density; the second training unit 904 is specifically configured to:
performing loss calculation according to the third probability density and the first sub-loss function to obtain first sub-loss;
weighting calculation is carried out according to the first sub-loss, the third similarity and a first coefficient corresponding to the third similarity, so as to obtain a first loss; the first coefficient is determined according to the trained times and the total training times;
performing loss calculation according to the fourth probability density and the second sub-loss function to obtain second sub-loss;
weighting calculation is carried out according to the second sub-loss, the third similarity and a second coefficient corresponding to the third similarity, so as to obtain a second loss; the second coefficient is determined from the first coefficient.
In one possible implementation, the apparatus further includes: a determination unit;
a determining unit configured to:
if the trained times are smaller than a preset multiple of the total training times, determining that the first coefficient is smaller than the second coefficient; the preset multiple is smaller than 1;
if the trained times are equal to the preset multiple of the total training times, determining that the first coefficient is equal to the second coefficient;
if the trained times are greater than the preset multiple of the total training times, determining that the first coefficient is greater than the second coefficient.
According to the technical scheme, the training device of the abstract generation model comprises: the second encoding unit, the second fusion unit, the second decoding unit and the second training unit. The second encoding unit inputs the third sample text and the third sample image corresponding to the third sample text into the encoder in the initial generation model for encoding, and outputs a third text vector of the third sample text and a third image vector of the third sample image. That is, taking into consideration the associations between the multimodal content formed by the third sample text and the third sample image and the third sample abstract and the fourth sample abstract in the same language and in different languages, the third sample text and the third sample image are respectively encoded to obtain the third text vector and the third image vector.
The second fusion unit inputs the third text vector and the third image vector into the fusion device in the initial generation model to perform cross-modal fusion, and outputs a third fusion vector; the second decoding unit inputs the third fusion vector, a third representation vector of the third sample abstract corresponding to the third sample text, and a fourth representation vector of the fourth sample abstract corresponding to the third sample text into the decoder in the initial generation model for decoding, and outputs a third decoding vector and a third probability density corresponding to the third sample abstract, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample abstract; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages. That is, the third fusion vector is obtained by fusing the third text vector and the third image vector, and is decoded into the third probability density corresponding to the third sample abstract by combining the third representation vector of the third sample abstract, and into the fourth probability density corresponding to the fourth sample abstract by combining the fourth representation vector of the fourth sample abstract.
The second training unit trains model parameters of the initial generation model to obtain the abstract generation model by maximizing the third probability density, maximizing the fourth probability density and maximizing the third similarity between the third decoding vector and the fourth decoding vector. That is, the initial generation model is trained along the training directions of drawing in the association between the multimodal content formed by the third sample text and the third sample image and the third sample abstract in the same language, the association between that multimodal content and the fourth sample abstract in a different language, and the association between the third sample abstract and the fourth sample abstract.
Based on this, the training device not only learns the associations between the multimodal content formed by the sample text and the sample image and the sample abstracts in the same language and in different languages, but also learns, through mutual distillation, the association between the sample abstracts in different languages corresponding to the same sample text, so that the abstract generation model can effectively capture the shared information of the abstracts in different languages corresponding to the same text, generate more relevant multilingual abstracts corresponding to the text and the image, and improve the generation accuracy of the abstract generation model, thereby improving the abstract effect and abstract quality of the abstract generation model.
The embodiments of the present application further provide a computer device, which may be a server. Referring to fig. 10, fig. 10 is a block diagram of a server provided in the embodiments of the present application. The server 1000 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 1022, a memory 1032, and one or more storage media 1030 (such as one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transitory or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processor 1022 may be configured to communicate with the storage medium 1030 to perform, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the methods provided in the various alternative implementations of the above embodiments may be performed by the central processor 1022 in the server 1000.
The computer device provided in the embodiment of the present application may also be a terminal, and referring to fig. 11, fig. 11 is a block diagram of the terminal provided in the embodiment of the present application. Taking a terminal as an example of a smart phone, the smart phone comprises: radio Frequency (RF) circuitry 1110, memory 1120, input unit 1130, display unit 1140, sensor 1150, audio circuit 1160, wireless fidelity (Wireless Fidelity, wiFi) module 1170, processor 1180, power source 11120, and the like. The input unit 1130 may include a touch panel 1131 and other input devices 1132, the display unit 1140 may include a display panel 1141, and the audio circuit 1160 may include a speaker 1161 and a microphone 1162. Those skilled in the art will appreciate that the smartphone structure shown in fig. 11 is not limiting of the smartphone and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, phonebooks, etc.) created according to the use of the smartphone. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
The processor 1180 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, performs various functions of the smart phone and processes data by running or executing software programs and/or modules stored in the memory 1120, and invoking data stored in the memory 1120. In the alternative, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1180.
In this embodiment, the processor 1180 in the smart phone may perform the methods provided in the various alternative implementations of the above embodiments.
According to one aspect of the present application, there is provided a computer readable storage medium for storing a computer program which, when run on a computer device, causes the computer device to perform the methods provided in the various alternative implementations of the embodiments described above.
According to one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The description of each of the processes or structures corresponding to the drawings has its own emphasis; for a part of a certain process or structure that is not described in detail, reference may be made to the related descriptions of other processes or structures.
The terms "first," "second," and the like in the description of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented, for example, in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device to perform all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (15)

1. A method for training a summary generation model, the method comprising:
encoding a first sample text, a first sample image corresponding to the first sample text and a first sample abstract in a first batch of samples through an encoder in an initial generation model to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image and a first abstract word segmentation vector of the first sample abstract;
performing cross-modal fusion on the first text vector and the first image vector through a fusion device in the initial generation model to obtain a first fusion vector;
decoding the first fusion vector and the first representation vector of the first sample abstract through a decoder in the initial generation model to obtain a first probability density corresponding to the first sample abstract;
training model parameters of the initial generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors, to obtain the abstract generation model; the plurality of second abstract word segmentation vectors are obtained by encoding a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples through the encoder in the initial generation model.
2. The method of claim 1, wherein a generation loss function is used to maximize the first probability density, and a contrast loss function is used to maximize the first similarity and minimize the plurality of second similarities; the training model parameters of the initial generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors, to obtain the abstract generation model, comprises:
performing loss calculation according to the first probability density and the generated loss function to obtain a generated loss;
performing loss calculation according to the first similarity, the plurality of second similarities and the contrast loss function to obtain contrast loss;
and training model parameters of the initial generation model according to the generation loss and the contrast loss to obtain the abstract generation model.
3. The method according to claim 2, wherein the training model parameters of the initial generation model according to the generation loss and the contrast loss to obtain the abstract generation model comprises:
acquiring an adjustment coefficient corresponding to the contrast loss;
and training model parameters of the initial generation model according to the generation loss, the contrast loss and the adjustment coefficient to obtain the abstract generation model.
4. A method according to any one of claims 1-3, wherein the obtaining of the first object vector comprises:
detecting first sample objects in the first sample image to obtain a first number of first sample objects; the first number is greater than or equal to the number of objects in the first sample image;
encoding the first number of first sample objects by an encoder in an initial generation model to obtain the first number of object encoding vectors;
performing average value calculation on the first number, the first number of object coefficients corresponding to the first number of first sample objects and the first number of object coding vectors to obtain the first object vectors; the object coefficients are used to indicate whether the first sample object is empty.
5. A method according to any one of claims 1-3, wherein the step of obtaining the first abstract word segmentation vector comprises:
performing word segmentation on the first sample abstract to obtain a second number of first abstract word segments; the second number is greater than or equal to the number of abstract word segments in the first sample abstract;
encoding the second number of first abstract word segments through the encoder in the initial generation model to obtain the second number of abstract word segment coding vectors;
performing average value calculation on the second number, the second number of word segmentation coefficients corresponding to the second number of first abstract word segments, and the second number of abstract word segment coding vectors to obtain the first abstract word segmentation vector; the word segmentation coefficient is used for indicating whether the first abstract word segment is empty.
6. A method according to any of claims 1-3, characterized in that the first sample text and the first sample abstract belong to the same language or to different languages.
7. A method for training a summary generation model, the method comprising:
encoding a third sample text and a third sample image corresponding to the third sample text through an encoder in an initial generation model to obtain a third text vector of the third sample text and a third image vector of the third sample image;
Performing cross-modal fusion on the third text vector and the third image vector through a fusion device in the initial generation model to obtain a third fusion vector;
decoding, by the decoder in the initial generation model, the third fusion vector and a third representation vector of a third sample digest corresponding to the third sample text, and a fourth representation vector of a fourth sample digest corresponding to the third sample text, to obtain a third decoding vector and a third probability density corresponding to the third sample digest, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages;
and training model parameters of the initial generation model according to the third probability density, the fourth probability density and the third similarity between the third decoding vector and the fourth decoding vector to obtain the abstract generation model.
8. The method of claim 7, wherein a first loss function is used to maximize the third probability density and maximize the third similarity, and a second loss function is used to maximize the fourth probability density and maximize the third similarity; the training the model parameters of the initial generation model according to the third probability density maximization, the fourth probability density maximization and the third similarity between the third decoding vector and the fourth decoding vector maximization, so as to obtain the abstract generation model, including:
Performing loss calculation according to the third probability density, the third similarity and the first loss function to obtain a first loss;
performing loss calculation according to the fourth probability density, the third similarity and the second loss function to obtain a second loss;
and training the model parameters of the initial generation model according to the first loss and the second loss to obtain the abstract generation model.
9. The method of claim 8, wherein the first loss function comprises a first sub-loss function for maximizing the third probability density, and the second loss function comprises a second sub-loss function for maximizing the fourth probability density; and performing loss calculation according to the third probability density, the third similarity and the first loss function to obtain a first loss, including:
performing loss calculation according to the third probability density and the first sub-loss function to obtain first sub-loss;
weighting calculation is carried out according to the first sub-loss, the third similarity and a first coefficient corresponding to the third similarity, so as to obtain the first loss; the first coefficient is determined according to the trained times and the total training times;
And performing loss calculation according to the fourth probability density, the third similarity and the second loss function to obtain a second loss, including:
performing loss calculation according to the fourth probability density and the second sub-loss function to obtain second sub-loss;
weighting calculation is carried out according to the second sub-loss, the third similarity and a second coefficient corresponding to the third similarity, so as to obtain the second loss; the second coefficient is determined from the first coefficient.
10. The method of claim 9, wherein the determining of the first coefficient and the second coefficient comprises:
if the trained times are smaller than a preset multiple of the total training times, determining that the first coefficient is smaller than the second coefficient; the preset multiple is smaller than 1;
if the trained times are equal to the preset multiple of the total training times, determining that the first coefficient is equal to the second coefficient;
and if the trained times are greater than the preset multiple of the total training times, determining that the first coefficient is greater than the second coefficient.
11. A training device for a summary generation model, the device comprising: the first coding unit, the first fusion unit, the first decoding unit and the first training unit;
The first coding unit is configured to code, by using an encoder in an initial generation model, a first sample text in a first batch of samples, a first sample image corresponding to the first sample text, and a first sample abstract, so as to obtain a first text vector of the first sample text, a first image vector of the first sample image, a first object vector of the first sample image, and a first abstract word segmentation vector of the first sample abstract;
the first fusion unit is used for performing cross-modal fusion on the first text vector and the first image vector through the fusion device in the initial generation model to obtain a first fusion vector;
the first decoding unit is configured to decode, by using a decoder in the initial generation model, the first fusion vector and a first representation vector of the first sample digest, to obtain a first probability density corresponding to the first sample digest;
the first training unit is configured to train model parameters of the initial generation model according to maximizing the first probability density, maximizing a first similarity between the first object vector and the first abstract word segmentation vector, and minimizing a plurality of second similarities between the first object vector and a plurality of second abstract word segmentation vectors, so as to obtain the abstract generation model; the plurality of second abstract word vectors are obtained by encoding a plurality of second sample abstracts corresponding to a plurality of second sample texts different from the first sample text in the first batch of samples through an encoder in the initial generation model.
12. A training device for a summary generation model, the device comprising: the second coding unit, the second fusion unit, the second decoding unit and the second training unit;
the second coding unit is configured to code a third sample text and a third sample image corresponding to the third sample text through an encoder in an initial generation model, so as to obtain a third text vector of the third sample text and a third image vector of the third sample image;
the second fusion unit is configured to perform cross-modal fusion on the third text vector and the third image vector through the fusion device in the initial generation model, so as to obtain a third fusion vector;
the second decoding unit is configured to decode, by using a decoder in the initial generation model, the third fusion vector and a third representation vector of a third sample digest corresponding to the third sample text, and a fourth representation vector of a fourth sample digest corresponding to the third sample text, to obtain a third decoding vector and a third probability density corresponding to the third sample digest, and a fourth decoding vector and a fourth probability density corresponding to the fourth sample digest; the third sample text and the third sample abstract belong to the same language, and the third sample text and the fourth sample abstract belong to different languages;
The second training unit is configured to train model parameters of the initial generation model according to maximizing the third probability density, maximizing the fourth probability density, and maximizing a third similarity between the third decoding vector and the fourth decoding vector, so as to obtain the abstract generation model.
13. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of claims 1-10 according to instructions in the computer program.
14. A computer readable storage medium for storing a computer program which, when run on a computer device, causes the computer device to perform the method of any one of claims 1-10.
15. A computer program product comprising a computer program, characterized in that the computer program, when run on a computer device, causes the computer device to perform the method of any of claims 1-10.
CN202311178879.4A 2023-09-12 2023-09-12 Training method and related device of abstract generation model Pending CN117473359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311178879.4A CN117473359A (en) 2023-09-12 2023-09-12 Training method and related device of abstract generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311178879.4A CN117473359A (en) 2023-09-12 2023-09-12 Training method and related device of abstract generation model

Publications (1)

Publication Number Publication Date
CN117473359A true CN117473359A (en) 2024-01-30

Family

ID=89636917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311178879.4A Pending CN117473359A (en) 2023-09-12 2023-09-12 Training method and related device of abstract generation model

Country Status (1)

Country Link
CN (1) CN117473359A (en)


Legal Events

Date Code Title Description
PB01 Publication