WO2023159763A1 - Model training method and apparatus, text summary generation method and apparatus, and device - Google Patents


Info

Publication number
WO2023159763A1
Authority
WO
WIPO (PCT)
Prior art keywords
original, text, data, vector, matrix
Prior art date
Application number
PCT/CN2022/090729
Other languages
English (en)
French (fr)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023159763A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a model training method and device, a text summary generation method and device, and an electronic device.
  • Multimodal summarization aims to condense information in multiple modalities into short, concise and readable text summaries, so that users can quickly and easily understand the main information in images or videos.
  • the embodiment of the present application proposes a model training method, the training method is used to train a text summary generation model, and the method includes:
  • the original training data includes original image data and original text data, and the original image data corresponds to the original text data;
  • the text summary generation model is obtained by performing contrastive learning training on the original summary generation model using the original summary data, the plurality of first positive example pairs and the plurality of second positive example pairs.
  • the embodiment of the present application proposes a text summary generation method, the method including:
  • the text summary generation model is trained according to a model training method; wherein the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain an original image vector, and performing multimodal encoding processing on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive example pair according to the original summary vector and the original text vector; constructing a second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text summary generation model.
  • the embodiment of the present application proposes a model training device, the training device is used to train the text summary generation model, and the training device includes:
  • a first acquiring module configured to acquire at least two original training data; wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data;
  • An encoding module configured to perform multimodal encoding processing on the original image data to obtain an original image vector, and perform multimodal encoding processing on the original text data to obtain an original text vector;
  • a first processing module configured to obtain original summary data according to the original text data and the original image data
  • the second processing module is used to vectorize the original summary data to obtain the original summary vector
  • a first building module configured to construct a first positive example pair according to the original abstract vector and the original text vector
  • a second building module configured to construct a second positive example pair according to the original text vector and the original image vector;
  • a training module configured to perform contrastive learning training on the original summary generation model using the original summary data, the plurality of first positive example pairs, and the plurality of second positive example pairs to obtain the text summary generation model.
  • the embodiment of the present application proposes a text summary generation device, the text summary generation device including:
  • the second acquisition module is used to acquire text data to be generated and image data to be generated;
  • a summary generation module configured to input the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary; wherein, the text summary generation model is trained according to a model training method;
  • the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain an original image vector, and performing multimodal encoding processing on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs, and the multiple second positive example pairs to obtain the text summary generation model.
  • the embodiment of the present application provides an electronic device, including: at least one processor and at least one memory; a program is stored in the memory, and the processor executes the at least one program to implement a model training method or a text summary generation method;
  • the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain an original image vector, and performing multimodal encoding processing on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs, and the multiple second positive example pairs to obtain the text summary generation model;
  • the text summary generation method includes: acquiring text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary; wherein the text summary generation model is trained according to the model training method.
  • the embodiment of the present application provides a storage medium, the storage medium being a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions are used to cause a computer to execute a model training method or a text summary generation method;
  • the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain an original image vector, and performing multimodal encoding processing on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs, and the multiple second positive example pairs to obtain the text summary generation model;
  • the text summary generation method includes: acquiring text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary; wherein the text summary generation model is trained according to the model training method.
  • This application proposes a model training method and device, a text summary generation method and device, and an electronic device.
  • the obtained text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data.
  • when generating the target summary, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary, thereby improving the accuracy of the text summaries it generates.
  • FIG. 1 is a flowchart of the model training method provided by an embodiment of the present application;
  • FIG. 2 is a flowchart of a specific method of step S200 in FIG. 1;
  • FIG. 3 is a flowchart of a specific method of step S300 in FIG. 1;
  • FIG. 4 is a flowchart of a specific method of step S230 in FIG. 2;
  • FIG. 5 is a flowchart of a specific method of step S700 in FIG. 1;
  • FIG. 6 is a flowchart of a text summary generation method provided by an embodiment of the present application;
  • FIG. 7 is a module block diagram of the model training device provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.
  • Artificial intelligence is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Natural language processing (NLP) uses computers to process, understand and use human languages (such as Chinese and English). NLP is a branch of artificial intelligence and an interdisciplinary subject between computer science and linguistics, also known as computational linguistics. Natural language processing includes syntax analysis, semantic analysis, text understanding, etc., and is often used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering and artificial intelligence research related to language processing, as well as linguistics research related to language computing.
  • An encoder converts the input sequence into a fixed-length vector.
  • Cross-modal encoding refers to interactive encoding between input sequences of different modalities; for example, cross-modal encoding between language and image is realized through a language encoder and an image encoder.
  • Decoding converts the previously generated fixed-length vector into an output sequence; the input sequence can be text, speech, image or video, and the output sequence can be text or image.
  • Mean Pooling refers to averaging all values in the local receptive field.
  • Multimodal summarization aims to condense information from multiple modalities (such as images, text, and audio) into short, concise, and readable text summaries.
  • the tasks of multimodal summarization include identifying subjects and generating words based on the understanding of the input.
  • in the related art, language and visual features are combined, and a Seq2Seq hierarchical attention method is used to generate summaries.
  • Seq2Seq is a very important and popular model in natural language processing. It includes an encoder (Encoder), which vectorizes the input information and integrates its semantics, and a decoder (Decoder), which decodes the intermediate vector to obtain the text output.
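  • For illustration only, a minimal PyTorch sketch of the Seq2Seq pattern described above might look as follows; the GRU modules, dimensions and greedy decoding loop are illustrative assumptions, not the architecture of this application:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Vectorizes an input token sequence into a fixed-length state."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, src_len)
        _, state = self.rnn(self.embed(tokens))
        return state                            # intermediate vector: (1, batch, hidden)

class Decoder(nn.Module):
    """Decodes the intermediate vector into output tokens, one step at a time."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_tokens, state):      # prev_tokens: (batch, 1)
        output, state = self.rnn(self.embed(prev_tokens), state)
        return self.out(output), state          # logits: (batch, 1, vocab)

# Greedy decoding sketch (BOS_ID and max_len are assumed constants):
# state = encoder(src_tokens)
# tok = torch.full((batch, 1), BOS_ID, dtype=torch.long)
# for _ in range(max_len):
#     logits, state = decoder(tok, state)
#     tok = logits.argmax(-1)
```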
  • in the related art, supervised training data can train the parameters in the encoder and decoder and finally make the target model fit. However, this approach does not consider the problem of semantic alignment between data of different modalities, and direct information fusion easily leads to the accumulation of redundant information.
  • the embodiment of the present application proposes a model training method and device, a text summary generation method and device, and a device, which can improve the accuracy of the model-generated summary.
  • the embodiment of the present application provides a model training method and device, a text summary generation method and device, and equipment, which are specifically described through the following embodiments. First, the model training method in the embodiment of the present application is described.
  • the embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.
  • artificial intelligence is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the model training method and the text summary generation method provided in the embodiments of the present application relate to the field of artificial intelligence technology, especially to the field of data mining technology.
  • the model training method or the text summary generation method provided by the embodiment of the present application can be applied to a terminal or a server, or can be software running on a terminal or server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer or a smart watch, etc.
  • the server can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms; the software can be an application that implements the model training method or the text summary generation method, but is not limited to the above forms.
  • the application can be used in numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • some embodiments of the present application provide a model training method, which is used to train a text summary generation model.
  • the training method includes but is not limited to step S100 to step S700; these seven steps are described in detail below in conjunction with FIG. 1.
  • Step S100 Acquire at least two original training data; wherein the original training data includes original image data and original text data, and the original image data corresponds to the original text data one-to-one.
  • the original training data is a training set
  • the original training data includes original image data, original text data, and there is a one-to-one correspondence between the original image data and the original text data.
  • the original image data is represented by I
  • the original text data is represented by T.
  • the specific content of the original image data and the original text data is not specifically limited.
  • Step S200 Perform multimodal encoding processing on the original image data to obtain an original image vector, and perform multimodal encoding processing on the original text data to obtain an original text vector.
  • Step S300 Obtain original summary data according to the original text data and original image data.
  • Step S400 Perform vectorization processing on the original summary data to obtain the original summary vector.
  • in step S400 of some embodiments, Y is used to represent the original summary data. The original summary data Y is first encoded by a cross-modal encoder to obtain the original summary matrix, and mean pooling and mapping processing are then performed on the original summary matrix to obtain the feature vector corresponding to the original summary data, that is, the original summary vector.
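  • As an illustration, a minimal PyTorch sketch of this mean-pooling-and-mapping step, assuming the encoder output is an L-by-D matrix and the mapping is a learned linear projection (the dimensions and the `proj` layer are illustrative assumptions):

```python
import torch
import torch.nn as nn

D_MODEL, D_PROJ = 768, 256                      # illustrative dimensions

proj = nn.Linear(D_MODEL, D_PROJ)               # the learned "mapping" step

def pool_and_map(H: torch.Tensor) -> torch.Tensor:
    """Mean-pool an encoder output matrix H of shape (L, D_MODEL) along the
    sequence axis, then map it to a fixed-size feature vector."""
    v = H.mean(dim=0)                           # mean pooling -> (D_MODEL,)
    return proj(v)                              # feature vector -> (D_PROJ,)

H_Y = torch.randn(50, D_MODEL)                  # stand-in for the encoded summary matrix
v_Y = pool_and_map(H_Y)                         # the original summary vector
```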
  • Step S500 Construct the first positive example pair according to the original summary vector and the original text vector.
  • in step S500 of some embodiments, the input training set is a batch, and a batch includes M samples. To ensure that the input original text data and original summary data have thematic consistency, the original summary vector and the original text vector corresponding to the original summary vector are constructed as a first positive example pair, while the original text data and original summary data of other samples can be regarded as first negative example pairs of the first positive example pair.
  • for example, the original text data T_i of a certain sample and the corresponding original summary data Y_i can be regarded as a first positive example pair, while the original text data and original summary data of the other samples in the batch form the corresponding first negative example pairs.
  • Step S600 Construct a second positive example pair according to the original text vector and the original image vector.
  • step S600 of some embodiments is similar to the preceding step S500: when training the model for consistency between the input text and the input image, the input training set is a batch that includes M samples. To ensure that the input original text data and original image data have thematic consistency, a second positive example pair is constructed from the original text vector and the original image vector corresponding to the original text vector, while other original image data and original text data can be regarded as second negative example pairs of the second positive example pair.
  • for example, the original text data T_i of a certain sample and the corresponding original image data I_i can be regarded as a second positive example pair, while the original text data and original image data of the other samples in the batch form the corresponding second negative example pairs.
  • in this way, the model can learn the thematic consistency of the original text data and the original image data, thereby achieving the training goal of input text-image consistency, as illustrated in the sketch below.
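  • For illustration, a minimal sketch of this in-batch pairing, assuming M aligned text and image vectors per batch (the dimensions are illustrative assumptions):

```python
import torch

M = 4                                     # a batch of M samples
text_vecs  = torch.randn(M, 256)          # original text vectors, one per sample
image_vecs = torch.randn(M, 256)          # original image vectors, aligned index-for-index

# Pair (text_vecs[i], image_vecs[i]) is a second positive example pair;
# every (text_vecs[i], image_vecs[j]) with j != i is a negative pair.
pos_mask = torch.eye(M, dtype=torch.bool)   # M positives on the diagonal
neg_mask = ~pos_mask                        # M*(M-1) in-batch negatives
```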
  • Step S700 Perform contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain a text summary generation model.
  • the target summary is a text summary.
  • the original summary generation model can be expressed by formula (1), which in its standard autoregressive form is:

    P(Y \mid D; \theta) = \prod_{j} p(y_j \mid y_{<j}, D; \theta)    (1)
  • D represents a sample in the training process
  • y j represents the jth word of the target summary.
  • the target summary is composed of words generated one by one; the target summary is obtained by generating these words in sequence.
  • the parameters θ in the original summary generation model are trained by contrastive learning using the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, so as to obtain a model that simultaneously characterizes the consistency between the input text and the input image and the consistency between the input text and the output summary. This enables the model to take both image data and text data into account when generating the target text summary, and improves the accuracy of the text summaries generated by the text summary generation model.
  • In the model training method of the embodiment of the present application, the original training data are obtained; multimodal encoding processing is performed on the original image data in the original training data to obtain the original image vector, and on the original text data in the original training data to obtain the original text vector; the original summary data are obtained according to the original text data and the original image data; the original summary data are then vectorized to obtain the original summary vector; the first positive example pair is constructed according to the original summary vector and the original text vector corresponding to the original summary vector; the second positive example pair is constructed according to the original text vector and the original image vector corresponding to the original text vector; and finally the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • in some embodiments, step S200 includes but is not limited to step S210, step S220, step S230 and step S240; these four steps are described in detail below in conjunction with FIG. 2.
  • Step S210 Perform cross-modal encoding on the original text data according to a preset cross-modal encoder to obtain an original text matrix.
  • the original text data is encoded cross-modally by a cross-modal encoder to obtain the original text matrix, denoted H_T, with H_T ∈ R^{L×D}, where L represents the length of the original text data and D represents the dimension of the mapping vector.
  • Step S220 Perform pooling mapping processing on the original text matrix to obtain the original text vector.
  • in step S220 of some embodiments, the original text matrix obtained in step S210 is subjected to mean pooling processing, and a mapping operation is then performed to obtain the original text vector.
  • Step S230 Perform cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix.
  • Step S240 Perform pooling mapping processing on the original image matrix to obtain the original image vector.
  • the obtained original image matrix is denoted by H_I, with H_I ∈ R^{L×D}; the original image vector is obtained from H_I by the pooling mapping processing of step S240.
  • step S300 includes but is not limited to step S310 and step S320; these two steps are described in detail below with reference to FIG. 3.
  • Step S310 splicing the original text matrix and the original image matrix to obtain the target summary matrix
  • Step S320 Decode the target summary matrix according to a preset decoder to obtain original summary data.
  • the original text matrix H_T and the original image matrix H_I are subjected to vector splicing processing to obtain the target summary matrix, denoted by H. Through the splicing operation, the features of the original text data and the original image data are combined, so that the subsequent original summary generation model can learn the features of both.
  • the target summary matrix H is decoded by the decoder to obtain the original summary data Y.
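  • A minimal sketch of this splicing-and-decoding step; concatenating along the sequence axis and the `decoder.generate` call are illustrative assumptions, not an API defined by this application:

```python
import torch

L, D = 50, 768
H_T = torch.randn(L, D)              # original text matrix
H_I = torch.randn(L, D)              # original image matrix

# Splice along the sequence axis so the decoder can attend to text
# features and image features within a single matrix.
H = torch.cat([H_T, H_I], dim=0)     # target summary matrix, shape (2L, D)

# summary_ids = decoder.generate(memory=H)   # hypothetical decoder call
```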
  • however, the original summary data obtained in this way do not take into account the thematic consistency between the original text data and the original image data, or between the original text data and the original summary data.
  • therefore, the first positive example pair and the second positive example pair are constructed for contrastive learning training, so that the text summary generation model can learn the thematic consistency between the original text data and the original image data, and between the original text data and the original summary data.
  • step S230 includes but is not limited to step S231, step S232, step S233 and step S234; these four steps are described in detail below with reference to FIG. 4.
  • Step S231 Pre-encoding the original text data to obtain a text sub-vector matrix
  • Step S232 Pre-encoding the original image data to obtain an image sub-vector matrix
  • Step S233 Obtain the transpose matrix of the image sub-vector matrix to obtain the image sub-vector transpose matrix
  • Step S234 Perform iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix.
  • the original text data includes L words
  • the original text data is firstly pre-encoded to obtain the text sub-vector matrix H T1
  • the original image data is pre-encoded to obtain the image sub-vector matrix H I1
  • the transpose of the image sub-vector matrix is then taken to obtain the image sub-vector transpose matrix;
  • iteratively process according to the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix.
  • this iteration can be expressed by formula (2), in which the matrix on the left side is repeatedly updated N times to finally obtain the original image matrix H_I:

    H^{(n)} = \mathrm{softmax}\left( \frac{H^{(n-1)} H_{I1}^{\top}}{\sqrt{D}} \right) H_{I1}, \qquad H^{(0)} = H_{T1}, \quad H_I = H^{(N)}    (2)

  • the first input is the above-mentioned H_{T1} and H_{I1}, and each of the next N-1 layers uses the output of the previous layer. Dividing by \sqrt{D} normalizes the vector, that is, smooths the values: if the dimension D is large, the computed dot products would be very large, and dividing by the square root of D makes the values smaller and smoother.
  • similarly, formula (3) is used to obtain the original text matrix H_T; it has the same form as formula (2), with the roles of the text and image sub-vector matrices exchanged.
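  • A minimal PyTorch sketch of one plausible reading of this iteration, in which the text sub-vector matrix H_T1 queries the image sub-vector matrix H_I1 and the result is fed back for N layers (the layer count and shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cross_modal_encode(H_T1: torch.Tensor, H_I1: torch.Tensor,
                       n_layers: int = 4) -> torch.Tensor:
    """Iterated scaled dot-product attention: the text sub-vector matrix
    queries the image sub-vector matrix, and each layer's output becomes
    the query for the next layer, for N layers in total."""
    D = H_I1.size(-1)
    H = H_T1                                            # layer-0 input
    for _ in range(n_layers):
        scores = H @ H_I1.transpose(0, 1) / D ** 0.5    # divide by sqrt(D) to smooth values
        H = F.softmax(scores, dim=-1) @ H_I1            # (L, D)
    return H                                            # the original image matrix H_I

H_T1 = torch.randn(50, 768)   # pre-encoded text sub-vector matrix
H_I1 = torch.randn(49, 768)   # pre-encoded image sub-vector matrix (e.g., image patches)
H_I = cross_modal_encode(H_T1, H_I1)
```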
  • in some embodiments, step S700 includes but is not limited to step S710, step S720, step S730 and step S740; these four steps are described in detail below in conjunction with FIG. 5.
  • Step S710 Construct a first loss function according to the original summary data, the first positive example pair and the corresponding first negative example pair.
  • in some embodiments, the input training set is a batch, and a batch includes M samples. To ensure that the input original text data and original summary data have thematic consistency, the original summary vector and the original text vector corresponding to the original summary vector are constructed as a first positive example pair, while the original text data and original summary data of other samples can be regarded as first negative example pairs of the first positive example pair.
  • for example, the original text data T_i of a certain sample and the corresponding original summary data Y_i can be regarded as a first positive example pair, while the original text data and original summary data of the other samples in the batch form the corresponding first negative example pairs.
  • the first loss function is constructed based on the original summary data, the first positive example pair and the corresponding first negative example pairs; in its standard contrastive form, the first loss function is expressed by formula (4):

    L_1 = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(\mathrm{sim}(v_{Y_i}, v_{T_i})/\tau)}{\sum_{k=1}^{M} \exp(\mathrm{sim}(v_{Y_i}, v_{T_k})/\tau)}    (4)
  • where sim(·,·) represents the cosine similarity between two vectors, v_{Y_i} and v_{T_i} denote the original summary vector and the original text vector of the i-th sample, and τ represents a temperature hyperparameter used to control the speed of model fitting.
  • Step S720 Construct a second loss function according to the original summary data, the second positive example pair and the corresponding second negative example pair.
  • in some embodiments, the input training set is a batch, and a batch includes M samples. To ensure that the input original text data and original image data have thematic consistency, the original text vector and the original image vector corresponding to the original text vector are constructed as a second positive example pair, while other original image data and original text data can be regarded as second negative example pairs of the second positive example pair.
  • for example, the original text data T_i of a certain sample and the corresponding original image data I_i can be regarded as a second positive example pair, while the original text data and original image data of the other samples in the batch form the corresponding second negative example pairs.
  • the second loss function is constructed based on the original summary data obtained in the previous steps, the second positive example pair and the corresponding second negative example pair.
  • the second loss function is expressed by formula (5), which takes the same contrastive form over text-image pairs:

    L_2 = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(\mathrm{sim}(v_{T_i}, v_{I_i})/\tau)}{\sum_{k=1}^{M} \exp(\mathrm{sim}(v_{T_i}, v_{I_k})/\tau)}    (5)
  • where sim(·,·) represents the cosine similarity between two vectors, v_{T_i} and v_{I_i} denote the original text vector and the original image vector of the i-th sample, and τ represents a temperature hyperparameter used to control the speed of model fitting.
  • Step S730 Obtain the target loss function according to the first loss function and the second loss function.
  • the target loss function is constructed according to the first loss function and the second loss function; the target loss function is expressed by formula (6), which combines the two losses (shown here as an unweighted sum):

    L = L_1 + L_2    (6)
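  • A minimal PyTorch sketch of this contrastive objective, assuming the standard in-batch form with cosine similarity and temperature τ as described above; the helper name `info_nce`, the temperature value, and the equal-weight combination are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: row i of `a` and row i of `b` form a
    positive pair; every other row of `b` serves as an in-batch negative.
    Uses cosine similarity scaled by the temperature tau."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                  # (M, M) similarity matrix
    labels = torch.arange(a.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# v_Y, v_T, v_I: summary, text, and image vectors for one batch of M samples
M, d = 8, 256
v_Y, v_T, v_I = (torch.randn(M, d) for _ in range(3))

loss_1 = info_nce(v_Y, v_T)   # first loss: summary-text consistency
loss_2 = info_nce(v_T, v_I)   # second loss: text-image consistency
loss = loss_1 + loss_2        # target loss, per the combination in formula (6)
```

  • In practice, backpropagating this scalar loss through the encoders would fine-tune the parameters θ as described in step S740.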
  • Step S740 fine-tuning the parameters of the original summary generation model according to the target loss function to obtain a text summary generation model.
  • the target loss function is used to fine-tune the parameter ⁇ in the original summary generation model shown in formula (1), so as to obtain the text summary generation model.
  • some embodiments of the present application also propose a text summary generation method, which includes step S800 and step S900. It should be understood that the text summary generation method in the embodiment of the present application includes but is not limited to step S800 and step S900; these two steps are described in detail below with reference to FIG. 6.
  • Step S800 acquiring text data to be generated and image data to be generated
  • Step S900 Input the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary; wherein the text summary generation model is trained according to a model training method; wherein the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain the original image vector, and performing multimodal encoding processing on the original text data to obtain the original text vector; obtaining the original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain the original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • the text data to be generated and the image data to be generated, for which a multimodal summary needs to be generated, are input into the text summary generation model trained as in the embodiments of the first aspect, so as to obtain the multimodal target text summary corresponding to the text data to be generated and the image data to be generated.
  • By obtaining the original training data, performing multimodal encoding processing on the original image data in the original training data to obtain the original image vector and on the original text data to obtain the original text vector, obtaining the original summary data according to the original text data and the original image data, vectorizing the original summary data to obtain the original summary vector, constructing the first positive example pair according to the original summary vector and the corresponding original text vector, constructing the second positive example pair according to the original text vector and the corresponding original image vector, and finally training the original summary generation model by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model is obtained.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • some embodiments of the present application also propose a model training device, the training device is used to train the text summary generation model, the training device includes a first acquisition module 1000, an encoding module 1100, a first processing module 1200 , a second processing module 1300 , a first building module 1400 , a second building module 1500 and a training module 1600 .
  • the first obtaining module 1000 is used to obtain at least two original training data; wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data.
  • the encoding module 1100 is configured to perform multimodal encoding processing on original image data to obtain an original image vector, and perform multimodal encoding processing on original text data to obtain an original text vector.
  • the first processing module 1200 is configured to obtain original summary data according to the original text data and the original image data.
  • the second processing module 1300 is configured to perform vectorization processing on the original summary data to obtain the original summary vector.
  • the first construction module 1400 is configured to construct a first positive example pair according to the original summary vector and the original text vector.
  • the second construction module 1500 is configured to construct a second positive example pair according to the original text vector and the original image vector.
  • the training module 1600 is configured to perform contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain a text summary generation model.
  • The model training device of the embodiment of the present application obtains the original training data; performs multimodal encoding processing on the original image data in the original training data to obtain the original image vector, and on the original text data in the original training data to obtain the original text vector; obtains the original summary data according to the original text data and the original image data; vectorizes the original summary data to obtain the original summary vector; constructs the first positive example pair according to the original summary vector and the corresponding original text vector; constructs the second positive example pair according to the original text vector and the corresponding original image vector; and finally trains the original summary generation model by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • the model training device in the embodiment of the present application corresponds to the above-mentioned model training method; for the specific training steps or processing steps, refer to the above-mentioned model training method, which will not be repeated here.
  • some embodiments of the present application further provide a text summary generation device, and the text summary generation device includes a second acquisition module and a summary generation module.
  • the second acquisition module is used to acquire text data to be generated and image data to be generated.
  • the summary generation module is used to input the text data to be generated and the image data to be generated into the text summary generation model to generate the target text summary; wherein the text summary generation model is trained according to a model training method; wherein the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain the original image vector, and performing multimodal encoding processing on the original text data to obtain the original text vector; obtaining the original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain the original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The text summary generation device of the embodiment of the present application relies on a model trained by obtaining the original training data; performing multimodal encoding processing on the original image data in the original training data to obtain the original image vector, and on the original text data in the original training data to obtain the original text vector; obtaining the original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain the original summary vector; constructing the first positive example pair according to the original summary vector and the corresponding original text vector; constructing the second positive example pair according to the original text vector and the corresponding original image vector; and finally training the original summary generation model by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • the text summary generation device in the embodiment of the present application corresponds to the aforementioned text summary generation method; for the specific operation steps or processes, refer to the aforementioned text summary generation method, which will not be repeated here.
  • the embodiment of the present application also provides an electronic device, including: at least one processor and at least one memory; programs are stored in the memory, and the processor executes at least one program to realize a model training method or a text summary generation method.
  • the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain the original image vector, and performing multimodal encoding processing on the original text data to obtain the original text vector; obtaining the original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain the original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • the text summary generation method includes: obtaining the text data to be generated and the image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate the target text summary; wherein the text summary generation model is trained according to the above model training method.
  • the electronic device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a vehicle-mounted computer, and the like.
  • By executing the above-mentioned model training method or text summary generation method, the electronic device of the embodiment of the present application obtains the original training data; performs multimodal encoding processing on the original image data in the original training data to obtain the original image vector, and on the original text data in the original training data to obtain the original text vector; obtains the original summary data according to the original text data and the original image data; vectorizes the original summary data to obtain the original summary vector; then constructs the first positive example pair according to the original summary vector and the corresponding original text vector, and constructs the second positive example pair according to the original text vector and the corresponding original image vector; and finally performs contrastive learning training on the original summary generation model through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • FIG. 8 illustrates the hardware structure of an electronic device in another embodiment; the electronic device includes:
  • the processor 1700, which may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute relevant programs to realize the technical solutions provided by the embodiments of the present application;
  • the memory 1800 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1800 can store operating systems and other application programs.
  • the relevant program codes are stored in the memory 1800 and called by the processor 1700 to execute the methods of the embodiments of the present application.
  • Input/output interface 1900 used to realize information input and output
  • the communication interface 2000 is used to realize communication and interaction between the device and other devices; communication can be realized in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi or Bluetooth);
  • bus 2100 for transferring information between various components of the device (eg, processor 1700, memory 1800, input/output interface 1900, and communication interface 2000);
  • the processor 1700 , the memory 1800 , the input/output interface 1900 and the communication interface 2000 are connected to each other within the device through the bus 2100 .
  • the embodiment of the present application also provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions are used to cause a computer to execute a model training method or a text summary generation method.
  • the model training method includes: obtaining at least two original training data, wherein the original training data include original image data and original text data, and the original image data are in one-to-one correspondence with the original text data; performing multimodal encoding processing on the original image data to obtain the original image vector, and performing multimodal encoding processing on the original text data to obtain the original text vector; obtaining the original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain the original summary vector; constructing the first positive example pair according to the original summary vector and the original text vector; constructing the second positive example pair according to the original text vector and the original image vector; and performing contrastive learning training on the original summary generation model using the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • the text summary generation method includes: obtaining the text data to be generated and the image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate the target text summary; wherein the text summary generation model is trained according to the above model training method.
  • By executing the above-mentioned model training method or text summary generation method, the storage medium of the embodiment of the present application obtains the original training data; performs multimodal encoding processing on the original image data in the original training data to obtain the original image vector, and on the original text data in the original training data to obtain the original text vector; obtains the original summary data according to the original text data and the original image data; vectorizes the original summary data to obtain the original summary vector; then constructs the first positive example pair according to the original summary vector and the corresponding original text vector, and constructs the second positive example pair according to the original text vector and the corresponding original image vector; and finally performs contrastive learning training on the original summary generation model through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs to obtain the text summary generation model.
  • The resulting text summary generation model not only has the ability to generate text summaries, but also has an enhanced ability to semantically represent multimodal text and image data. Moreover, because the original summary generation model is trained by contrastive learning through the original summary data, the multiple first positive example pairs and the multiple second positive example pairs, the text summary generation model can fully consider the connection between the text and the image and the connection between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
  • the computer-readable storage medium may be non-volatile or volatile.
  • as a non-transitory computer-readable storage medium, the memory can store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network; examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the device embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • "at least one (item)" means one or more, and "multiple" means two or more.
  • "and/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural.
  • the character "/" generally indicates that the objects before and after it are in an "or" relationship.
  • "at least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items.
  • at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation, other divisions are possible.
  • multiple units or components may be combined or integrated into another system, and some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of another form.
  • a unit described as a separate component may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; it may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • the integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • if the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes multiple instructions that cause an electronic device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the embodiments of the present application.
  • the aforementioned storage medium includes various media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A model training method and apparatus, a text summary generation method and apparatus, and a device. The model training method performs multimodal encoding on original image data to obtain an original image vector and on original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from the original summary vector and the corresponding original text vector; constructs a second positive pair from the original text vector and the corresponding original image vector; and performs contrastive-learning training on an original summary generation model with the original summary data, multiple first positive pairs, and multiple second positive pairs to obtain a text summary generation model.

Description

Model training method and apparatus, text summary generation method and apparatus, and device
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on February 22, 2022, with application number 202210160816.5 and entitled "Model training method and apparatus, text summary generation method and apparatus, and device", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a model training method and apparatus, a text summary generation method and apparatus, and a device.
Background
With the growing number of video-sharing platforms, people can view images, videos, and text anytime and anywhere. Multimodal summarization aims to condense the information in multiple modalities into a short, concise, and readable text summary, so that users can quickly and easily grasp the main information in an image or video.
Technical Problem
The following is a technical problem of the prior art recognized by the inventors: when generating a multimodal text summary, the features of the text and the image are considered only in isolation, which makes the generated multimodal text summary inaccurate.
Technical Solution
In a first aspect, an embodiment of the present application provides a model training method for training a text summary generation model, the method including:
obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data, and the original image data correspond one-to-one with the original text data;
performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
obtaining original summary data according to the original text data and the original image data;
vectorizing the original summary data to obtain an original summary vector;
constructing a first positive pair according to the original summary vector and the original text vector;
constructing a second positive pair according to the original text vector and the original image vector;
performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
In a second aspect, an embodiment of the present application provides a text summary generation method, the method including:
obtaining text data to be generated and image data to be generated;
inputting the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary, wherein the text summary generation model is trained by a model training method; the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
In a third aspect, an embodiment of the present application provides a model training apparatus for training a text summary generation model, the training apparatus including:
a first acquisition module, configured to obtain at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence;
an encoding module, configured to perform multimodal encoding on the original image data to obtain an original image vector, and to perform multimodal encoding on the original text data to obtain an original text vector;
a first processing module, configured to obtain original summary data according to the original text data and the original image data;
a second processing module, configured to vectorize the original summary data to obtain an original summary vector;
a first construction module, configured to construct a first positive pair according to the original summary vector and the original text vector;
a second construction module, configured to construct a second positive pair according to the original text vector and the original image vector;
a training module, configured to perform contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
In a fourth aspect, an embodiment of the present application provides a text summary generation apparatus, the apparatus including:
a second acquisition module, configured to obtain text data to be generated and image data to be generated;
a summary generation module, configured to input the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary, wherein the text summary generation model is trained by a model training method; the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement a model training method or a text summary generation method;
wherein the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model;
wherein the text summary generation method includes: obtaining text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method.
In a sixth aspect, an embodiment of the present application provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions cause a computer to execute a model training method or a text summary generation method;
wherein the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model;
wherein the text summary generation method includes: obtaining text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method.
Beneficial Effects
The present application provides a model training method and apparatus, a text summary generation method and apparatus, and a device. The resulting text summary generation model not only has the ability to generate text summaries but also strengthens the semantic representation of multimodal text and image data, and when generating a target summary it fully considers the connection between the text and the image and between the text and the target summary, thereby improving the accuracy of the text summaries generated by the model.
Brief Description of the Drawings
Fig. 1 is a flowchart of a model training method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a specific implementation of step S200 in Fig. 1;
Fig. 3 is a flowchart of a specific implementation of step S300 in Fig. 1;
Fig. 4 is a flowchart of a specific implementation of step S230 in Fig. 2;
Fig. 5 is a flowchart of a specific implementation of step S700 in Fig. 1;
Fig. 6 is a flowchart of a text summary generation method provided by an embodiment of the present application;
Fig. 7 is a block diagram of a model training apparatus provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solution, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although a division into functional modules is shown in the apparatus diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
First, several terms used in the present application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. AI is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. AI can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Natural language processing (NLP): NLP uses computers to process, understand, and apply human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often called computational linguistics. NLP includes syntactic analysis, semantic analysis, and discourse understanding, and is commonly used in machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, public opinion analysis, and opinion mining; it involves data mining, machine learning, knowledge acquisition, knowledge engineering, and AI research related to language processing, as well as linguistic research related to language computation.
Encoder: converts an input sequence into a fixed-length vector.
Cross-modal encoder (CrossModal Encoder): cross-modal encoding refers to interactive encoding between input sequences, such as the interactive encoding between language and images; cross-modal encoding between language and images is implemented through a language encoder and an image encoder.
Decoder: converts the previously generated fixed vector back into an output sequence, where the input sequence may be text, speech, images, or video, and the output sequence may be text or images.
Mean pooling (MeanPooling): mean pooling takes the mean of all values in a local receptive field.
Multimodal abstractive summarization (MAS): multimodal summarization aims to condense information from multiple modalities (such as images, text, and audio) into a short, concise, and readable text summary.
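For the mean pooling operation defined above, a minimal sketch over a token-embedding matrix may help; the tensor shapes here are illustrative assumptions, not values fixed by the application:

```python
import torch

# hypothetical encoded matrix: L positions, D dimensions
hidden = torch.randn(8, 512)   # H ∈ R^{L×D}
pooled = hidden.mean(dim=0)    # mean pooling over the L positions -> a single vector in R^{D}
print(pooled.shape)            # torch.Size([512])
```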
With the growing number of video-sharing platforms, users can view images, videos, and text anytime and anywhere. However, the information carried by these images, videos, and text is broad, and it is difficult for users to grasp the key points directly. It is therefore necessary to analyze the images, videos, and text, extract the important information, and present it to users in textual form; multimodal summarization meets exactly this need.
The tasks of multimodal summarization include identifying the subject and generating words based on an understanding of the input. In the related art, language and visual features are combined and a Seq2Seq-based hierarchical attention method is used to generate summaries. Seq2Seq is a very important and popular model in natural language processing: an encoder vectorizes the input and integrates its semantics, and a decoder decodes the intermediate vector into text output; the parameters of the encoder and decoder can be trained with supervised data until the target model fits. This approach does not consider semantic alignment between data of different modalities, and direct information fusion easily accumulates redundant information. When generating a multimodal text summary, it considers the text and image features only in isolation, without considering the coherence between text and image, which makes the generated multimodal text summary inaccurate.
Based on this, embodiments of the present application provide a model training method and apparatus, a text summary generation method and apparatus, and a device, which can improve the accuracy of the summaries generated by the model.
The embodiments of the present application provide a model training method and apparatus, a text summary generation method and apparatus, and a device, described through the following embodiments; the model training method in the embodiments of the present application is described first.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.
The model training method and text summary generation method provided by the embodiments of the present application relate to the field of artificial intelligence, and in particular to the field of data mining. They may be applied in a terminal, on a server side, or as software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, tablet computer, laptop, desktop computer, smart watch, or the like; the server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and AI platforms; the software may be an application implementing the model training method or the text summary generation method, but is not limited to the above forms.
The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
The embodiments of the present application are further described below with reference to the accompanying drawings.
Referring to Fig. 1, in a first aspect, some embodiments of the present application provide a model training method for training a text summary generation model. The training method includes, but is not limited to, steps S100 to S700, which are described in detail below with reference to Fig. 1.
Step S100: obtain at least two pieces of original training data, wherein the original training data include original image data and original text data, and the original image data correspond one-to-one with the original text data.
In step S100 of some embodiments, the original training data form the training set and include original image data and original text data in one-to-one correspondence. The original image data are denoted I and the original text data T; the embodiments of the present application do not limit the specific content of the original image data and the original text data.
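For concreteness, one such training sample might be organized as sketched below; the field names and values are purely illustrative and are not prescribed by the application:

```python
# one training sample D_i: original image data I_i paired one-to-one with original text data T_i
sample = {
    "image": "photo_001.jpg",  # hypothetical path to the original image data I
    "text": "The city marathon drew a record number of runners this year.",  # original text data T
}
# a batch B is a list of M such samples
batch = [sample] * 4  # M = 4, for illustration only
```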
Step S200: perform multimodal encoding on the original image data to obtain an original image vector, and perform multimodal encoding on the original text data to obtain an original text vector.
Step S300: obtain original summary data according to the original text data and the original image data.
Step S400: vectorize the original summary data to obtain an original summary vector.
In step S400 of some embodiments, the original summary data are denoted Y. The original summary data Y are first encoded by the cross-modal encoder to obtain an original summary matrix; the original summary matrix is then mean-pooled and mapped to obtain the feature vector corresponding to the original summary data, i.e., the original summary vector, denoted $v_Y$.
Step S500: construct a first positive pair according to the original summary vector and the original text vector.
In step S500 of some embodiments, when training the model for input–output text consistency, the input training set is a batch containing M samples. To ensure that the input original text data and original summary data are topically consistent, the original summary vector and the original text vector corresponding to it are constructed into a first positive pair, while the other original text data and original summary data in the batch can be regarded as the first negative pairs of that first positive pair. For example, a batch can be expressed as B = {D_1, D_2, ..., D_M}, where each D_i in B includes original text data T_i and corresponding original summary data Y_i. Within the same batch, the original text data T_i of one sample and its corresponding original summary data Y_i can be regarded as a first positive pair, while the original text data and original summary data of the other samples in the batch form the corresponding first negative pairs. With this arrangement, the model can learn the topic consistency between the original text data and the original summary data, thereby achieving the training objective of input–output text consistency.
Step S600: construct a second positive pair according to the original text vector and the original image vector.
In step S600 of some embodiments, similarly to step S500, when training the model for consistency between input text and input image, the input training set is a batch containing M samples. To ensure that the input original text data and original image data are topically consistent, the original text vector and the original image vector corresponding to it are constructed into a second positive pair, while the other original image data and original text data in the batch can be regarded as the second negative pairs of that second positive pair. For example, a batch can be expressed as B = {D_1, D_2, ..., D_M}, where each D_i in B includes original text data T_i and corresponding original image data I_i. Within the same batch, the original text data T_i of one sample and its corresponding original image data I_i can be regarded as a second positive pair, while the original text data and original image data of the other samples in the batch form the corresponding second negative pairs. With this arrangement, the model can learn the topic consistency between the original text data and the original image data, thereby achieving the training objective of input text–image consistency.
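The in-batch pairing of steps S500 and S600 can be sketched as follows; the random tensors are stand-ins for the pooled vectors $v_T$, $v_Y$, and $v_I$ of each sample:

```python
import torch

M, D = 4, 512
v_t = torch.randn(M, D)  # original text vectors, one per sample
v_y = torch.randn(M, D)  # original summary vectors
v_i = torch.randn(M, D)  # original image vectors

# first positive pairs: (v_t[k], v_y[k]); first negatives: (v_t[k], v_y[j]) for j != k
# second positive pairs: (v_t[k], v_i[k]); second negatives: (v_t[k], v_i[j]) for j != k
first_pairs = [(v_t[k], v_y[k]) for k in range(M)]
second_pairs = [(v_t[k], v_i[k]) for k in range(M)]
```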
Step S700: perform contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model.
In step S700 of some embodiments, the target summary is the text summary, and the original summary generation model can be expressed by formula (1), of the standard autoregressive form:

$$P(Y \mid D; \theta) = \prod_{j=1}^{|Y|} p\left(y_j \mid y_{<j}, D; \theta\right) \qquad (1)$$

In formula (1), D denotes one sample in the training process and $y_j$ denotes the j-th word of the target summary. In the embodiments of the present application, the target summary is composed of words generated one by one, and the target summary is obtained by generating a number of words. The embodiments of the present application perform contrastive-learning training on the parameters θ of the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs, obtaining a model that simultaneously represents the consistency between input text and input image and the consistency between input text and output summary, so that the model takes both image data and text data into account when generating the target text summary, improving the accuracy of the text summaries generated by the text summary generation model.
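A compact sketch of the per-word generation loss implied by formula (1) (teacher forcing over gold summary tokens; the decoder outputs are stand-ins, since the application only fixes the form of the probability model):

```python
import torch
import torch.nn.functional as F

vocab, L_y = 30000, 20
logits = torch.randn(L_y, vocab)        # decoder outputs: one distribution per summary position
gold = torch.randint(0, vocab, (L_y,))  # gold summary words y_1 .. y_L

# negative log-likelihood of formula (1): -sum_j log p(y_j | y_<j, D; θ)
gen_loss = F.cross_entropy(logits, gold)  # averaged over positions
```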
The model training method of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data; and because the original summary generation model is trained by contrastive learning with the original summary data and the plurality of first and second positive pairs, the text summary generation model fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
Referring to Fig. 2, in some embodiments of the present application, step S200 includes steps S210, S220, S230, and S240. It should be understood that step S200 includes but is not limited to steps S210 to S240, which are described in detail below with reference to Fig. 2.
Step S210: perform cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix.
In step S210 of some embodiments, the original text data are cross-modally encoded by the cross-modal encoder to obtain the original text matrix, denoted $H_T$, with $H_T \in R^{L \times D}$, where L is the length of the original text data and D is the dimension of the mapping vector.
Step S220: perform pooling and mapping on the original text matrix to obtain the original text vector.
In step S220 of some embodiments, the original text matrix obtained in step S210 is mean-pooled and then mapped to obtain the original text vector, denoted $v_T$.
Step S230: perform cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix.
Step S240: perform pooling and mapping on the original image matrix to obtain the original image vector.
In steps S230 and S240, similarly to the preceding steps, the resulting original image matrix is denoted $H_I$, with $H_I \in R^{L \times D}$, and the original image vector is denoted $v_I$.
Referring to Fig. 3, in some embodiments of the present application, step S300 includes but is not limited to steps S310 and S320, which are described in detail below with reference to Fig. 3.
Step S310: concatenate the original text matrix and the original image matrix to obtain a target summary matrix.
Step S320: decode the target summary matrix with a preset decoder to obtain the original summary data.
In steps S310 and S320 of some embodiments, the original text matrix $H_T$ and the original image matrix $H_I$ are concatenated to obtain the target summary matrix, denoted H. The concatenation combines the features of the original text data and the original image data, so that the subsequent original generation model can learn the features of both. The target summary matrix H is then decoded by the decoder to obtain the original summary data Y. However, the original summary data obtained in this way do not take into account the topic consistency between the original text data and the original image data, or between the original text data and the original summary data. Therefore, the original summary data need to be trained by contrastive learning together with the first and second positive pairs constructed later, so that the text summary generation model learns the topic consistency between the original text data and the original image data, and between the original text data and the original summary data.
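A minimal sketch of the concatenate-then-decode step, under explicit assumptions: the concatenation axis and the single Transformer decoder layer standing in for the preset decoder are illustrative choices, not fixed by the application:

```python
import torch
import torch.nn as nn

L, D = 16, 512
h_t = torch.randn(L, D)           # original text matrix H_T
h_i = torch.randn(L, D)           # original image matrix H_I

h = torch.cat([h_t, h_i], dim=0)  # target summary matrix H (assumed: concatenation along the sequence axis)

# stand-in decoder: one Transformer decoder layer attending over H as memory
decoder_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8)
tgt = torch.randn(20, 1, D)              # embedded, shifted summary tokens (teacher forcing)
out = decoder_layer(tgt, h.unsqueeze(1))  # memory = H, shape (2L, 1, D)
```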
Referring to Fig. 4, in some embodiments of the present application, step S230 includes but is not limited to steps S231, S232, S233, and S234, which are described in detail below with reference to Fig. 4.
Step S231: pre-encode the original text data to obtain a text sub-vector matrix.
Step S232: pre-encode the original image data to obtain an image sub-vector matrix.
Step S233: obtain the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix.
Step S234: perform iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix to obtain the original image matrix.
Specifically, in this embodiment, assume the original text data contain L words. The original text data are first pre-encoded to obtain the text sub-vector matrix $H_{T1}$, and the original image data are pre-encoded to obtain the image sub-vector matrix $H_{I1}$; the transpose of the image sub-vector matrix is taken to obtain the image sub-vector transpose matrix $H_{I1}^{\top}$. Iterative processing over the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix then yields the original image matrix, which can be expressed by the following formula:

$$H_I^{(n)} = \operatorname{softmax}\!\left(\frac{H_T^{(n-1)} \left(H_I^{(n-1)}\right)^{\top}}{\sqrt{D}}\right) H_I^{(n-1)}, \quad n = 1, \dots, N \qquad (2)$$

In formula (2), the matrix on the left is updated N times, finally yielding the original image matrix $H_I = H_I^{(N)}$; the first input consists of the above $H_{T1}$ and $H_{I1}$, and each of the following N−1 layers takes the output of the previous layer as input. The factor $\tfrac{1}{\sqrt{D}}$ normalizes the vectors, i.e., smooths the values: if the dimension is large, the computed dot products are large, and dividing by $\sqrt{D}$ makes the values smaller and smoother.
Similarly, formula (3) yields the original text matrix:

$$H_T^{(n)} = \operatorname{softmax}\!\left(\frac{H_T^{(n-1)} \left(H_T^{(n-1)}\right)^{\top}}{\sqrt{D}}\right) H_T^{(n-1)}, \quad n = 1, \dots, N \qquad (3)$$
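A minimal sketch of this iterated scaled-dot-product update follows. The exact layer form behind formulas (2) and (3) is reconstructed from the surrounding description, so treat the details (per-layer update of both matrices, number of layers) as assumptions:

```python
import math
import torch

def cross_modal_encode(h_t1: torch.Tensor, h_i1: torch.Tensor, n_layers: int = 3):
    """Iterate the scaled dot-product updates of formulas (2) and (3).

    h_t1: text sub-vector matrix H_T1 ∈ R^{L×D}
    h_i1: image sub-vector matrix H_I1 ∈ R^{L'×D}
    """
    d = h_t1.size(-1)
    h_t, h_i = h_t1, h_i1
    for _ in range(n_layers):  # first layer consumes H_T1/H_I1, later layers the previous outputs
        attn_i = torch.softmax(h_t @ h_i.transpose(0, 1) / math.sqrt(d), dim=-1)
        attn_t = torch.softmax(h_t @ h_t.transpose(0, 1) / math.sqrt(d), dim=-1)
        h_i = attn_i @ h_i  # formula (2): original image matrix, L×D
        h_t = attn_t @ h_t  # formula (3): original text matrix, L×D
    return h_t, h_i
```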
Referring to Fig. 5, in some embodiments of the present application, step S700 includes but is not limited to steps S710, S720, S730, and S740, which are described in detail below with reference to Fig. 5.
Step S710: construct a first loss function according to the original summary data, the first positive pairs, and the corresponding first negative pairs.
Specifically, in step S710 of some embodiments, when training the model for input–output text consistency, the input training set is a batch containing M samples. To ensure that the input original text data and original summary data are topically consistent, the original summary vector and the original text vector corresponding to it are constructed into a first positive pair, while the other original text data and original summary data in the batch can be regarded as the first negative pairs of that first positive pair. For example, a batch can be expressed as B = {D_1, D_2, ..., D_M}, where each D_i in B includes original text data T_i and corresponding original summary data Y_i; within the same batch, the original text data T_i of a sample and its corresponding original summary data Y_i can be regarded as a first positive pair, while the original text data and original summary data of the other samples in the batch form the corresponding first negative pairs.
The first loss function is constructed from the original summary data, the first positive pairs, and the corresponding first negative pairs obtained above, and is expressed by formula (4):

$$\mathcal{L}_1 = -\sum_{i=1}^{M} \log \frac{\exp\!\left(\operatorname{sim}\!\left(v_{T_i}, v_{Y_i}\right) / \tau\right)}{\sum_{j=1}^{M} \exp\!\left(\operatorname{sim}\!\left(v_{T_i}, v_{Y_j}\right) / \tau\right)} \qquad (4)$$

In formula (4), sim( ) computes the cosine similarity between two vectors, τ is a hyperparameter used to control the speed of model fitting, $v_T$ denotes the original text vector, and $v_Y$ denotes the original summary vector.
Step S720: construct a second loss function according to the original summary data, the second positive pairs, and the corresponding second negative pairs.
In step S720 of some embodiments, when training the model for consistency between input text and input image, the input training set is a batch containing M samples. To ensure that the input original text data and original image data are topically consistent, the original text vector and the original image vector corresponding to it are constructed into a second positive pair, while the other original image data and original text data in the batch can be regarded as the second negative pairs of that second positive pair. For example, a batch can be expressed as B = {D_1, D_2, ..., D_M}, where each D_i in B includes original text data T_i and corresponding original image data I_i; within the same batch, the original text data T_i of a sample and its corresponding original image data I_i can be regarded as a second positive pair, while the original text data and original image data of the other samples in the batch form the corresponding second negative pairs.
Similarly to the preceding step, the second loss function is constructed from the original summary data, the second positive pairs, and the corresponding second negative pairs, and is expressed by formula (5):

$$\mathcal{L}_2 = -\sum_{i=1}^{M} \log \frac{\exp\!\left(\operatorname{sim}\!\left(v_{T_i}, v_{I_i}\right) / \tau\right)}{\sum_{j=1}^{M} \exp\!\left(\operatorname{sim}\!\left(v_{T_i}, v_{I_j}\right) / \tau\right)} \qquad (5)$$

In formula (5), sim( ) computes the cosine similarity between two vectors, τ is a hyperparameter used to control the speed of model fitting, $v_T$ denotes the original text vector, and $v_I$ denotes the original image vector.
Step S730: obtain a target loss function according to the first loss function and the second loss function.
In step S730 of some embodiments, the target loss function is constructed from the first loss function and the second loss function, and is expressed by formula (6):

$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 \qquad (6)$$
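Reusing the `contrastive_loss` and vector sketches above, the target loss of formula (6) then reads as follows, still under the same illustrative assumptions:

```python
# first loss: text vs. summary; second loss: text vs. image; target loss: their sum
l1 = contrastive_loss(v_t, v_y)  # formula (4)
l2 = contrastive_loss(v_t, v_i)  # formula (5)
target_loss = l1 + l2            # formula (6)
target_loss.backward()           # fine-tune the parameters θ of the original summary generation model
```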
Step S740: fine-tune the parameters of the original summary generation model according to the target loss function to obtain the text summary generation model.
In step S740 of some embodiments, the parameters θ of the original summary generation model shown in formula (1) are fine-tuned with the target loss function, thereby obtaining the text summary generation model.
Referring to Fig. 6, in a second aspect, some embodiments of the present application further provide a text summary generation method including steps S800 and S900. It should be understood that the text summary generation method of the embodiments of the present application includes but is not limited to steps S800 and S900, which are described in detail below with reference to Fig. 6.
Step S800: obtain text data to be generated and image data to be generated.
Step S900: input the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary; wherein the text summary generation model is trained by a model training method; the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
In this embodiment, the text data to be generated and the image data to be generated, for which a multimodal summary is needed, are input into the text summary generation model trained in the embodiments of the first aspect, thereby obtaining the multimodal target text summary corresponding to the text data to be generated and the image data to be generated.
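At inference time, the trained model is simply applied to a new text–image pair; a hypothetical call might look like the following, where `model` and its `generate` interface are illustrative stand-ins rather than part of the application:

```python
# hypothetical inference call on a new text–image pair
text = "Heavy rain flooded several streets in the city center on Monday."
image = "flood_photo.jpg"
summary = model.generate(text=text, image=image)  # returns the target text summary
print(summary)
```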
The text summary generation method of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data; and because the original summary generation model is trained by contrastive learning with the original summary data and the plurality of first and second positive pairs, the text summary generation model fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
Referring to Fig. 7, some embodiments of the present application further provide a model training apparatus for training a text summary generation model, the training apparatus including a first acquisition module 1000, an encoding module 1100, a first processing module 1200, a second processing module 1300, a first construction module 1400, a second construction module 1500, and a training module 1600.
The first acquisition module 1000 is configured to obtain at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence.
The encoding module 1100 is configured to perform multimodal encoding on the original image data to obtain an original image vector, and to perform multimodal encoding on the original text data to obtain an original text vector.
The first processing module 1200 is configured to obtain original summary data according to the original text data and the original image data.
The second processing module 1300 is configured to vectorize the original summary data to obtain an original summary vector.
The first construction module 1400 is configured to construct a first positive pair according to the original summary vector and the original text vector.
The second construction module 1500 is configured to construct a second positive pair according to the original text vector and the original image vector.
The training module 1600 is configured to perform contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
The model training apparatus of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data; and because the original summary generation model is trained by contrastive learning with the original summary data and the plurality of first and second positive pairs, the text summary generation model fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
It should be noted that the model training apparatus of the embodiments of the present application corresponds to the foregoing model training method; for the specific training or processing steps, refer to the foregoing model training method, which are not repeated here.
In a third aspect, some embodiments of the present application further provide a text summary generation apparatus including a second acquisition module and a summary generation module.
The second acquisition module is configured to obtain text data to be generated and image data to be generated.
The summary generation module is configured to input the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary; wherein the text summary generation model is trained by a model training method; the model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
The text summary generation apparatus of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data, and it fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
It should be noted that the text summary generation apparatus of the embodiments of the present application corresponds to the foregoing text summary generation method; for the specific operation steps or flow, refer to the foregoing text summary generation method, which are not repeated here.
An embodiment of the present application further provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement a model training method or a text summary generation method. The model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model. The text summary generation method includes: obtaining text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method. The electronic device may be any smart terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), or an in-vehicle computer.
By executing the above model training method or text summary generation method, the electronic device of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data; and because the original summary generation model is trained by contrastive learning with the original summary data and the plurality of first and second positive pairs, the text summary generation model fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
The electronic device of the embodiments of the present application is described in detail below with reference to Fig. 8.
As shown in Fig. 8, which illustrates the hardware structure of an electronic device of another embodiment, the electronic device includes:
a processor 1700, which may be implemented as a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solution provided by the embodiments of the present application;
a memory 1800, which may be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1800 may store an operating system and other application programs; when the technical solution provided by the embodiments of this specification is implemented by software or firmware, the related program code is stored in the memory 1800 and invoked by the processor 1700 to execute the model training method or the text summary generation method of the embodiments of the present application;
an input/output interface 1900 for inputting and outputting information;
a communication interface 2000 for communication between this device and other devices, over a wired connection (e.g., USB or a network cable) or a wireless connection (e.g., a mobile network, Wi-Fi, or Bluetooth);
a bus 2100 that transfers information between the components of the device (e.g., the processor 1700, the memory 1800, the input/output interface 1900, and the communication interface 2000);
wherein the processor 1700, the memory 1800, the input/output interface 1900, and the communication interface 2000 are communicatively connected to one another inside the device through the bus 2100.
An embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions cause a computer to execute a model training method or a text summary generation method. The model training method includes: obtaining at least two pieces of original training data, wherein the original training data include original image data and original text data in one-to-one correspondence; performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector; obtaining original summary data according to the original text data and the original image data; vectorizing the original summary data to obtain an original summary vector; constructing a first positive pair according to the original summary vector and the original text vector; constructing a second positive pair according to the original text vector and the original image vector; and performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model. The text summary generation method includes: obtaining text data to be generated and image data to be generated; and inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method.
By executing the above model training method or text summary generation method, the storage medium of the embodiments of the present application obtains original training data; performs multimodal encoding on the original image data in the original training data to obtain an original image vector and on the original text data to obtain an original text vector; obtains original summary data from the original text data and the original image data; vectorizes the original summary data to obtain an original summary vector; constructs a first positive pair from each original summary vector and its corresponding original text vector and a second positive pair from each original text vector and its corresponding original image vector; and finally performs contrastive-learning training on the original summary generation model with the original summary data, the plurality of first positive pairs, and the plurality of second positive pairs to obtain the text summary generation model. With this arrangement, the resulting text summary generation model not only can generate text summaries but also strengthens the semantic representation of multimodal text and image data; and because the original summary generation model is trained by contrastive learning with the original summary data and the plurality of first and second positive pairs, the text summary generation model fully considers the connection between the text and the image and between the text and the target summary when generating the target summary, thereby improving the accuracy of the text summaries it generates.
The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, the memory can store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are intended to explain the technical solution of the embodiments more clearly and do not limit the technical solution provided by the embodiments of the present application; those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solution provided by the embodiments of the present application is equally applicable to similar technical problems.
Those skilled in the art will understand that the technical solutions shown in the figures do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The device embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above and the functional modules/units of the systems and devices may be implemented as software, firmware, hardware, and appropriate combinations thereof.
The terms "first", "second", "third", "fourth", and the like (if present) in the specification of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in an order other than that illustrated or described here. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects, indicating that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes multiple instructions that cause an electronic device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of rights of the embodiments of the present application. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (20)

  1. A model training method, wherein the training method is used to train a text summary generation model, the method comprising:
    obtaining at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
    obtaining original summary data according to the original text data and the original image data;
    vectorizing the original summary data to obtain an original summary vector;
    constructing a first positive pair according to the original summary vector and the original text vector;
    constructing a second positive pair according to the original text vector and the original image vector;
    performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
  2. The training method according to claim 1, wherein the performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector, comprises:
    performing cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix;
    performing pooling and mapping on the original text matrix to obtain the original text vector;
    performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
    performing pooling and mapping on the original image matrix to obtain the original image vector.
  3. The training method according to claim 2, wherein the obtaining original summary data according to the original text data and the original image data comprises:
    concatenating the original text matrix and the original image matrix to obtain a target summary matrix;
    decoding the target summary matrix with a preset decoder to obtain the original summary data.
  4. The training method according to claim 2, wherein the performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix comprises:
    pre-encoding the original text data to obtain a text sub-vector matrix;
    pre-encoding the original image data to obtain an image sub-vector matrix;
    obtaining the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix;
    performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix to obtain the original image matrix.
  5. The training method according to any one of claims 1 to 4, wherein the performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and the second positive pairs to obtain the text summary generation model comprises:
    constructing a first loss function according to the original summary data, the first positive pairs, and corresponding first negative pairs;
    constructing a second loss function according to the original summary data, the second positive pairs, and corresponding second negative pairs;
    obtaining a target loss function according to the first loss function and the second loss function;
    fine-tuning the parameters of the original summary generation model according to the target loss function to obtain the text summary generation model.
  6. A text summary generation method, wherein the method comprises:
    obtaining text data to be generated and image data to be generated;
    inputting the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary, wherein the text summary generation model is trained by a model training method; wherein the model training method comprises:
    obtaining at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
    obtaining original summary data according to the original text data and the original image data;
    vectorizing the original summary data to obtain an original summary vector;
    constructing a first positive pair according to the original summary vector and the original text vector;
    constructing a second positive pair according to the original text vector and the original image vector;
    performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
  7. The text summary generation method according to claim 6, wherein the performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector, comprises:
    performing cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix;
    performing pooling and mapping on the original text matrix to obtain the original text vector;
    performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
    performing pooling and mapping on the original image matrix to obtain the original image vector.
  8. The text summary generation method according to claim 7, wherein the obtaining original summary data according to the original text data and the original image data comprises:
    concatenating the original text matrix and the original image matrix to obtain a target summary matrix;
    decoding the target summary matrix with a preset decoder to obtain the original summary data.
  9. The text summary generation method according to claim 7, wherein the performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix comprises:
    pre-encoding the original text data to obtain a text sub-vector matrix;
    pre-encoding the original image data to obtain an image sub-vector matrix;
    obtaining the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix;
    performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix to obtain the original image matrix.
  10. The text summary generation method according to any one of claims 6 to 9, wherein the performing contrastive-learning training on the original summary generation model with the original summary data, a plurality of the first positive pairs, and the second positive pairs to obtain the text summary generation model comprises:
    constructing a first loss function according to the original summary data, the first positive pairs, and corresponding first negative pairs;
    constructing a second loss function according to the original summary data, the second positive pairs, and corresponding second negative pairs;
    obtaining a target loss function according to the first loss function and the second loss function;
    fine-tuning the parameters of the original summary generation model according to the target loss function to obtain the text summary generation model.
  11. A model training apparatus, wherein the training apparatus is used to train a text summary generation model, the training apparatus comprising:
    a first acquisition module, configured to obtain at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    an encoding module, configured to perform multimodal encoding on the original image data to obtain an original image vector, and to perform multimodal encoding on the original text data to obtain an original text vector;
    a first processing module, configured to obtain original summary data according to the original text data and the original image data;
    a second processing module, configured to vectorize the original summary data to obtain an original summary vector;
    a first construction module, configured to construct a first positive pair according to the original summary vector and the original text vector;
    a second construction module, configured to construct a second positive pair according to the original text vector and the original image vector;
    a training module, configured to perform contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
  12. A text summary generation apparatus, wherein the text summary generation apparatus comprises:
    a second acquisition module, configured to obtain text data to be generated and image data to be generated;
    a summary generation module, configured to input the text data to be generated and the image data to be generated into a text summary generation model to generate a target text summary, wherein the text summary generation model is trained by a model training method; wherein the model training method comprises:
    obtaining at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
    obtaining original summary data according to the original text data and the original image data;
    vectorizing the original summary data to obtain an original summary vector;
    constructing a first positive pair according to the original summary vector and the original text vector;
    constructing a second positive pair according to the original text vector and the original image vector;
    performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model.
  13. An electronic device, comprising:
    at least one memory;
    at least one processor;
    at least one program;
    the program is stored in the memory, and the processor executes the at least one program to implement a model training method or a text summary generation method;
    wherein the model training method comprises:
    obtaining at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
    obtaining original summary data according to the original text data and the original image data;
    vectorizing the original summary data to obtain an original summary vector;
    constructing a first positive pair according to the original summary vector and the original text vector;
    constructing a second positive pair according to the original text vector and the original image vector;
    performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model;
    wherein the text summary generation method comprises:
    obtaining text data to be generated and image data to be generated;
    inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method.
  14. The electronic device according to claim 13, wherein the performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector, comprises:
    performing cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix;
    performing pooling and mapping on the original text matrix to obtain the original text vector;
    performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
    performing pooling and mapping on the original image matrix to obtain the original image vector.
  15. The electronic device according to claim 14, wherein the obtaining original summary data according to the original text data and the original image data comprises:
    concatenating the original text matrix and the original image matrix to obtain a target summary matrix;
    decoding the target summary matrix with a preset decoder to obtain the original summary data.
  16. The electronic device according to claim 14, wherein the performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix comprises:
    pre-encoding the original text data to obtain a text sub-vector matrix;
    pre-encoding the original image data to obtain an image sub-vector matrix;
    obtaining the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix;
    performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix to obtain the original image matrix.
  17. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions cause a computer to execute a model training method or a text summary generation method;
    wherein the model training method comprises:
    obtaining at least two pieces of original training data, wherein the original training data comprise original image data and original text data, and the original image data correspond one-to-one with the original text data;
    performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
    obtaining original summary data according to the original text data and the original image data;
    vectorizing the original summary data to obtain an original summary vector;
    constructing a first positive pair according to the original summary vector and the original text vector;
    constructing a second positive pair according to the original text vector and the original image vector;
    performing contrastive-learning training on an original summary generation model with the original summary data, a plurality of the first positive pairs, and a plurality of the second positive pairs to obtain the text summary generation model;
    wherein the text summary generation method comprises:
    obtaining text data to be generated and image data to be generated;
    inputting the text data to be generated and the image data to be generated into the text summary generation model to generate a target text summary, wherein the text summary generation model is trained by the model training method.
  18. The storage medium according to claim 17, wherein the performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector, comprises:
    performing cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix;
    performing pooling and mapping on the original text matrix to obtain the original text vector;
    performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
    performing pooling and mapping on the original image matrix to obtain the original image vector.
  19. The storage medium according to claim 18, wherein the obtaining original summary data according to the original text data and the original image data comprises:
    concatenating the original text matrix and the original image matrix to obtain a target summary matrix;
    decoding the target summary matrix with a preset decoder to obtain the original summary data.
  20. The storage medium according to claim 18, wherein the performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix comprises:
    pre-encoding the original text data to obtain a text sub-vector matrix;
    pre-encoding the original image data to obtain an image sub-vector matrix;
    obtaining the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix;
    performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix to obtain the original image matrix.
PCT/CN2022/090729 2022-02-22 2022-04-29 Model training method and apparatus, text summary generation method and apparatus, and device WO2023159763A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210160816.5A CN114519395B (zh) 2022-02-22 2022-02-22 Model training method and apparatus, text summary generation method and apparatus, and device
CN202210160816.5 2022-02-22

Publications (1)

Publication Number Publication Date
WO2023159763A1 true WO2023159763A1 (zh) 2023-08-31

Family

ID=81598766

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090729 WO2023159763A1 (zh) 2022-02-22 2022-04-29 Model training method and apparatus, text summary generation method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN114519395B (zh)
WO (1) WO2023159763A1 (zh)


Also Published As

Publication number Publication date
CN114519395B (zh) 2024-05-14
CN114519395A (zh) 2022-05-20
