CN114648110A - Model training method and device, electronic equipment and computer storage medium - Google Patents

Model training method and device, electronic equipment and computer storage medium

Info

Publication number
CN114648110A
Authority
CN
China
Prior art keywords: training, decoder, data, model, encoder
Prior art date
2020-12-18
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011505035.2A
Other languages
Chinese (zh)
Inventor
桂敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-18
Filing date
2020-12-18
Publication date
2022-06-21
Application filed by Alibaba Group Holding Ltd
Priority to CN202011505035.2A
Publication of CN114648110A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a model training method and apparatus, an electronic device and a computer storage medium. The model training method includes: acquiring pre-training sample data, wherein the pre-training sample data comprises multi-modal data; pre-training an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder; acquiring a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation; and pre-training a decoder in the neural network model by using the feature representation and the pre-training reference sample. The encoder and the decoder are trained in stages, which improves the training effect and the training efficiency of the model.

Description

Model training method and device, electronic equipment and computer storage medium
Technical Field
Embodiments of the present application relate to the technical field of artificial intelligence, and in particular, to a model training method and apparatus, an electronic device and a computer storage medium.
Background
With the development of science and technology, information dissemination on the Internet increasingly favors more intuitive modalities such as images and videos, and multi-modal data is being applied to more and more areas of information dissemination and storage. Data can be divided by carrier type into text, images, video, speech and the like, and multi-modal data is data that spans multiple carrier types. Although multi-modal data is more intuitive, in some scenarios much of the information still has to be written out as text by the user on the basis of the multi-modal data. For example, in an e-commerce scenario, the selling points of a commodity, frequently asked questions and the like need to be filled in manually according to the images, videos or text description of the commodity; for another example, in a live-streaming scenario, pages such as live topics and keyword descriptions need to be filled in manually by the user according to the content of images or video. This can consume a large amount of labor time and cost. Taking commodity selling points as an example, the images, text description and the like of a commodity usually reveal its selling points, which can only be obtained by intelligently identifying and extracting from those images and texts. If a neural network model is used to fill them in automatically, manual annotation is needed to train the model; because the selling points are extracted from multi-modal data such as the commodity's images and text description, the images and the text descriptions must each be annotated, and the cost of manual annotation is too high. Therefore, for neural network models that process multi-modal data, the model training efficiency is low and the model training effect is poor.
Disclosure of Invention
In view of the above, embodiments of the present application provide a model training method, apparatus, electronic device and computer storage medium to solve some or all of the above problems.
According to a first aspect of the embodiments of the present application, there is provided a model training method, including: acquiring pre-training sample data, wherein the pre-training sample data comprises multi-modal data; pre-training an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder; acquiring a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation; and pre-training a decoder in the neural network model by using the feature representation and the pre-training reference sample.
According to a second aspect of the embodiments of the present application, there is provided a model training apparatus, including: a sample module, configured to acquire pre-training sample data, the pre-training sample data comprising multi-modal data; an encoder module, configured to pre-train an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder; a feature representation module, configured to acquire a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation; and a decoder module, configured to pre-train a decoder in the neural network model by using the feature representation and the pre-training reference sample.
According to a third aspect of the embodiments of the present application, there is provided an electronic device, including a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the model training method of the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the model training method as in the first aspect.
According to the model training method and apparatus, the electronic device and the computer storage medium provided by the embodiments of the present application, pre-training sample data comprising multi-modal data is acquired; an encoder in a neural network model is pre-trained by using the pre-training sample data to obtain a pre-trained encoder; a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation, are acquired; and a decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample. The encoder is pre-trained first, and the decoder is then trained by using the feature representation output by the pre-trained encoder. Training the encoder and the decoder in stages makes model training easier to converge and improves the model training effect; because no manual annotation is needed, labor cost is reduced and the training efficiency of the model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present application; those skilled in the art can obtain other drawings from these drawings.
Fig. 1 is a schematic view of a scenario of a model training method according to an embodiment of the present application;
Fig. 2 is a flowchart of a model training method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an encoder according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the operation of a decoder according to an embodiment of the present application;
Fig. 5 is a schematic view of an application scenario of a neural network model according to an embodiment of the present application;
Fig. 6 is a block diagram of a model training apparatus according to the second embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to the third embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Example one
For ease of understanding, an application scenario of the model training method is described first. As shown in fig. 1, fig. 1 is a schematic view of a scenario of the model training method provided in the first embodiment of the present application. The scenario shown in fig. 1 includes a model training apparatus 101, which may be an apparatus for executing the model training method provided in the first embodiment of the present application.
The model training apparatus 101 may be a terminal device such as a notebook computer or a desktop computer, or may be a server or the like. As shown in fig. 1, the model training apparatus 101 may acquire pre-training sample data comprising multi-modal data, pre-train an encoder in a neural network model by using the pre-training sample data, and then pre-train a decoder in the neural network model by using the feature representation output by the pre-trained encoder and the corresponding pre-training reference sample. Pre-training the encoder and the decoder in stages yields a pre-trained neural network model.
The model training method provided in the first embodiment of the present application is described in detail below with reference to the scenario shown in fig. 1. It should be noted that fig. 1 is only one application scenario of the method and does not mean that the method must be applied to the scenario shown in fig. 1. Referring to fig. 2, fig. 2 is a flowchart of a model training method provided in the first embodiment of the present application; the method includes the following steps:
Step 201: pre-training sample data is obtained.
The pre-training sample data comprises multi-modal data. In this application, multi-modal data is data that comprises multiple carrier types; for example, the multi-modal data may comprise at least two types of data among text, images, video and speech.
Step 202: the encoder in the neural network model is pre-trained by using the pre-training sample data to obtain a pre-trained encoder.
The neural network model includes an encoder and a decoder, and the encoder may be a denoising autoencoder; in step 202, the encoder is pre-trained first. The encoder is configured to perform feature extraction on input data and output a feature representation; in this application, the data output by the encoder is defined as a feature representation. It should be noted that there may be multiple pieces of pre-training sample data. The pre-training sample data is input into the encoder, a loss function value is calculated from the output feature representation, and the parameters of the encoder are adjusted according to the loss function value until the loss function value is less than or equal to a preset value.
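For illustration only, and not as part of the claimed method, the following is a minimal PyTorch-style sketch of such an encoder pre-training loop; the MultimodalEncoder architecture, the MSE reconstruction objective and the preset loss value are all assumptions rather than details specified by this application.

```python
import torch
import torch.nn as nn

# A hypothetical multimodal encoder: text and image features are projected
# into a shared space, concatenated and fused by a Transformer.
class MultimodalEncoder(nn.Module):
    def __init__(self, text_dim=300, image_dim=512, hidden_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Reconstruction head used only during denoising pre-training.
        self.recon = nn.Linear(hidden_dim, text_dim)

    def forward(self, text_feats, image_feats):
        tokens = torch.cat([self.text_proj(text_feats),
                            self.image_proj(image_feats)], dim=1)
        return self.fusion(tokens)  # the "feature representation"

def pretrain_encoder(encoder, batches, preset_loss=0.05, lr=1e-4):
    """Adjust the encoder's parameters until the loss function value is
    less than or equal to the preset value, as described in step 202."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for noisy_text, image, clean_text in batches:
        features = encoder(noisy_text, image)
        text_len = clean_text.size(1)
        # Denoising objective: reconstruct the clean text features from the
        # feature representation of the noisy (e.g. masked) input; the text
        # tokens occupy the first text_len positions of the fused sequence.
        loss = criterion(encoder.recon(features[:, :text_len]), clean_text)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= preset_loss:
            break
    return encoder
```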
Optionally, in an embodiment of the present application, pre-training the encoder in the neural network model by using the pre-training sample data to obtain a pre-trained encoder includes: adding noise to the data of at least one modality in the pre-training sample data to obtain pre-training sample data containing noise; and inputting the pre-training sample data containing noise into the encoder and pre-training the encoder. Adding noise to the pre-training sample data may mean masking part of the data. Further, adding noise to the data of at least one modality in the pre-training sample data to obtain pre-training sample data containing noise includes: masking data of at least one modality in the pre-training sample data to obtain the pre-training sample data containing noise.
As shown in fig. 3, fig. 3 is a schematic diagram of an encoder according to an embodiment of the present application. In fig. 3, the pre-training sample data includes text data and image data; the text data is divided into 6 parts (i.e., 6 phrases) denoted X1-X6, and the image data is divided into 4 parts denoted Y1-Y4. Noise may be added to either type of data or to both. In fig. 3, noise is added to the text data as an example: a Phrase-based Masked Language Model (PMLM) uses a phrase structure tree to extract the phrase structure of the pre-training sample data and masks it with the phrase as the minimum granularity. X2 and X3 are masked, and then X1, X4, X5, X6 and Y1-Y4 are input into the encoder to train it. After part of the data is masked, the semantics of the pre-training sample data become incomplete; the missing text data can be predicted from the image data, which improves the learning ability of the encoder.
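As a simple illustration of phrase-granularity masking (the [MASK] token and the choice of masked phrases follow the fig. 3 example; in practice the phrase boundaries would come from the phrase structure tree):

```python
MASK = "[MASK]"

def mask_phrases(phrases, masked_indices):
    # The phrase is the minimum granularity: a masked phrase is replaced in
    # its entirety rather than word by word.
    return [MASK if i in masked_indices else p for i, p in enumerate(phrases)]

# Text divided into 6 phrases X1-X6 as in fig. 3; X2 and X3 are masked.
text_phrases = ["X1", "X2", "X3", "X4", "X5", "X6"]
noisy_text = mask_phrases(text_phrases, {1, 2})
# noisy_text == ['X1', '[MASK]', '[MASK]', 'X4', 'X5', 'X6']
# noisy_text together with the image parts Y1-Y4 is then input into the
# encoder, which learns to predict the masked phrases from the image data.
```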
Step 203: the feature representation output after the pre-trained encoder processes the pre-training sample data, and the pre-training reference sample corresponding to the feature representation, are obtained.
It should be noted that the decoder is configured to parse the feature representation output by the encoder to obtain the output of the decoder, and the output of the decoder may include text content. The pre-training reference sample is the expected output of the decoder; the pre-training reference sample corresponding to a feature representation is what the decoder is expected to output after that feature representation is input into the decoder.
Step 204: the decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample.
It should be noted that the pre-training sample data used to pre-train the encoder and the decoder may be the same or different. Optionally, in an embodiment of the present application, pre-training the decoder in the neural network model by using the feature representation and the pre-training reference sample includes: adding noise to the pre-training reference sample to obtain a pre-training reference sample containing noise; and inputting the pre-training reference sample containing noise and the corresponding feature representation into the decoder to pre-train the decoder. Adding noise to the pre-training reference sample improves the learning ability of the decoder and enhances its effect. Further, inputting the pre-training reference sample containing noise and the corresponding feature representation into the decoder and pre-training the decoder includes: inputting the feature representation into the decoder to obtain the corresponding decoder output, comparing the decoder output with the pre-training reference sample, and adjusting the parameters of the decoder according to the comparison result while keeping the parameters of the pre-trained encoder fixed. Because the pre-training of the encoder is already complete, its parameters are fixed and only the parameters of the decoder are adjusted, which accelerates model convergence and keeps the pre-training and fine-tuning stages consistent.
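A sketch of this staged decoder pre-training, under the same caveats as above (the decoder's call signature and the cross-entropy objective are assumptions, not details specified by this application), might look as follows:

```python
import torch
import torch.nn as nn

def pretrain_decoder(encoder, decoder, samples, lr=1e-4):
    """Pre-train the decoder on (feature representation, reference) pairs
    while keeping the pre-trained encoder's parameters fixed."""
    for p in encoder.parameters():
        p.requires_grad = False          # encoder parameters stay fixed
    encoder.eval()
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for text, image, noisy_reference, reference_ids in samples:
        with torch.no_grad():
            features = encoder(text, image)   # feature representation
        # Assumed decoder interface: noisy reference tokens plus the feature
        # representation in, vocabulary logits out.
        logits = decoder(noisy_reference, features)
        # Compare the decoder output with the clean pre-training reference
        # sample and adjust only the decoder's parameters.
        loss = criterion(logits.flatten(0, 1), reference_ids.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder
```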
As shown in fig. 4, fig. 4 is a schematic diagram of the operation of a decoder according to an embodiment of the present application. In fig. 4, pre-training sample data is input into the encoder, which outputs a feature representation divided into 6 parts denoted Z1-Z6; correspondingly, the reference data is divided into 6 phrases denoted C1-C6. C2 and C3 may be masked; Z1-Z6 are input into the decoder, which then predicts C2 and C3, and the output of the decoder is used to adjust its parameters.
It should be noted that, optionally, noise may be added to the pre-training reference sample by using a Masked Region Classification Model (MRC) with linguistic cues. Two example implementations of adding noise to the pre-training reference sample are described below, followed by a short sketch of both.
Optionally, in a first implementation, adding noise to the pre-training reference sample to obtain a pre-training reference sample containing noise includes: dividing the pre-training reference sample into at least two phrases and shuffling the at least two phrases to obtain the pre-training reference sample containing noise.
Optionally, in a second implementation, adding noise to the pre-training reference sample to obtain a pre-training reference sample containing noise includes: performing phrase deletion or phrase masking on the pre-training reference sample to obtain the pre-training reference sample containing noise.
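Both implementations can be sketched in a few lines (the phrase boundaries, the noise probability and the mask token are illustrative choices, not values specified by this application):

```python
import random

def shuffle_phrases(phrases, seed=None):
    # First implementation: divide the reference sample into phrases and
    # shuffle them.
    rng = random.Random(seed)
    noisy = list(phrases)
    rng.shuffle(noisy)
    return noisy

def delete_or_mask_phrases(phrases, noise_prob=0.3, mask_token=None, seed=None):
    # Second implementation: delete phrases (mask_token=None) or replace
    # them with a mask token.
    rng = random.Random(seed)
    noisy = []
    for phrase in phrases:
        if rng.random() < noise_prob:
            if mask_token is not None:
                noisy.append(mask_token)  # phrase masking
            # otherwise the phrase is deleted entirely
        else:
            noisy.append(phrase)
    return noisy

reference = ["C1", "C2", "C3", "C4", "C5", "C6"]  # as in fig. 4
print(shuffle_phrases(reference, seed=0))
print(delete_or_mask_phrases(reference, mask_token="[MASK]", seed=0))
```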
Optionally, after steps 201 to 204, the decoder may be further trained according to different decoding tasks so as to implement those tasks. For example, after the decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample, the method further includes: acquiring model training sample data and inputting the model training sample data into the pre-trained encoder to obtain a feature representation; and inputting the feature representation and a model reference sample corresponding to the decoding task into the pre-trained decoder and training the pre-trained decoder. After the decoder is further trained for a specific decoding task, it can solve different problems; for example, commodity selling points can be generated automatically, search results can be pushed, and question-and-answer information can be pushed. Three specific examples are given below, followed by a fine-tuning sketch that fits all three:
Optionally, in a first example, the neural network model is used to generate commodity selling points, the model training sample data includes multi-modal data of a commodity, and the model reference sample includes selling point data of the commodity. Inputting the feature representation and the model reference sample corresponding to the decoding task into the pre-trained decoder and training the pre-trained decoder includes: inputting the feature representation and the selling point data of the corresponding commodity into the pre-trained decoder and training the decoder. The multi-modal data of the commodity is input into the encoder to obtain the feature representation of the corresponding commodity, and the feature representation of the commodity and the selling point data of the corresponding commodity are input into the decoder to train the decoder. After the decoder is trained, the trained neural network model is obtained; once the multi-modal data of a commodity is input into the neural network model, the selling point data of the commodity can be generated automatically.
Optionally, in a second example, the neural network model is used for intelligent search, the model training sample data includes multi-modal data for searching, and the model reference sample includes a search result. Inputting the feature representation and the model reference sample corresponding to the decoding task into the pre-trained decoder and training the pre-trained decoder includes: inputting the feature representation and the corresponding search result into the pre-trained decoder and training the decoder. The multi-modal data for searching is input into the encoder to obtain the corresponding feature representation, and the feature representation and the corresponding search result are input into the decoder to train the decoder. After the decoder is trained, the trained neural network model is obtained; once multi-modal data for searching is input into the neural network model, a search result can be obtained automatically.
Optionally, in a third example, the neural network model is used for intelligent question answering, the model training sample data includes multi-modal data for questioning, and the model reference sample includes question-and-answer data. Inputting the feature representation and the model reference sample corresponding to the decoding task into the pre-trained decoder and training the pre-trained decoder includes: inputting the feature representation and the corresponding question-and-answer data into the pre-trained decoder and training the decoder. The multi-modal data for questioning is input into the encoder to obtain the corresponding feature representation, and the feature representation and the corresponding question-and-answer data are input into the decoder to train the decoder. After the decoder is trained, the trained neural network model is obtained; once multi-modal data for questioning is input into the neural network model, question-and-answer data can be obtained, that is, relevant questions and answers are pushed automatically according to the multi-modal data input by the user.
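In all three examples the fine-tuning loop has the same shape; only the task-specific reference changes. A sketch follows, again assuming the hypothetical encoder/decoder interfaces used above; whether the encoder is also updated during fine-tuning is not specified by the application, so it is kept fixed here for consistency with the pre-training stage:

```python
import torch
import torch.nn as nn

def finetune_decoder(encoder, decoder, task_samples, lr=5e-5):
    # task_samples yields (text, image, reference_in, reference_ids), where
    # the reference is selling-point data, a search result or
    # question-and-answer data, depending on the decoding task.
    for p in encoder.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for text, image, reference_in, reference_ids in task_samples:
        with torch.no_grad():
            features = encoder(text, image)   # feature representation
        logits = decoder(reference_in, features)
        loss = criterion(logits.flatten(0, 1), reference_ids.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder
```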
Based on the above three examples, after the pre-training of the encoder and the decoder is completed, the decoder may be further trained according to the decoding task, and after that training is completed, different decoding tasks can be performed by using the trained neural network model. A specific application scenario is described here. As shown in fig. 5, the application scenario includes a terminal device 501, a cloud 502 and a user 503. It should be noted that the terminal device 501 may access a network and connect to the cloud 502 through the network. In the present application, the network includes a Local Area Network (LAN), a Wide Area Network (WAN) and mobile communication networks, such as the World Wide Web (WWW), Long Term Evolution (LTE) networks, 2G (2nd-generation), 3G (3rd-generation) and 5G (5th-generation) mobile networks, etc. Of course, this is merely an example and does not limit the present application. The cloud 502 may include the model training apparatus 101 shown in fig. 1, which may be a server, a relay device, a Device-to-Device (D2D) device, and so on.
The user 503 inputs multi-modal data on the terminal device 501, and the terminal device 501 transmits the multi-modal data to the cloud 502. The cloud 502 processes the multi-modal data by using the trained neural network model: specifically, the encoder performs feature extraction on the multi-modal data to obtain the corresponding feature representation, and the decoder then parses the feature representation to obtain the corresponding model output. The cloud 502 returns the model output to the terminal device 501, and the user can view it on the terminal device 501. For example, a user inputs pictures and text of a commodity on the terminal device 501, and the terminal device 501 interacts with the cloud 502 to show the selling points of the commodity to the user; for another example, a user inputs multi-modal data for searching on the terminal device 501, and the terminal device 501 interacts with the cloud 502 to display a search result to the user; for another example, a user inputs multi-modal data for questioning on the terminal device 501, and the terminal device 501 interacts with the cloud 502 to display question-and-answer data to the user, that is, to push questions and answers automatically. Of course, the above is merely illustrative.
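On the cloud side, serving a request then reduces to one encoder pass and one decoder pass; the generate() method below is an assumed autoregressive decoding interface, not one defined by this application:

```python
import torch

def serve_request(encoder, decoder, text, image):
    # The encoder extracts the feature representation from the user's
    # multi-modal input; the decoder parses it into the model output
    # (selling points, a search result, or question-and-answer data).
    with torch.no_grad():
        features = encoder(text, image)
        return decoder.generate(features)  # assumed decoding interface
```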
According to the model training method provided in this embodiment of the present application, pre-training sample data comprising multi-modal data is acquired; an encoder in a neural network model is pre-trained by using the pre-training sample data to obtain a pre-trained encoder; a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation, are acquired; and a decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample. The encoder is pre-trained first, and the decoder is then trained by using the feature representation output by the pre-trained encoder. Training the encoder and the decoder in stages makes model training easier to converge and improves the model training effect; because no manual annotation is needed, labor cost is reduced and the training efficiency of the model is improved.
Example two
Based on the method described in the first embodiment, a second embodiment of the present application provides a model training apparatus for performing that method. Referring to fig. 6, the model training apparatus 60 includes:
a sample module 601, configured to obtain pre-training sample data, where the pre-training sample data includes multi-modal data;
an encoder module 602, configured to pre-train an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder;
a feature representation module 603, configured to acquire a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation;
a decoder module 604 for pre-training a decoder in the neural network model using the feature representation and the pre-training reference samples.
Optionally, in an embodiment of the present application, the encoder module 602 is configured to add noise to the data of at least one modality in the pre-training sample data to obtain pre-training sample data containing noise, input the pre-training sample data containing noise into the encoder, and pre-train the encoder to obtain the pre-trained encoder.
Optionally, in an embodiment of the present application, the encoder module 602 is configured to mask data of at least one modality in the pre-training sample data to obtain the pre-training sample data containing noise.
Optionally, in an embodiment of the present application, the decoder module 604 is configured to add noise to the pre-training reference sample to obtain a pre-training reference sample containing noise, and input the pre-training reference sample containing noise and the corresponding feature representation into the decoder to pre-train the decoder.
Optionally, in an embodiment of the present application, the decoder module 604 is configured to input the feature representation into the decoder to obtain the corresponding decoder output, compare the decoder output with the pre-training reference sample, and adjust the parameters of the decoder according to the comparison result with the parameters of the pre-trained encoder fixed.
Optionally, in an embodiment of the present application, the decoder module 604 is configured to divide the pre-training reference sample into at least two phrases, shuffle the at least two phrases, and obtain the pre-training reference sample containing noise.
Optionally, in an embodiment of the present application, the decoder module 604 is configured to perform phrase deletion or phrase masking on the pre-training reference samples and obtain the pre-training reference samples containing noise.
Optionally, in an embodiment of the present application, as shown in fig. 6, the model training apparatus 60 further includes a training module 605, configured to acquire model training sample data, input the model training sample data into the pre-trained encoder to obtain a feature representation, and input the feature representation and the model reference sample corresponding to the decoding task into the pre-trained decoder to train the pre-trained decoder.
Optionally, in an embodiment of the present application, the model training sample data comprises multi-modal data of a commodity, and the model reference sample comprises selling point data of the commodity; the training module 605 is configured to input the feature representation and the selling point data of the corresponding commodity into the pre-trained decoder and train the decoder.
Optionally, in an embodiment of the present application, the model training sample data comprises multi-modal data for searching, and the model reference sample comprises a search result; the training module 605 is configured to input the feature representation and the corresponding search result into the pre-trained decoder and train the decoder.
Optionally, in an embodiment of the present application, the model training sample data comprises multi-modal data for questioning, and the model reference sample comprises question-and-answer data; the training module 605 is configured to input the feature representation and the corresponding question-and-answer data into the pre-trained decoder and train the decoder.
According to the model training apparatus provided in this embodiment of the present application, pre-training sample data comprising multi-modal data is acquired; an encoder in a neural network model is pre-trained by using the pre-training sample data to obtain a pre-trained encoder; a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation, are acquired; and a decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample. The encoder is pre-trained first, and the decoder is then trained by using the feature representation output by the pre-trained encoder. Training the encoder and the decoder in stages makes model training easier to converge and improves the model training effect; because no manual annotation is needed, labor cost is reduced and the training efficiency of the model is improved.
Example three
Based on the method described in the first embodiment, a third embodiment of the present application provides an electronic device for executing the method described in the first embodiment. Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device provided in the third embodiment of the present application; the specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in fig. 7, the electronic device may include: a processor (processor)702, a Communications Interface 704, a memory 706, and a communication bus 708.
Wherein:
the processor 702, communication interface 704, and memory 706 communicate with each other via a communication bus 708.
A communication interface 704 for communicating with other electronic devices, such as a terminal device or a server.
The processor 702 is configured to execute the program 710, and may specifically execute the relevant steps in the foregoing method embodiments.
In particular, the program 710 may include program code that includes computer operating instructions.
The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device includes one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 706 stores a program 710. The memory 706 may comprise high-speed RAM memory, and may also include non-volatile memory, such as disk memory.
The program 710 may be specifically configured to cause the processor 702 to execute any one of the methods of the first embodiment.
For the specific implementation of each step in the program 710, reference may be made to the corresponding steps and the descriptions of the corresponding units in the above model training method embodiment, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are likewise not repeated here.
The electronic device provided in this embodiment of the present application acquires pre-training sample data comprising multi-modal data; pre-trains an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder; acquires a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation; and pre-trains a decoder in the neural network model by using the feature representation and the pre-training reference sample. The encoder is pre-trained first, and the decoder is then trained by using the feature representation output by the pre-trained encoder. Training the encoder and the decoder in stages makes model training easier to converge and improves the model training effect; because no manual annotation is needed, labor cost is reduced and the training efficiency of the model is improved.
Example four
Based on the method described in the first embodiment, a fourth embodiment of the present application provides a computer storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method described in the first embodiment.
The computer storage medium provided in this embodiment of the present application stores a program by which pre-training sample data comprising multi-modal data is acquired; an encoder in a neural network model is pre-trained by using the pre-training sample data to obtain a pre-trained encoder; a feature representation output after the pre-trained encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation, are acquired; and a decoder in the neural network model is pre-trained by using the feature representation and the pre-training reference sample. The encoder is pre-trained first, and the decoder is then trained by using the feature representation output by the pre-trained encoder. Training the encoder and the decoder in stages makes model training easier to converge and improves the model training effect; because no manual annotation is needed, labor cost is reduced and the training efficiency of the model is improved.
It should be noted that, according to implementation requirements, each component/step described in the embodiments of the present application may be divided into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps to achieve the purpose of the embodiments of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network and stored in a local recording medium, so that the methods described herein can be processed by software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It can be understood that a computer, a processor, a microprocessor controller or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor or hardware, the model training method described herein is implemented. Further, when a general-purpose computer accesses code for implementing the model training method illustrated herein, the execution of the code transforms the general-purpose computer into a special-purpose computer for performing the model training method illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. A method of model training, comprising:
acquiring pre-training sample data, wherein the pre-training sample data comprises multi-modal data;
pre-training an encoder in a neural network model by using the pre-training sample data to obtain a pre-trained encoder;
acquiring a feature representation output after the pre-training completed encoder processes the pre-training sample data, and a pre-training reference sample corresponding to the feature representation;
pre-training a decoder in the neural network model using the feature representation and the pre-training reference samples.
2. The method of claim 1, wherein said pre-training an encoder in a neural network model with the pre-training sample data to obtain a pre-trained encoder comprises:
processing the data of at least one mode in the pre-training sample data by adding noise to obtain pre-training sample data containing noise;
and inputting the pre-training sample data containing the noise into the encoder, and pre-training the encoder to obtain the encoder with the pre-training completed.
3. The method according to claim 2, wherein the processing of adding noise to the data of at least one modality in the pre-training sample data to obtain pre-training sample data containing noise comprises:
and covering data of at least one mode in the pre-training sample data to obtain the pre-training sample data containing the noise.
4. The method of claim 1, wherein said pre-training a decoder in the neural network model using the feature representations and the pre-training reference samples comprises:
adding noise to the pre-training reference sample to obtain a pre-training reference sample containing noise;
and inputting the pre-training reference sample containing the noise and the corresponding feature representation into the decoder to pre-train the decoder.
5. The method of claim 4, wherein said inputting the pre-training reference samples containing noise and the corresponding feature representations into the decoder, pre-training the decoder, comprises:
and inputting the feature representation into the decoder to obtain the output of the corresponding decoder, comparing the output of the decoder with the pre-training reference sample, and adjusting the parameters of the decoder on the premise of fixing the parameters of the coder which is pre-trained according to the comparison result.
6. The method of claim 4, wherein the subjecting the pre-training reference sample to noise addition to obtain a pre-training reference sample containing noise comprises:
and dividing the pre-training reference sample into at least two phrases, disordering the at least two phrases, and obtaining the pre-training reference sample containing the noise.
7. The method of claim 4, wherein the subjecting the pre-training reference sample to noise addition to obtain a pre-training reference sample containing noise comprises:
and carrying out phrase deletion or phrase covering on the pre-training reference sample, and obtaining the pre-training reference sample containing the noise.
8. The method of claim 1, wherein after the pre-training a decoder in the neural network model using the feature representations and the pre-training reference samples, further comprising:
obtaining model training sample data, inputting the model training sample data into the pre-trained encoder to obtain characteristic representation; and inputting the model reference sample corresponding to the feature representation and the decoding task into the pre-trained decoder, and training the pre-trained decoder.
9. The method of claim 8, wherein the model training sample data comprises multi-modal data of a commodity, and the model reference sample comprises selling point data of the commodity;
inputting the model reference sample corresponding to the feature representation and the decoding task into the pre-trained decoder, and training the pre-trained decoder, wherein the training comprises:
and inputting the feature representation and the corresponding selling point data of the commodity into the pre-trained decoder, and training the decoder.
10. The method of claim 8, wherein the model training sample data comprises multi-modal data for searching, and the model reference sample comprises search results;
inputting the model reference sample corresponding to the feature representation and the decoding task into the pre-trained decoder, and training the pre-trained decoder, wherein the training comprises:
and inputting the feature representation and the corresponding search result into the pre-trained decoder, and training the decoder.
11. The method of claim 8, wherein the model training sample data comprises multi-modal data for questioning, and the model reference sample comprises question and answer data;
inputting the model reference sample corresponding to the feature representation and the decoding task into the pre-trained decoder, and training the pre-trained decoder, wherein the training comprises:
and inputting the feature representation and the corresponding question and answer data into the pre-trained decoder, and training the decoder.
12. A model training apparatus comprising:
the device comprises a sample module, a pre-training module and a pre-training module, wherein the sample module is used for acquiring pre-training sample data which comprises multi-modal data;
the encoder module is used for pre-training an encoder in the neural network model by using the pre-training sample data to obtain a pre-trained encoder;
the characteristic representation module is used for acquiring a characteristic representation output after the pre-training completed encoder processes the pre-training sample data and a pre-training reference sample corresponding to the characteristic representation;
a decoder module for pre-training a decoder in the neural network model using the feature representation and the pre-training reference sample.
13. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the model training method according to any one of claims 1-11.
14. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the model training method as claimed in any one of claims 1 to 11.
CN202011505035.2A 2020-12-18 2020-12-18 Model training method and device, electronic equipment and computer storage medium Pending CN114648110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011505035.2A CN114648110A (en) 2020-12-18 2020-12-18 Model training method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011505035.2A CN114648110A (en) 2020-12-18 2020-12-18 Model training method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN114648110A true CN114648110A (en) 2022-06-21

Family

ID=81990142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011505035.2A Pending CN114648110A (en) 2020-12-18 2020-12-18 Model training method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114648110A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116700684A (en) * 2022-09-30 2023-09-05 荣耀终端有限公司 Code generation method and terminal
CN116700684B (en) * 2022-09-30 2024-04-12 荣耀终端有限公司 Code generation method and terminal


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination