CN114819188A - Model training method and device, electronic equipment and readable storage medium - Google Patents

Model training method and device, electronic equipment and readable storage medium

Info

Publication number
CN114819188A
CN114819188A
Authority
CN
China
Prior art keywords
model
self
attention relationship
relationship value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210554642.0A
Other languages
Chinese (zh)
Inventor
田鑫
陈泽裕
刘佳琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210554642.0A
Publication of CN114819188A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a model training method and device, electronic equipment and a readable storage medium, and relates to the technical field of computers, in particular to the technical field of deep learning. The specific implementation scheme is as follows: acquiring N training sentences, and respectively inputting the N training sentences into a first model and a second model, wherein N is an integer greater than 1; acquiring a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model; acquiring a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value; training the second model based on the first similarity and the second similarity.

Description

Model training method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for model training, an electronic device, and a readable storage medium.
Background
At present, model compression is mainly performed in three ways: knowledge distillation, pruning, and quantization. Model compression based on knowledge distillation uses a large model with good performance (the teacher model) to guide the training of a small model (the student model), transferring the knowledge of the teacher model to the student model so as to compress the model. Generally, the student model is significantly smaller than the teacher model, so the model size is reduced at the cost of some loss in model precision.
Disclosure of Invention
The disclosure provides a model training method, a model training device, an electronic device and a readable storage medium.
According to a first aspect of the present disclosure, there is provided a model training method, comprising:
acquiring N training sentences, and respectively inputting the N training sentences into a first model and a second model, wherein N is an integer greater than 1;
acquiring a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model;
acquiring a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value;
training the second model based on the first similarity and the second similarity;
the first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
According to a second aspect of the present disclosure, there is provided a model training apparatus comprising:
the input module is used for acquiring N training sentences and inputting the N training sentences into a first model and a second model respectively, wherein N is an integer greater than 1;
a first obtaining module, configured to obtain a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model;
a second obtaining module, configured to obtain a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value;
a training module for training the second model based on the first similarity and the second similarity;
the first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
In the embodiment of the disclosure, attention is paid not only to the self-attention relationship values between words within a training sentence, but also to the self-attention relationship values between words of different sentences, which effectively enriches the output of the model, improves the training precision of the model, and ensures that the output precision of the second model is high.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a model training method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a block diagram of a model training apparatus provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, fig. 1 is a flowchart of a model training method according to an embodiment of the present disclosure, and as shown in fig. 1, the method includes the following steps:
s101, obtaining N training sentences, and inputting the N training sentences into a first model and a second model respectively, wherein N is an integer larger than 1.
It should be noted that the model training method provided by the embodiment of the present disclosure may be applied to electronic devices such as a computer, a tablet computer, a mobile phone, and the like. For better understanding, the technical solutions provided by the embodiments of the present disclosure will be described below by taking an electronic device as an example.
In the embodiment of the present disclosure, the electronic device obtains N training sentences, where N is an integer greater than 1, that is, at least two training sentences are obtained. Optionally, the electronic device may crawl a large amount of open internet data such as web pages, news and encyclopedia entries using a web crawler, analyze the natural language data, and preprocess it to generate a plurality of sentences, which are used as the training sentences. Alternatively, the electronic device may also acquire N preset sentences as the training sentences, that is, the training sentences are preset, specific sentences.
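As a minimal illustration of the preprocessing step, the sketch below splits crawled text into candidate training sentences on sentence-ending punctuation. The helper name to_training_sentences, the punctuation set and the truncation to n sentences are assumptions made only for illustration and are not part of the disclosure.

    import re

    def to_training_sentences(raw_text, n):
        # Split on common Chinese and English sentence-ending punctuation and
        # keep the first n non-empty pieces as training sentences.
        pieces = re.split(r"[。！？.!?]", raw_text)
        sentences = [p.strip() for p in pieces if p.strip()]
        return sentences[:n]

    print(to_training_sentences("今天天气很好。我们去公园散步！", 2))
    # ['今天天气很好', '我们去公园散步']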
After obtaining the N training sentences, the electronic device inputs the N training sentences into the first model and the second model, i.e., the N training sentences are input into the first model, and the N training sentences are also input into the second model. The first model and the second model may be the same type of network model, or may be different types of network models.
Step S102, a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model are obtained.
The first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
It should be noted that the first self-attention relationship value and the third self-attention relationship value are self-attention relationship values obtained by pairwise calculation over all words in any one of the N training sentences. For example, when N is 3, the pairwise self-attention relationship values are calculated for all words in the first sentence, then for all words in the second sentence, and then for all words in the third sentence. The second self-attention relationship value and the fourth self-attention relationship value are self-attention relationship values obtained by taking one word from each of two different sentences among the N training sentences. For example, when N is 3, self-attention relationship values are calculated between the first word of the first sentence and every word of the second sentence, then between the second word of the first sentence and every word of the second sentence, and so on, until all words from every pair of different sentences have been combined pairwise to obtain all self-attention relationship values.
For example, the N training sentences include a sentence A and a sentence B, where sentence A includes 3 words and sentence B includes 4 words. Sentence A and sentence B are respectively input into the first model and the second model, and the first model and the second model each process the input training sentences, for example obtaining a semantic representation vector of each word in sentence A and sentence B, and calculating and outputting self-attention relationship values between words based on the semantic representation vector of each word.
The first target sentence may be sentence A or sentence B. Taking sentence A as the first target sentence and sentence B as the second target sentence, the first model and the second model each process sentence A and sentence B based on their own model structures and model parameters. Pairwise calculation over all words in sentence A yields the first self-attention relationship values: the 3 words of sentence A are combined pairwise, giving 9 values in total, so the first model outputs 9 first self-attention relationship values and the second model outputs 9 third self-attention relationship values. In addition, the first model and the second model each calculate self-attention relationship values between the words of sentence A and the words of sentence B: the 3 words of sentence A and the 4 words of sentence B are combined pairwise, giving 12 values, so the first model outputs 12 second self-attention relationship values and the second model outputs 12 fourth self-attention relationship values.
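The counts in this example can be checked with simple arithmetic. The sketch below only reproduces the 9 and 12 figures and assumes nothing beyond the sentence lengths stated above.

    len_a, len_b = 3, 4                      # words in sentence A and sentence B

    # First/third self-attention relationship values: all pairwise combinations
    # of the words within sentence A (a word may also be paired with itself).
    intra_pairs = len_a * len_a              # 3 * 3 = 9

    # Second/fourth self-attention relationship values: one word taken from
    # sentence A paired with one word taken from sentence B.
    cross_pairs = len_a * len_b              # 3 * 4 = 12

    print(intra_pairs, cross_pairs)          # 9 12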
It should be noted that, in some embodiments, the self-attention relationship value may also be referred to as self-attention relationship knowledge, self-attention relationship vector, and the like, which is not specifically limited by the present disclosure.
In the embodiment of the present disclosure, after N training sentences are input to a first model and a second model respectively, the first model and the second model process the N training sentences respectively to obtain and output self-attention relationship values (i.e., a first self-attention relationship value and a third self-attention relationship value) obtained by pairwise calculation of all words in the same sentence, and obtain and output self-attention relationship values (i.e., a second self-attention relationship value and a fourth self-attention relationship value) between words of different sentences. Thus, not only the self-attention relationship value between words in the same sentence but also the self-attention relationship value between words in different sentences can be obtained.
Step S103, obtaining a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value.
It can be understood that, although the inputs of the first model and the second model are both the N training sentences, the outputs of the two models will differ even if the two models are of the same structure type. In the embodiment of the present disclosure, after the N training sentences are respectively input into the first model and the second model, there may be a certain difference between the first self-attention relationship value output by the first model and the third self-attention relationship value output by the second model, and likewise between the second self-attention relationship value output by the first model and the fourth self-attention relationship value output by the second model.
In this embodiment of the disclosure, after a first self-attention relationship value and a second self-attention relationship value output by a first model and a third self-attention relationship value and a fourth self-attention relationship value output by a second model are acquired, a first similarity between the third self-attention relationship value and the first self-attention relationship value and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value are calculated.
Step S104, training the second model based on the first similarity and the second similarity.
It can be understood that the degree of difference between the third self-attention relationship value and the first self-attention relationship value can be obtained based on the first similarity, and the degree of difference between the fourth self-attention relationship value and the second self-attention relationship value can be obtained based on the second similarity. The third and fourth self-attention relationship values are the output of the second model after processing the N training sentences, and the first and second self-attention relationship values are the output of the first model after processing the N training sentences. Therefore, the difference between the first model and the second model in processing the same input can be obtained based on the first similarity and the second similarity, and the second model is trained based on the first similarity and the second similarity.
For example, the first similarity and the second similarity may be back-propagated to the second model, so that the second model can learn from them and thereby be trained.
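A hedged sketch of this training step is shown below: the first model's relationship values are treated as fixed targets and only the second model's parameters receive gradients. The teacher and student callables, their return format and the mean-squared-error loss are illustrative assumptions, not the disclosure's definition; the disclosure later describes judging the similarities with relative entropy.

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, optimizer, batch):
        # First-model (teacher) outputs are fixed targets: no gradient is kept.
        with torch.no_grad():
            t_intra, t_cross = teacher(batch)     # first / second relationship values
        s_intra, s_cross = student(batch)         # third / fourth relationship values

        # Penalize the student for deviating from the teacher on both kinds of values.
        loss = F.mse_loss(s_intra, t_intra) + F.mse_loss(s_cross, t_cross)

        optimizer.zero_grad()
        loss.backward()                           # the discrepancy is back-propagated into the second model only
        optimizer.step()
        return loss.item()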
It should be noted that the purpose of training the second model is to make the output of the trained second model as similar as possible to the output of the first model, so that knowledge migration from the first model to the second model can be realized. For example, the first model may be a model with a larger structure and more parameters, and the second model a model with a smaller structure and fewer parameters. The same N training sentences are input into the first model and the second model, their respective outputs are obtained, the similarity between the outputs of the two models is calculated, and the second model is trained based on that similarity. The trained second model can then achieve a processing effect similar to that of the first model while having a smaller model structure and fewer model parameters, and good model precision is preserved. This is equivalent to compressing the first model into the second model, i.e. realizing model compression, so that the second model can be applied on terminals with limited computing power, effectively broadening the application range of the second model.
In this embodiment of the disclosure, the N training sentences are respectively input into a first model and a second model; a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model, are obtained; then a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value, are calculated, and the second model is trained based on the first similarity and the second similarity, so that the trained second model can achieve a processing effect similar to that of the first model. In addition, attention is paid not only to the self-attention relationship values between words within a training sentence but also to the self-attention relationship values between words of different sentences, which effectively enriches the output of the model, improves the training precision of the model, and ensures that the output precision of the second model is higher and closer to that of the first model.
Optionally, the step S102 may include:
obtaining semantic representation vectors of the words in the N training sentences;
and acquiring a first self-attention relationship value and a second self-attention relationship value which are output by the first model based on the semantic representation vector, and acquiring a third self-attention relationship value and a fourth self-attention relationship value which are output by the second model based on the semantic representation vector.
It should be noted that, the electronic device may obtain, through the first model and the second model, the semantic representation vector of each word in the N training sentences after the N training sentences are input into the first model and the second model, respectively. For example, the first model obtains the semantic representation vector of each word in the N training sentences, and the second model also obtains the semantic representation vector of each word in the N training sentences.
Optionally, after the N training sentences are respectively input into the first model and the second model, the first model and the second model may also pad (padding) the input N training sentences to a fixed length. The fixed length may be a preset length. It will be appreciated that, among the N training sentences, different sentences may have different lengths, e.g. contain different numbers of words. The first model and the second model may perform padding on the input N training sentences to ensure that each training sentence is padded to the fixed length, thereby facilitating the processing of the training sentences by the first model and the second model.
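A minimal sketch of this padding step is given below, assuming token-id sequences, a pad id of 0 and a fixed length chosen in advance; these are illustrative choices rather than values from the disclosure.

    def pad_to_fixed_length(token_id_seqs, fixed_len, pad_id=0):
        padded = []
        for seq in token_id_seqs:
            seq = list(seq[:fixed_len])                       # truncate overly long sentences
            padded.append(seq + [pad_id] * (fixed_len - len(seq)))
        return padded

    print(pad_to_fixed_length([[5, 7, 9], [3, 1, 4, 1, 5]], fixed_len=6))
    # [[5, 7, 9, 0, 0, 0], [3, 1, 4, 1, 5, 0]]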
In the embodiment of the disclosure, after a semantic representation vector of each word in the N training sentences is obtained, the first model calculates, based on the semantic representation vector, a first self-attention relationship value calculated pairwise for all the words in each training sentence, and a second self-attention relationship value between words in different training sentences, where the first self-attention relationship value and the second self-attention relationship value are vectors. Meanwhile, the second model may also calculate a third self-attention relationship value and a fourth self-attention relationship value based on the semantic representation vector, and the third self-attention relationship value and the fourth self-attention relationship value may also be vectors.
In the embodiment of the disclosure, by obtaining the semantic representation vectors of the words in the N training sentences, the first model and the second model can perform processing and model operations based on the semantic representation vectors, so that the training sentences can be processed more conveniently by the first model and the second model.
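One common way to derive self-attention relationship values from per-word semantic representation vectors is a scaled query-key dot product, as in Transformer self-attention. The sketch below follows that convention; the projection matrices, dimensions and random inputs are assumptions, since the disclosure only states that the values are computed from the semantic representation vectors.

    import math
    import torch

    def relation_values(vecs_q, vecs_k, w_q, w_k):
        # Project the semantic representation vectors to queries and keys, then
        # take the scaled dot product between every query-key pair.
        q = vecs_q @ w_q                                  # (len_q, d_head)
        k = vecs_k @ w_k                                  # (len_k, d_head)
        return (q @ k.T) / math.sqrt(q.shape[-1])         # (len_q, len_k)

    d_model, d_head = 16, 8
    w_q, w_k = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
    sent_a, sent_b = torch.randn(3, d_model), torch.randn(4, d_model)

    intra_a = relation_values(sent_a, sent_a, w_q, w_k)   # 3 x 3 values within sentence A
    cross_ab = relation_values(sent_a, sent_b, w_q, w_k)  # 3 x 4 values across A and B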
Optionally, the first model and the second model are the same type of model, for example, both are Transformer models (deep self-attention network models). In this embodiment of the disclosure, before the N training sentences are respectively input into the first model and the second model, the method may further include:
determining a first model layer number of a second model based on the first model, wherein the first model layer number is smaller than the model layer number of the first model;
and randomly initializing the model parameters of the second model containing the first model layer number.
It should be noted that the number of model layers of the second model is smaller than that of the first model, that is, the model volume of the second model is smaller than that of the first model. Optionally, the number of model layers of the first model may be known, and in order to implement model compression, the number of model layers of the second model may be determined based on the number of model layers of the first model; that is, it is only necessary that the number of model layers of the second model is smaller than the number of model layers of the first model.
In the embodiment of the disclosure, based on the number of model layers of the first model, the number of model layers of the second model is determined as the first model layer number, which is smaller than the number of model layers of the first model, that is, the model volume of the second model is determined. Further, the model parameters of the second model containing the first model layer number are randomly initialized, so that the randomly initialized second model can better process the input N training sentences, and its output is ensured to be as close as possible to the output of the first model.
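A hedged sketch of this step for the case where both models are Transformer encoders: a student layer count smaller than the teacher's is chosen and the student's parameters start from random initialization. The 12-layer teacher, the halved student depth and the hidden sizes are illustrative assumptions only.

    import torch.nn as nn

    teacher_num_layers = 12                               # assumed known depth of the first model
    student_num_layers = teacher_num_layers // 2          # first model layer number < teacher's depth

    layer = nn.TransformerEncoderLayer(d_model=384, nhead=12)
    student = nn.TransformerEncoder(layer, num_layers=student_num_layers)

    # PyTorch modules are already randomly initialized on construction; the loop
    # below only makes the "random initialization" of the second model explicit.
    for module in student.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)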
Optionally, after the step S104, the method further includes:
and optimizing a second model comprising the first model layer number based on the training result of the second model so as to adjust the first model layer number to the second model layer number.
In the embodiment of the present disclosure, after the second model is trained based on the first similarity and the second similarity, the output of the trained second model is obtained and its difference from the output of the first model is determined. If the difference is small, the output of the trained second model is close to the output of the first model, and the number of model layers of the second model can be further reduced, that is, the adjusted second model layer number is smaller than the first model layer number, so that the model volume of the second model is further reduced and better compression of the second model is achieved. If the difference is large, the output of the trained second model deviates considerably from the output of the first model, which may be because the number of model layers of the second model is too small and the model output accuracy is reduced; in this case the number of model layers of the second model may be adjusted to a second model layer number that is larger than the first model layer number, and the adjusted second model is trained again based on the model training method, so that the output of the second model with the second model layer number can be ensured to be as close as possible to the output of the first model.
The second model may be trained a plurality of times based on the model training method until the output of the second model is as close as possible to the output of the first model, thereby ensuring the output accuracy of the second model.
In the embodiment of the disclosure, the second model including the number of layers of the first model can be optimized based on the training result of the second model, so as to ensure that the optimized second model has better output accuracy.
Optionally, the training the second model based on the first similarity and the second similarity includes:
judging the first similarity and the second similarity based on relative entropy so as to train the second model;
and the similarity between a third self-attention relationship value output by the trained second model and the first self-attention relationship value is greater than a first preset value, and the similarity between a fourth self-attention relationship value output by the trained second model and the second self-attention relationship value is greater than a second preset value.
The relative entropy may also be referred to as the Kullback-Leibler divergence (K-L divergence), or as information divergence.
In the embodiment of the disclosure, the second model may be trained based on the K-L divergence as a criterion for determining the first similarity and the second similarity, for example, it is determined whether the first similarity and the second similarity reach a preset value based on the K-L divergence, if the first similarity and the second similarity reach the preset value, the training of the second model may be stopped, and if the first similarity and the second similarity do not reach the preset value, the training is continued to optimize the second model, so as to ensure that the second model has better output accuracy.
The similarity between the third self-attention relationship value output by the trained second model and the first self-attention relationship value output by the first model is greater than a first preset value, and the similarity between the fourth self-attention relationship value output by the trained second model and the second self-attention relationship value output by the first model is greater than a second preset value. Therefore, the output accuracy of the trained second model can be ensured to be high, and the knowledge transfer from the first model to the second model can be completed while the output accuracy of the second model is guaranteed, which facilitates the popularization and application of the second model.
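A hedged sketch of judging the similarities with relative entropy: each row of relationship values is turned into a probability distribution with a softmax, the K-L divergence between the second model's distribution and the first model's distribution is computed, and training can stop once the divergence falls below a threshold (i.e. the similarity exceeds the preset value). The softmax step, the threshold value and the random inputs are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def relation_kl(student_values, teacher_values):
        # Rows of the relationship-value matrices are compared as distributions;
        # a smaller divergence means a higher similarity to the first model.
        log_p_student = F.log_softmax(student_values, dim=-1)
        p_teacher = F.softmax(teacher_values, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    threshold = 0.05                                       # illustrative stopping criterion
    divergence = relation_kl(torch.randn(3, 3), torch.randn(3, 3)) \
               + relation_kl(torch.randn(3, 4), torch.randn(3, 4))
    stop_training = divergence.item() < threshold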
Referring to fig. 2, fig. 2 is a structural diagram of a model training device according to an embodiment of the disclosure, and as shown in fig. 2, the model training device 200 includes:
an input module 201, configured to obtain N training sentences, and input the N training sentences into a first model and a second model respectively, where N is an integer greater than 1;
a first obtaining module 202, configured to obtain a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model;
a second obtaining module 203, configured to obtain a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value;
a training module 204, configured to train the second model based on the first similarity and the second similarity;
the first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
Optionally, the first obtaining module 202 is further configured to:
obtaining semantic representation vectors of the words in the N training sentences;
and acquiring a first self-attention relationship value and a second self-attention relationship value which are output by the first model based on the semantic representation vector, and acquiring a third self-attention relationship value and a fourth self-attention relationship value which are output by the second model based on the semantic representation vector.
Optionally, the first model and the second model are the same type of model; the device further comprises:
the determining module is used for determining a first model layer number of the second model based on the first model, wherein the first model layer number is smaller than the number of model layers of the first model;
and the initialization module is used for carrying out random initialization on the model parameters of the second model containing the first model layer number.
Optionally, the apparatus further comprises:
and the optimization module is used for optimizing the second model containing the first model layer number based on the training result of the second model so as to adjust the first model layer number to the second model layer number.
Optionally, the training module 204 is further configured to:
judging the first similarity and the second similarity based on relative entropy so as to train the second model;
and the similarity between a third self-attention relationship value output by the trained second model and the first self-attention relationship value is greater than a first preset value, and the similarity between a fourth self-attention relationship value output by the trained second model and the second self-attention relationship value is greater than a second preset value.
The model training device 200 provided by the embodiment of the present disclosure pays attention not only to the self-attention relationship values between words within a training sentence but also to the self-attention relationship values between words of different sentences, thereby effectively enriching the model output, improving the training precision of the model, and ensuring that the output precision of the second model is high.
In the technical solution of the present disclosure, the acquisition, storage, application and the like of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 3 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the model training method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the model training method described above in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A model training method, comprising:
acquiring N training sentences, and respectively inputting the N training sentences into a first model and a second model, wherein N is an integer greater than 1;
acquiring a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model;
acquiring a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value;
training the second model based on the first similarity and the second similarity;
the first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
2. The method of claim 1, wherein said obtaining a first and second self-attention relationship value of the first model output and a third and fourth self-attention relationship value of the second model output comprises:
obtaining semantic representation vectors of words in the N training sentences;
and acquiring a first self-attention relationship value and a second self-attention relationship value which are output by the first model based on the semantic representation vector, and acquiring a third self-attention relationship value and a fourth self-attention relationship value which are output by the second model based on the semantic representation vector.
3. The method of claim 1, wherein the first model and the second model are the same type of model;
before the inputting the N training sentences into the first model and the second model, respectively, the method further includes:
determining a first model layer number of a second model based on the first model, wherein the first model layer number is smaller than the model layer number of the first model;
and randomly initializing the model parameters of the second model containing the first model layer number.
4. The method of claim 3, wherein after the training of the second model based on the first and second similarities, the method further comprises:
and optimizing a second model comprising the first model layer number based on the training result of the second model so as to adjust the first model layer number to the second model layer number.
5. The method of any of claims 1-4, wherein the training the second model based on the first and second similarities comprises:
judging the first similarity and the second similarity based on relative entropy so as to train the second model;
and the similarity between a third self-attention relationship value output by the trained second model and the first self-attention relationship value is greater than a first preset value, and the similarity between a fourth self-attention relationship value output by the trained second model and the second self-attention relationship value is greater than a second preset value.
6. A model training apparatus comprising:
the input module is used for acquiring N training sentences and inputting the N training sentences into a first model and a second model respectively, wherein N is an integer greater than 1;
a first obtaining module, configured to obtain a first self-attention relationship value and a second self-attention relationship value output by the first model, and a third self-attention relationship value and a fourth self-attention relationship value output by the second model;
a second obtaining module, configured to obtain a first similarity between the third self-attention relationship value and the first self-attention relationship value, and a second similarity between the fourth self-attention relationship value and the second self-attention relationship value;
a training module for training the second model based on the first similarity and the second similarity;
the first and third self-attention relationship values are self-attention relationship values calculated from words in a first target sentence, the second and fourth self-attention relationship values are self-attention relationship values calculated from words in the first target sentence and words in a second target sentence, and the first and second target sentences are different sentences in the N training sentences.
7. The apparatus of claim 6, wherein the first obtaining means is further configured to:
obtaining semantic representation vectors of the words in the N training sentences;
and acquiring a first self-attention relationship value and a second self-attention relationship value which are output by the first model based on the semantic representation vector, and acquiring a third self-attention relationship value and a fourth self-attention relationship value which are output by the second model based on the semantic representation vector.
8. The apparatus of claim 6, wherein the first model and the second model are a same type of model; the device further comprises:
the determining module is used for determining a first model layer number of the second model based on the first model, wherein the first model layer number is smaller than the number of model layers of the first model;
and the initialization module is used for carrying out random initialization on the model parameters of the second model containing the first model layer number.
9. The apparatus of claim 6, wherein the apparatus further comprises:
and the optimization module is used for optimizing the second model containing the first model layer number based on the training result of the second model so as to adjust the first model layer number to the second model layer number.
10. The apparatus of any of claims 6-9, wherein the training module is further to:
judging the first similarity and the second similarity based on relative entropy so as to train the second model;
and the similarity between a third self-attention relationship value output by the trained second model and the first self-attention relationship value is greater than a first preset value, and the similarity between a fourth self-attention relationship value output by the trained second model and the second self-attention relationship value is greater than a second preset value.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202210554642.0A 2022-05-19 2022-05-19 Model training method and device, electronic equipment and readable storage medium Pending CN114819188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210554642.0A CN114819188A (en) 2022-05-19 2022-05-19 Model training method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210554642.0A CN114819188A (en) 2022-05-19 2022-05-19 Model training method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114819188A true CN114819188A (en) 2022-07-29

Family

ID=82516997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210554642.0A Pending CN114819188A (en) 2022-05-19 2022-05-19 Model training method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114819188A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination