Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
In brief, the EMD-distance-based knowledge distillation sequence recommendation method provided by the invention comprises the following steps: training a teacher model for guiding the learning process of a student model; using the teacher model to guide the training of the student model, so that the knowledge learned by the teacher model is distilled into the student model; and deploying the trained student model for subsequent sequence recommendation applications. Hereinafter, the model structure of NextItNet is described as an example, in which both the teacher model and the student model comprise an Embedding input layer, an intermediate hidden layer stacked from dilated convolution residual blocks, and a classification output layer, the teacher model containing more intermediate hidden layers than the student model.
Specifically, referring to Fig. 2, the EMD-distance-based knowledge distillation sequence recommendation method provided by this embodiment includes the following steps.
Step S210: training the teacher model with the set loss function as the target to obtain a pre-trained teacher model.
First, a well-performing teacher model needs to be trained in advance so that it can effectively guide the training of the subsequent student model.
Taking the sequence recommendation model NextItNet as an example, the constructed teacher model generally consists of an Embedding input layer, an intermediate hidden layer stacked from N structurally identical dilated convolution residual blocks, and a final Softmax classification output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. To give the teacher model stronger feature representation capability, more dilated convolution residual blocks need to be stacked. During pre-training, the user's historical browsing sequence from the dataset is fed into the whole network and modeled to obtain a representation of the user's preferences, so that an accurate recommendation can be made for the user at the next moment.
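As a rough illustration only, the following is a minimal PyTorch sketch of such a structure. The names (`ResidualBlock`, `NextItNetLike`) and hyperparameters are hypothetical, and the real NextItNet includes further details (layer normalization, a specific dilation schedule, full-sequence training) that are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two dilated causal 1-D convolutions packaged into one residual block."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad1 = (kernel_size - 1) * dilation          # causal left-padding
        self.pad2 = (kernel_size - 1) * dilation * 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation * 2)

    def forward(self, x):                                 # x: (batch, channels, len)
        out = F.relu(self.conv1(F.pad(x, (self.pad1, 0))))
        out = F.relu(self.conv2(F.pad(out, (self.pad2, 0))))
        return out + x                                    # residual connection

class NextItNetLike(nn.Module):
    """Embedding input layer -> stacked dilated residual blocks -> Softmax head."""
    def __init__(self, num_items, embed_dim=64, num_blocks=8):
        super().__init__()
        self.embedding = nn.Embedding(num_items, embed_dim)
        self.blocks = nn.ModuleList(ResidualBlock(embed_dim)
                                    for _ in range(num_blocks))
        self.out = nn.Linear(embed_dim, num_items)        # classification output layer

    def forward(self, seq, return_hidden=False):
        h = self.embedding(seq).transpose(1, 2)           # (batch, dim, len)
        hiddens = []
        for block in self.blocks:
            h = block(h)
            hiddens.append(h[:, :, -1])                   # per-block output vector
        logits = self.out(h[:, :, -1])                    # next-item logits z
        return (logits, hiddens) if return_hidden else logits
```

The teacher would be instantiated with a large number of blocks N, and the student with a much smaller number M.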
For example, in the training process of the teacher model, the input is the user's historical browsing sequence, the output is the item recommended to the user at the next moment, and the loss function is the cross entropy between the correct item and the predicted item. The total loss $L_{pretrain}$ is calculated as:

$$L_{pretrain} = -\sum_{i=1}^{T} \hat{y}_i \log(y_i) \quad (1)$$

where $\hat{y}_i$ is the correct item label, $y_i$ is the predicted item label, and T is the total number of training samples.
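For illustration, a hedged sketch of one pre-training step under equation (1), using the hypothetical `NextItNetLike` class above (the real NextItNet computes the loss at every sequence position; here only the last position is used for brevity):

```python
import torch
import torch.nn.functional as F

teacher = NextItNetLike(num_items=10000, embed_dim=128, num_blocks=16)  # large N
opt_teacher = torch.optim.Adam(teacher.parameters(), lr=1e-3)

def pretrain_step(batch, target):
    """batch: (B, seq_len) tensor of item ids; target: (B,) correct next items."""
    logits = teacher(batch)
    loss = F.cross_entropy(logits, target)   # equation (1), averaged over samples
    opt_teacher.zero_grad()
    loss.backward()
    opt_teacher.step()
    return loss.item()
```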
Step S220: fixing the parameters of the pre-trained teacher model, adding the student model for collaborative training, and optimizing only the parameters of the student model, so as to distill the knowledge of the pre-trained teacher model into the student model.
After the pre-training stage of step S210 is completed, the teacher model can achieve high recommendation accuracy, but because the model is large in scale, has many parameters, and has a long inference time, it is difficult to meet real-world requirements. It is therefore necessary to distill the knowledge learned by the teacher model into the student model, reducing the model scale and accelerating inference without reducing accuracy, so that the recommendation service can be completed better.
The student model also adopts the NextItNet structure and generally consists of an Embedding input layer, an intermediate hidden layer stacked from M structurally identical dilated convolution residual blocks, and a final Softmax output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. The difference from the teacher model is that the number M of dilated convolution residual blocks stacked in the student model is far smaller than the number N stacked in the teacher model.
In step S220, the pre-trained teacher model is used, the student model is added for collaborative training, and knowledge distillation from the teacher model guides the training and learning of the student model.
In one embodiment, to distill the knowledge of the teacher model into the student model well, the knowledge distillation process includes distillation of three parts: the Embedding input layer, the intermediate hidden layer, and the Softmax output layer.
For example, referring to Fig. 3, the framework generally includes two parts: a pre-trained teacher model T and a student model S. The parameters of the teacher model T are fixed during training, the student model S is added for collaborative training, and only the parameters of the student model S are optimized. The overall knowledge distillation process comprises the following steps:
Step S221, distilling the Embedding input layer
The Embedding input layer converts the user's historical browsing sequence $X = (x_1, x_2, \ldots, x_n)$ into an Embedding matrix $E = [e_1, e_2, \ldots, e_n]$, a representation of the meaning of the user's historical browsing sequence that a computer can process, where each column $e_i$ of the matrix is the Embedding vector of the corresponding item.
The purpose of Embedding input layer distillation is to distill the knowledge of the teacher model's Embedding matrix $E^T$ into the student model's Embedding matrix $E^S$.
In one embodiment, the distillation process of the Embedding input layer is optimized by minimizing the mean square error (MSE), with the loss $L_{emb}$ calculated as:

$$L_{emb} = \mathrm{MSE}(E^S W_e, E^T) \quad (2)$$

where $\mathrm{MSE}(\cdot)$ denotes the mean square error calculation and $W_e$ is a learnable linear mapping matrix.
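A minimal sketch of this loss under the assumption that the student and teacher embedding dimensions differ (the sizes below are hypothetical), with `W_e` bridging them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_s, d_t = 64, 128                          # hypothetical student/teacher dims
W_e = nn.Parameter(torch.empty(d_s, d_t))   # learnable linear mapping matrix
nn.init.xavier_uniform_(W_e)

def embedding_distill_loss(E_S, E_T):
    # Equation (2): L_emb = MSE(E_S W_e, E_T); the teacher side is detached
    # because only the student's parameters are optimized during distillation.
    return F.mse_loss(E_S @ W_e, E_T.detach())
```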
Step S222, distilling the intermediate hidden layer
Knowledge distillation of the intermediate hidden layer enables the student model to better learn the behavior of the teacher model, but because the number N of dilated convolution residual blocks stacked in the teacher model differs from the number M stacked in the student model, how to complete the many-to-many mapping between the different layers becomes a key problem.
In one embodiment, the present method proposes to use the EMD (Earth Mover's Distance) to better measure the difference between the teacher model and the student model, thereby adaptively completing the many-to-many mapping between the hidden layers and avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship.
Specifically, let $H^T = \{h^T_1, h^T_2, \ldots, h^T_N\}$ denote the outputs of the different layers of the teacher model's intermediate hidden layer, and $H^S = \{h^S_1, h^S_2, \ldots, h^S_M\}$ denote the outputs of the different layers of the student model's intermediate hidden layer, where N and M denote the numbers of dilated convolution residual blocks stacked in the teacher model and the student model, respectively. Here $h^T_j$ denotes the output vector of the j-th intermediate hidden layer of the teacher model, with weight coefficient $w^T_j$ initialized to $1/N$ and optimized during training. Likewise, $h^S_i$ denotes the output vector of the i-th intermediate hidden layer of the student model, with weight coefficient $w^S_i$ initialized to $1/M$ and optimized during training.
A ground distance matrix $D = [d_{ji}] \in \mathbb{R}^{N \times M}$ is defined, where $d_{ji}$ denotes the distance for transferring the output vector $h^T_j$ of the j-th intermediate hidden layer of the teacher model to the output vector $h^S_i$ of the i-th intermediate hidden layer of the student model.
In one embodiment, the distance is calculated using the mean square error (MSE), expressed as:

$$d_{ji} = \mathrm{MSE}(h^S_i W_h, h^T_j) \quad (3)$$

where $W_h$ is a learnable linear mapping matrix.
In addition, a mapping transfer matrix $F = [f_{ji}] \in \mathbb{R}^{N \times M}$ is defined, where $f_{ji}$ denotes the amount of mapping transferred from the output vector $h^T_j$ of the j-th intermediate hidden layer of the teacher model to the output vector $h^S_i$ of the i-th intermediate hidden layer of the student model. Thus, the overall transfer loss between the intermediate hidden layers of the teacher model and the student model can be calculated as:

$$\sum_{j=1}^{N} \sum_{i=1}^{M} f_{ji}\, d_{ji} \quad (4)$$

subject to the following constraints:

$$f_{ji} \ge 0, \qquad \sum_{i=1}^{M} f_{ji} = w^T_j, \qquad \sum_{j=1}^{N} f_{ji} = w^S_i$$

This optimization problem can be solved by a linear programming method to obtain the optimal mapping transfer matrix $F^* = [f^*_{ji}]$. The EMD (Earth Mover's Distance) can then be defined as:

$$\mathrm{EMD}(H^S, H^T) = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M} f^*_{ji}\, d_{ji}}{\sum_{j=1}^{N} \sum_{i=1}^{M} f^*_{ji}} \quad (5)$$
Finally, knowledge distillation of the intermediate hidden layer can be carried out by optimizing the EMD distance between the output matrix $H^T$ of the teacher model's intermediate hidden layer and the output matrix $H^S$ of the student model's intermediate hidden layer. The intermediate hidden layer knowledge distillation loss $L_{hidden}$ is expressed as:

$$L_{hidden} = \mathrm{EMD}(H^S, H^T) \quad (6)$$
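A hedged sketch of equations (3) through (6), solving the transport problem with `scipy.optimize.linprog`. The function and variable names are ours; the source states that the layer weights are also optimized during training, but since it does not give the update rule, they are treated here as fixed inputs, and the optimal flow F* is treated as a constant so that gradients reach the student only through the distance matrix:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linprog

def emd_hidden_loss(H_S, H_T, w_S, w_T, W_h):
    """H_S: list of M student hidden vectors, each (batch, d_s);
    H_T: list of N teacher hidden vectors, each (batch, d_t);
    w_S, w_T: 1-D numpy weight arrays, each summing to 1; W_h: (d_s, d_t)."""
    N, M = len(H_T), len(H_S)
    # Equation (3): ground distances d_ji = MSE(h_i^S W_h, h_j^T).
    D = torch.stack([torch.stack([F.mse_loss(h_s @ W_h, h_t.detach())
                                  for h_s in H_S]) for h_t in H_T])   # (N, M)

    # Equation (4): solve the transport problem by linear programming.
    c = D.detach().cpu().numpy().reshape(-1)         # minimize sum_ji f_ji * d_ji
    A_eq, b_eq = [], []
    for j in range(N):                               # sum_i f_ji = w_j^T
        row = np.zeros(N * M); row[j * M:(j + 1) * M] = 1.0
        A_eq.append(row); b_eq.append(float(w_T[j]))
    for i in range(M):                               # sum_j f_ji = w_i^S
        row = np.zeros(N * M); row[i::M] = 1.0
        A_eq.append(row); b_eq.append(float(w_S[i]))
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (N * M), method="highs")
    flow = torch.as_tensor(res.x.reshape(N, M), dtype=D.dtype)  # optimal F*

    # Equations (5)-(6): EMD(H_S, H_T); gradients flow only through D.
    return (flow * D).sum() / flow.sum()
```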
In summary, by designing the mapping transfer matrix and measuring the difference between the teacher model and the student model with the EMD distance, the many-to-many mapping between the intermediate hidden layers of the teacher and student models can be completed adaptively, avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship. Combining the designed mapping transfer matrix with the EMD distance minimizes the loss of the knowledge distillation process, thereby preserving the accuracy of the student model.
Step S223, distilling the Softmax output layer
After the input has passed through the model's final dilated convolution residual block, the output vector z is fed into the Softmax classification layer to obtain the final item prediction probability distribution and complete the item prediction task.
The purpose of Softmax output layer distillation is to make the final item prediction probability distribution of the student model approximate that of the teacher model, thereby learning the teacher model's prediction behavior.
In one embodiment, the distillation process of the Softmax output layer is optimized by minimizing the cross entropy between the final item prediction probability distributions of the student model and the teacher model, with the distillation loss $L_{pred}$ expressed as:

$$L_{pred} = -\mathrm{softmax}(z^T) \cdot \log(\mathrm{softmax}(z^S / t)) \quad (7)$$

where $z^T$ denotes the output vector of the last dilated convolution residual block of the teacher model, $z^S$ denotes the output vector of the last dilated convolution residual block of the student model, and $t$ is a temperature used to control the variation range of the student model's output vector. In practical applications, setting an appropriate value of t controls the convergence speed and prediction accuracy of the student model during training.
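A short sketch of equation (7). Following the source's formulation, the temperature t is applied only to the student logits; the default value below is our own assumption:

```python
import torch.nn.functional as F

def softmax_distill_loss(z_S, z_T, t=2.0):
    # Equation (7): cross entropy between the teacher's distribution and the
    # temperature-scaled student distribution (t = 2.0 is a hypothetical default).
    p_T = F.softmax(z_T.detach(), dim=-1)        # teacher probabilities, fixed
    log_p_S = F.log_softmax(z_S / t, dim=-1)     # student log-probabilities
    return -(p_T * log_p_S).sum(dim=-1).mean()
```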
Through knowledge distillation of the Embedding input layer, the intermediate hidden layer, and the Softmax output layer, the knowledge learned by the teacher model can be well distilled into the student model. The total loss $L_{distill}$ of the whole knowledge distillation process is:

$$L_{distill} = L_{emb} + L_{hidden} + L_{pred} \quad (8)$$
The whole knowledge distillation model is trained on the training data until it converges, yielding a student model with a small model size, few parameters, and good performance for subsequent deployment and use.
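Putting the pieces together, a hedged sketch of one collaborative training step under equation (8), wiring up the hypothetical helpers from the earlier sketches; the optimizer covers the student's parameters plus the mapping matrices, while the teacher stays frozen:

```python
import torch
import torch.nn as nn

student = NextItNetLike(num_items=10000, embed_dim=64, num_blocks=4)  # small M
W_h = nn.Parameter(torch.empty(64, 128))        # hidden-layer mapping matrix
nn.init.xavier_uniform_(W_h)
opt_student = torch.optim.Adam(list(student.parameters()) + [W_e, W_h], lr=1e-3)

def distill_step(batch, w_S, w_T):
    with torch.no_grad():                        # teacher parameters stay fixed
        z_T, H_T = teacher(batch, return_hidden=True)
    z_S, H_S = student(batch, return_hidden=True)
    loss = (embedding_distill_loss(student.embedding.weight,
                                   teacher.embedding.weight)
            + emd_hidden_loss(H_S, H_T, w_S, w_T, W_h)
            + softmax_distill_loss(z_S, z_T, t=2.0))   # equation (8)
    opt_student.zero_grad()
    loss.backward()
    opt_student.step()
    return loss.item()
```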
Step S230: using the trained student model, with the historical browsing sequence of a given user as input, to provide a sequence recommendation service for the user.
In practical applications, sequence recommendation corresponds to the model's test phase, and the final trained student model is used for the recommendation service. Given a user's historical browsing sequence, the trained student model finds the item the user is most likely to be interested in at the next moment, providing a fast and accurate recommendation service.
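For illustration, serving a recommendation then reduces to one forward pass through the distilled student model; a minimal sketch (the top-k cut-off is our own choice):

```python
import torch

@torch.no_grad()
def recommend(student, history, top_k=10):
    """history: (seq_len,) tensor of a user's browsed item ids."""
    student.eval()
    logits = student(history.unsqueeze(0))             # (1, num_items)
    probs = torch.softmax(logits, dim=-1)
    return torch.topk(probs, top_k, dim=-1).indices.squeeze(0)  # likely next items
```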
The student model obtained according to the invention has a small model size, few parameters, and a short inference time when deployed in a practical sequence recommendation system, while maintaining high model accuracy. It can better meet user needs and has very important practical significance and broad application prospects. For example, the invention can recommend items of potential interest according to the user's attributes (such as gender, age, education, region, and occupation) and the user's past behaviors in the system (such as browsing, clicking, searching, purchasing, and favoriting).
Further, to verify the effectiveness and advancement of the proposed method, extensive experiments were carried out on MovieLens, a public dataset in the field of sequence recommendation. The experimental results show that the EMD-distance-based knowledge distillation sequence recommendation method achieves the current best results in terms of model parameter count, inference time, and model performance, can provide fast and accurate recommendation services for users, is well suited for deployment in a sequence recommendation system, and has very important practical significance and broad application prospects.
In conclusion, the invention distills the Embedding input layer, the intermediate hidden layer, and the Softmax output layer, and in the distillation of the intermediate hidden layer uses the EMD distance to measure the difference between the teacher model and the student model, thereby adaptively completing the many-to-many mapping between the hidden layers and avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship.
It is to be understood that those skilled in the art can appropriately change or modify the above-described embodiments without departing from the spirit and scope of the present invention. For example, distillation may be performed only on the intermediate hidden layer, or only on the intermediate hidden layer and the Softmax output layer. Likewise, for sequence recommendation models other than NextItNet, the EMD-distance-based knowledge distillation method proposed by the present invention is also applicable. In addition, the losses involved in the training process can also be measured with a squared loss or an exponential loss, which is not limited by the present invention.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.