Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
In brief, the EMD-distance-based knowledge distillation sequence recommendation method provided by the invention comprises the following steps: training a teacher model for guiding the learning process of a student model; using the teacher model to guide the training of the student model, so that the knowledge learned by the teacher model is distilled into the student model; and deploying the trained student model for subsequent sequence recommendation applications. Hereinafter, the model structure of NextItNet is described as an example, in which both the teacher model and the student model comprise an Embedding input layer, an intermediate hidden layer stacked from dilated convolution residual blocks, and a classification output layer, the teacher model containing more intermediate hidden layers than the student model.
Specifically, referring to Fig. 2, the EMD-distance-based knowledge distillation sequence recommendation method provided by this embodiment includes the following steps.
Step S210: training the teacher model with the set loss function as the target to obtain a pre-trained teacher model.
First, a well-performing teacher model needs to be trained in advance so that it can effectively guide the training of the subsequent student model.
Taking the sequence recommendation model NextItNet as an example, the constructed teacher model generally consists of an Embedding input layer, an intermediate hidden layer stacked from N structurally identical dilated convolution residual blocks, and a final Softmax classification output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. To give the teacher model stronger feature representation capability, more dilated convolution residual blocks need to be stacked. During pre-training, the user's historical browsing sequence from the dataset is fed into the whole network and modeled to obtain a representation of the user's preferences, so that an accurate recommendation can be made for the user at the next moment.
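As a rough illustration only, the following is a minimal PyTorch sketch of such a structure. The names (`ResidualBlock`, `NextItNetLike`) and hyperparameters are hypothetical, and the real NextItNet includes further details (layer normalization, a specific dilation schedule, full-sequence training) that are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two dilated causal 1-D convolutions packaged into one residual block."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad1 = (kernel_size - 1) * dilation          # causal left-padding
        self.pad2 = (kernel_size - 1) * dilation * 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation * 2)

    def forward(self, x):                                 # x: (batch, channels, len)
        out = F.relu(self.conv1(F.pad(x, (self.pad1, 0))))
        out = F.relu(self.conv2(F.pad(out, (self.pad2, 0))))
        return out + x                                    # residual connection

class NextItNetLike(nn.Module):
    """Embedding input layer -> stacked dilated residual blocks -> Softmax head."""
    def __init__(self, num_items, embed_dim=64, num_blocks=8):
        super().__init__()
        self.embedding = nn.Embedding(num_items, embed_dim)
        self.blocks = nn.ModuleList(ResidualBlock(embed_dim)
                                    for _ in range(num_blocks))
        self.out = nn.Linear(embed_dim, num_items)        # classification output layer

    def forward(self, seq, return_hidden=False):
        h = self.embedding(seq).transpose(1, 2)           # (batch, dim, len)
        hiddens = []
        for block in self.blocks:
            h = block(h)
            hiddens.append(h[:, :, -1])                   # per-block output vector
        logits = self.out(h[:, :, -1])                    # next-item logits z
        return (logits, hiddens) if return_hidden else logits
```

The teacher would be instantiated with a large number of blocks N, and the student with a much smaller number M.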
For example, in the training process of the teacher model, the input is the user's historical browsing sequence, the output is the item recommended to the user at the next moment, and the loss function is the cross entropy between the correct item and the predicted item. The total loss $L_{pretrain}$ is calculated as:

$$L_{pretrain} = -\sum_{i=1}^{T} \hat{y}_i \log(y_i) \quad (1)$$

where $\hat{y}_i$ is the correct item label, $y_i$ is the predicted item label, and T is the total number of training samples.
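For illustration, a hedged sketch of one pre-training step under equation (1), using the hypothetical `NextItNetLike` class above (the real NextItNet computes the loss at every sequence position; here only the last position is used for brevity):

```python
import torch
import torch.nn.functional as F

teacher = NextItNetLike(num_items=10000, embed_dim=128, num_blocks=16)  # large N
opt_teacher = torch.optim.Adam(teacher.parameters(), lr=1e-3)

def pretrain_step(batch, target):
    """batch: (B, seq_len) tensor of item ids; target: (B,) correct next items."""
    logits = teacher(batch)
    loss = F.cross_entropy(logits, target)   # equation (1), averaged over samples
    opt_teacher.zero_grad()
    loss.backward()
    opt_teacher.step()
    return loss.item()
```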
Step S220: fixing the parameters of the pre-trained teacher model, adding the student model for collaborative training, and optimizing only the parameters of the student model, so as to distill the knowledge of the pre-trained teacher model into the student model.
After the pre-training stage of step S210 is completed, the teacher model can achieve high recommendation accuracy, but because the model is large in scale, has many parameters, and has a long inference time, it is difficult to meet real-world requirements. It is therefore necessary to distill the knowledge learned by the teacher model into the student model, reducing the model scale and accelerating inference without reducing accuracy, so that the recommendation service can be completed better.
The student model also adopts the NextItNet structure and generally consists of an Embedding input layer, an intermediate hidden layer stacked from M structurally identical dilated convolution residual blocks, and a final Softmax output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. The difference from the teacher model is that the number M of dilated convolution residual blocks stacked in the student model is far smaller than the number N stacked in the teacher model.
In step S220, the pre-trained teacher model is used, the student model is added for collaborative training, and knowledge distillation from the teacher model guides the training and learning of the student model.
In one embodiment, to distill the knowledge of the teacher model into the student model well, the knowledge distillation process includes distillation of three parts: the Embedding input layer, the intermediate hidden layer, and the Softmax output layer.
For example, referring to Fig. 3, the framework generally includes two parts: a pre-trained teacher model T and a student model S. The parameters of the teacher model T are fixed during training, the student model S is added for collaborative training, and only the parameters of the student model S are optimized. The overall knowledge distillation process comprises the following steps:
Step S221, distilling the Embedding input layer
The Embedding input layer converts the user's historical browsing sequence $X = (x_1, x_2, \ldots, x_n)$ into an Embedding matrix $E = [e_1, e_2, \ldots, e_n]$, a representation of the meaning of the user's historical browsing sequence that a computer can process, where each column $e_i$ of the matrix is the Embedding vector of the corresponding item.
The purpose of Embedding input layer distillation is to distill the knowledge of the teacher model's Embedding matrix $E^T$ into the student model's Embedding matrix $E^S$.
In one embodiment, the distillation process of the Embedding input layer is optimized by minimizing the mean square error (MSE), with the loss $L_{emb}$ calculated as:

$$L_{emb} = \mathrm{MSE}(E^S W_e, E^T) \quad (2)$$

where $\mathrm{MSE}(\cdot)$ denotes the mean square error calculation and $W_e$ is a learnable linear mapping matrix.
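A minimal sketch of this loss under the assumption that the student and teacher embedding dimensions differ (the sizes below are hypothetical), with `W_e` bridging them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_s, d_t = 64, 128                          # hypothetical student/teacher dims
W_e = nn.Parameter(torch.empty(d_s, d_t))   # learnable linear mapping matrix
nn.init.xavier_uniform_(W_e)

def embedding_distill_loss(E_S, E_T):
    # Equation (2): L_emb = MSE(E_S W_e, E_T); the teacher side is detached
    # because only the student's parameters are optimized during distillation.
    return F.mse_loss(E_S @ W_e, E_T.detach())
```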
Step S222, distilling the intermediate hidden layer
Knowledge distillation of the intermediate hidden layer enables the student model to better learn the behavior of the teacher model, but because the number N of dilated convolution residual blocks stacked in the teacher model differs from the number M stacked in the student model, how to complete the many-to-many mapping between the different layers becomes a key problem.
In one embodiment, the present method proposes to use the EMD (Earth Mover's Distance) to better measure the difference between the teacher model and the student model, thereby adaptively completing the many-to-many mapping between the hidden layers and avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship.
Specifically, let $H^T = \{h^T_1, h^T_2, \ldots, h^T_N\}$ denote the outputs of the different layers of the teacher model's intermediate hidden layer, and $H^S = \{h^S_1, h^S_2, \ldots, h^S_M\}$ denote the outputs of the different layers of the student model's intermediate hidden layer, where N and M denote the numbers of dilated convolution residual blocks stacked in the teacher model and the student model, respectively. Here $h^T_j$ denotes the output vector of the j-th intermediate hidden layer of the teacher model, with weight coefficient $w^T_j$ initialized to $1/N$ and optimized during training. Likewise, $h^S_i$ denotes the output vector of the i-th intermediate hidden layer of the student model, with weight coefficient $w^S_i$ initialized to $1/M$ and optimized during training.
A ground distance matrix $D = [d_{ji}] \in \mathbb{R}^{N \times M}$ is defined, where $d_{ji}$ denotes the distance for transferring the output vector $h^T_j$ of the j-th intermediate hidden layer of the teacher model to the output vector $h^S_i$ of the i-th intermediate hidden layer of the student model.
In one embodiment, the distance is calculated using the mean square error (MSE), expressed as:

$$d_{ji} = \mathrm{MSE}(h^S_i W_h, h^T_j) \quad (3)$$

where $W_h$ is a learnable linear mapping matrix.
In addition, a mapping transfer matrix $F = [f_{ji}] \in \mathbb{R}^{N \times M}$ is defined, where $f_{ji}$ denotes the amount of mapping transferred from the output vector $h^T_j$ of the j-th intermediate hidden layer of the teacher model to the output vector $h^S_i$ of the i-th intermediate hidden layer of the student model. Thus, the overall transfer loss between the intermediate hidden layers of the teacher model and the student model can be calculated as:

$$\sum_{j=1}^{N} \sum_{i=1}^{M} f_{ji}\, d_{ji} \quad (4)$$

subject to the following constraints:

$$f_{ji} \ge 0, \qquad \sum_{i=1}^{M} f_{ji} = w^T_j, \qquad \sum_{j=1}^{N} f_{ji} = w^S_i$$

This optimization problem can be solved by a linear programming method to obtain the optimal mapping transfer matrix $F^* = [f^*_{ji}]$. The EMD (Earth Mover's Distance) can then be defined as:

$$\mathrm{EMD}(H^S, H^T) = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M} f^*_{ji}\, d_{ji}}{\sum_{j=1}^{N} \sum_{i=1}^{M} f^*_{ji}} \quad (5)$$
Finally, knowledge distillation of the intermediate hidden layer can be carried out by optimizing the EMD distance between the output matrix $H^T$ of the teacher model's intermediate hidden layer and the output matrix $H^S$ of the student model's intermediate hidden layer. The intermediate hidden layer knowledge distillation loss $L_{hidden}$ is expressed as:

$$L_{hidden} = \mathrm{EMD}(H^S, H^T) \quad (6)$$
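A hedged sketch of equations (3) through (6), solving the transport problem with `scipy.optimize.linprog`. The function and variable names are ours; the source states that the layer weights are also optimized during training, but since it does not give the update rule, they are treated here as fixed inputs, and the optimal flow F* is treated as a constant so that gradients reach the student only through the distance matrix:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linprog

def emd_hidden_loss(H_S, H_T, w_S, w_T, W_h):
    """H_S: list of M student hidden vectors, each (batch, d_s);
    H_T: list of N teacher hidden vectors, each (batch, d_t);
    w_S, w_T: 1-D numpy weight arrays, each summing to 1; W_h: (d_s, d_t)."""
    N, M = len(H_T), len(H_S)
    # Equation (3): ground distances d_ji = MSE(h_i^S W_h, h_j^T).
    D = torch.stack([torch.stack([F.mse_loss(h_s @ W_h, h_t.detach())
                                  for h_s in H_S]) for h_t in H_T])   # (N, M)

    # Equation (4): solve the transport problem by linear programming.
    c = D.detach().cpu().numpy().reshape(-1)         # minimize sum_ji f_ji * d_ji
    A_eq, b_eq = [], []
    for j in range(N):                               # sum_i f_ji = w_j^T
        row = np.zeros(N * M); row[j * M:(j + 1) * M] = 1.0
        A_eq.append(row); b_eq.append(float(w_T[j]))
    for i in range(M):                               # sum_j f_ji = w_i^S
        row = np.zeros(N * M); row[i::M] = 1.0
        A_eq.append(row); b_eq.append(float(w_S[i]))
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (N * M), method="highs")
    flow = torch.as_tensor(res.x.reshape(N, M), dtype=D.dtype)  # optimal F*

    # Equations (5)-(6): EMD(H_S, H_T); gradients flow only through D.
    return (flow * D).sum() / flow.sum()
```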
In summary, by designing the mapping transfer matrix and measuring the difference between the teacher model and the student model with the EMD distance, the many-to-many mapping between the intermediate hidden layers of the teacher and student models can be completed adaptively, avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship. Combining the designed mapping transfer matrix with the EMD distance minimizes the loss of the knowledge distillation process, thereby preserving the accuracy of the student model.
Step S223, distilling the Softmax output layer
After the input has passed through the model's final dilated convolution residual block, the output vector z is fed into the Softmax classification layer to obtain the final item prediction probability distribution and complete the item prediction task.
The purpose of Softmax output layer distillation is to make the final item prediction probability distribution of the student model approximate that of the teacher model, thereby learning the teacher model's prediction behavior.
In one embodiment, the distillation process of the Softmax output layer is optimized by minimizing the cross entropy between the final item prediction probability distributions of the student model and the teacher model, with the distillation loss $L_{pred}$ expressed as:

$$L_{pred} = -\mathrm{softmax}(z^T) \cdot \log(\mathrm{softmax}(z^S / t)) \quad (7)$$

where $z^T$ denotes the output vector of the last dilated convolution residual block of the teacher model, $z^S$ denotes the output vector of the last dilated convolution residual block of the student model, and $t$ is a temperature used to control the variation range of the student model's output vector. In practical applications, setting an appropriate value of t controls the convergence speed and prediction accuracy of the student model during training.
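A short sketch of equation (7). Following the source's formulation, the temperature t is applied only to the student logits; the default value below is our own assumption:

```python
import torch.nn.functional as F

def softmax_distill_loss(z_S, z_T, t=2.0):
    # Equation (7): cross entropy between the teacher's distribution and the
    # temperature-scaled student distribution (t = 2.0 is a hypothetical default).
    p_T = F.softmax(z_T.detach(), dim=-1)        # teacher probabilities, fixed
    log_p_S = F.log_softmax(z_S / t, dim=-1)     # student log-probabilities
    return -(p_T * log_p_S).sum(dim=-1).mean()
```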
Through knowledge distillation of the Embedding input layer, the intermediate hidden layer, and the Softmax output layer, the knowledge learned by the teacher model can be well distilled into the student model. The total loss $L_{distill}$ of the whole knowledge distillation process is:

$$L_{distill} = L_{emb} + L_{hidden} + L_{pred} \quad (8)$$
The whole knowledge distillation model is trained on the training data until it converges, yielding a student model with a small model size, few parameters, and good performance for subsequent deployment and use.
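Putting the pieces together, a hedged sketch of one collaborative training step under equation (8), wiring up the hypothetical helpers from the earlier sketches; the optimizer covers the student's parameters plus the mapping matrices, while the teacher stays frozen:

```python
import torch
import torch.nn as nn

student = NextItNetLike(num_items=10000, embed_dim=64, num_blocks=4)  # small M
W_h = nn.Parameter(torch.empty(64, 128))        # hidden-layer mapping matrix
nn.init.xavier_uniform_(W_h)
opt_student = torch.optim.Adam(list(student.parameters()) + [W_e, W_h], lr=1e-3)

def distill_step(batch, w_S, w_T):
    with torch.no_grad():                        # teacher parameters stay fixed
        z_T, H_T = teacher(batch, return_hidden=True)
    z_S, H_S = student(batch, return_hidden=True)
    loss = (embedding_distill_loss(student.embedding.weight,
                                   teacher.embedding.weight)
            + emd_hidden_loss(H_S, H_T, w_S, w_T, W_h)
            + softmax_distill_loss(z_S, z_T, t=2.0))   # equation (8)
    opt_student.zero_grad()
    loss.backward()
    opt_student.step()
    return loss.item()
```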
Step S230: using the trained student model, with the historical browsing sequence of a given user as input, to provide a sequence recommendation service for the user.
In practical applications, sequence recommendation corresponds to the model's test phase, and the final trained student model is used for the recommendation service. Given a user's historical browsing sequence, the trained student model finds the item the user is most likely to be interested in at the next moment, providing a fast and accurate recommendation service.
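For illustration, serving a recommendation then reduces to one forward pass through the distilled student model; a minimal sketch (the top-k cut-off is our own choice):

```python
import torch

@torch.no_grad()
def recommend(student, history, top_k=10):
    """history: (seq_len,) tensor of a user's browsed item ids."""
    student.eval()
    logits = student(history.unsqueeze(0))             # (1, num_items)
    probs = torch.softmax(logits, dim=-1)
    return torch.topk(probs, top_k, dim=-1).indices.squeeze(0)  # likely next items
```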
The student model obtained according to the invention has a small model size, few parameters, and a short inference time when deployed in a practical sequence recommendation system, while maintaining high model accuracy. It can better meet user needs and has very important practical significance and broad application prospects. For example, the invention can recommend items of potential interest according to the user's attributes (such as gender, age, education, region, and occupation) and the user's past behaviors in the system (such as browsing, clicking, searching, purchasing, and favoriting).
Further, to verify the effectiveness and advancement of the proposed method, extensive experiments were carried out on MovieLens, a public dataset in the field of sequence recommendation. The experimental results show that the EMD-distance-based knowledge distillation sequence recommendation method achieves the current best results in terms of model parameter count, inference time, and model performance, can provide fast and accurate recommendation services for users, is well suited for deployment in a sequence recommendation system, and has very important practical significance and broad application prospects.
In conclusion, the invention distills the Embedding input layer, the intermediate hidden layer, and the Softmax output layer, and in the distillation of the intermediate hidden layer uses the EMD distance to measure the difference between the teacher model and the student model, thereby adaptively completing the many-to-many mapping between the hidden layers and avoiding the information loss and information misleading caused by manually specifying a layer-to-layer mapping relationship.
It is to be understood that those skilled in the art can appropriately change or modify the above-described embodiments without departing from the spirit and scope of the present invention. For example, distillation may be performed only on the intermediate hidden layer, or only on the intermediate hidden layer and the Softmax output layer. Likewise, for sequence recommendation models other than NextItNet, the EMD-distance-based knowledge distillation method proposed by the present invention is also applicable. In addition, the losses involved in the training process can also be measured with a squared loss or an exponential loss, which is not limited by the present invention.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.