CN112507209B - Sequence recommendation method for knowledge distillation based on Earth Mover's Distance - Google Patents


Info

Publication number
CN112507209B
Authority
CN
China
Prior art keywords
model
student
student model
teacher model
teacher
Prior art date
Legal status
Active
Application number
CN202011245696.6A
Other languages
Chinese (zh)
Other versions
CN112507209A (en)
Inventor
陈磊 (Chen Lei)
杨敏 (Yang Min)
原发杰 (Yuan Fajie)
李成明 (Li Chengming)
姜青山 (Jiang Qingshan)
Current Assignee
Zhuhai Institute Of Advanced Technology Chinese Academy Of Sciences Co ltd
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011245696.6A
Publication of CN112507209A
Application granted
Publication of CN112507209B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/9535 — Information retrieval; retrieval from the web; search customisation based on user profiles and personalisation
    • G06F 16/958 — Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06N 5/02 — Knowledge representation; symbolic representation
    • G06Q 30/0631 — Electronic shopping; item recommendations

Abstract

The invention discloses a sequence recommendation method for knowledge distillation based on the Earth Mover's Distance (EMD), which comprises the following steps: constructing a teacher model and a corresponding student model; training the teacher model with a set loss function as the objective to obtain a pre-trained teacher model; fixing the parameters of the pre-trained teacher model, adding the student model for collaborative training, and optimizing only the parameters of the student model so as to distill the knowledge of the pre-trained teacher model into the student model, wherein during knowledge distillation a many-to-many mapping between the intermediate hidden layers of the pre-trained teacher model and those of the student model is learned adaptively using the Earth Mover's Distance; and using the trained student model, with the user's historical browsing sequence as input, to provide a sequence recommendation service to the user. The student model obtained by the invention significantly reduces the parameter scale without losing accuracy, thereby providing fast and accurate recommendation service for users.

Description

Sequence recommendation method for knowledge distillation based on Earth Mover's Distance
Technical Field
The invention relates to the technical field of sequence recommendation, and in particular to a sequence recommendation method for knowledge distillation based on the Earth Mover's Distance (EMD).
Background
Recommendation systems have flourished in recent years, attracting attention for their wide application scenarios and huge commercial value. A recommendation system provides commodity information and suggestions to customers through an e-commerce website, helping customers decide what to purchase and simulating sales staff to guide them through the purchase process; personalized recommendation recommends information and commodities of interest according to each user's interest characteristics and purchase behaviors. The sequence recommendation system is an important branch of recommendation systems: it aims to make accurate recommendations by analyzing the user's historical browsing sequence, and has long been a hot research topic in both academia and industry.
At present, sequence recommendation systems have developed rapidly by means of deep learning, using the strong feature extraction and feature modeling capabilities of deep neural networks to fully model user preference representations and thereby provide accurate recommendation services.
Taking the widely used sequence recommendation model NextItNet as an example: it combines a dilated convolutional neural network with a residual network and can model the user's historical browsing sequence well, thereby providing better recommendation service, and it has achieved excellent results in sequence recommendation systems.
The model structure of NextItNet is shown in FIG. 1. It is generally formed by stacking a plurality of structurally identical dilated convolution residual blocks. The user's historical browsing sequence is input into the whole network for modeling; the user preference representation is obtained after the last dilated convolution residual block, and a Softmax classifier finally predicts the item recommended to the user at the next moment.
The output of a dilated convolution residual block in NextItNet is expressed as:

X_{l+1} = X_l + F(X_l)

i.e., the output X_{l+1} of each dilated convolution residual block is its input X_l plus the result F(X_l) of the residual branch. The residual branch F(X_l) processes its input sequentially through dilated convolution layer 1 (Dilated Conv1), layer normalization layer 1 (Layer Norm1), ReLU activation layer 1 (ReLU1), dilated convolution layer 2 (Dilated Conv2), layer normalization layer 2 (Layer Norm2), and ReLU activation layer 2 (ReLU2).
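For illustration, the residual block just described can be sketched in PyTorch as follows. This is a minimal sketch assuming a NextItNet-style causal (left-padded) dilated convolution; the kernel size and the doubling of the dilation rate in the second convolution are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedResidualBlock(nn.Module):
    """One NextItNet-style residual block: two causal dilated 1-D convolutions,
    each followed by layer normalization and a ReLU, added back to the input."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.ln1 = nn.LayerNorm(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=2 * dilation)
        self.ln2 = nn.LayerNorm(channels)
        self.kernel_size, self.dilation = kernel_size, dilation

    def _causal_conv(self, x, conv, dilation):
        # Left-pad so the convolution is causal: position t only sees <= t.
        pad = (self.kernel_size - 1) * dilation
        return conv(F.pad(x, (pad, 0)))

    def forward(self, x):  # x: (batch, channels, seq_len)
        h = self._causal_conv(x, self.conv1, self.dilation)
        h = F.relu(self.ln1(h.transpose(1, 2)).transpose(1, 2))
        h = self._causal_conv(h, self.conv2, 2 * self.dilation)
        h = F.relu(self.ln2(h.transpose(1, 2)).transpose(1, 2))
        return x + h  # X_{l+1} = X_l + F(X_l)
```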
The defect of the prior art is that, when providing recommendation service, the number of model parameters is large and inference is slow, making real-world requirements hard to meet. NextItNet performs well only when a large number of dilated convolution residual blocks are stacked, so the model has a huge number of parameters, and every input user browsing sequence must pass through the complete model to produce a prediction. The trained model is therefore difficult to deploy in practical applications: computation is expensive, inference takes too long, and the actual needs of users are hard to satisfy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a sequence recommendation method for knowledge distillation based on the Earth Mover's Distance (EMD), which distills the knowledge of a teacher model into a student model, reducing the scale of the model parameters without losing model accuracy and thereby accelerating sequence inference.
The sequence recommendation method for knowledge distillation based on the Earth Mover's Distance according to the invention comprises the following steps:
constructing a teacher model and a corresponding student model, wherein both the teacher model and the student model comprise an embedding input layer, intermediate hidden layers stacked from dilated convolution residual blocks, and a classification output layer, and the teacher model contains more intermediate hidden layers than the student model;
training the teacher model on the training data with the set loss function as the objective to obtain a pre-trained teacher model;
fixing the parameters of the pre-trained teacher model, adding the student model for collaborative training, and optimizing only the parameters of the student model so as to distill the knowledge of the pre-trained teacher model into the student model, wherein during knowledge distillation a many-to-many mapping between the intermediate hidden layers of the pre-trained teacher model and those of the student model is learned adaptively using the Earth Mover's Distance;
and using the trained student model, with the user's historical browsing sequence as input, to provide a sequence recommendation service to the user.
Compared with the prior art, the EMD-based sequence recommendation knowledge distillation method of the invention distills the knowledge of a large model (the teacher model) into a small model (the student model). In the distillation of the intermediate hidden layers, the EMD measures the difference between the teacher and student models and adaptively completes the many-to-many mapping between hidden layers, avoiding the information loss and information misleading caused by manually specified layer mappings. The method significantly reduces the model parameter scale without losing model accuracy and accelerates inference, providing fast and accurate recommendation service to users in practical applications; it therefore has very important practical significance and broad application prospects.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of the NextItNet model architecture according to the prior art;
FIG. 2 is a flow diagram of a sequence recommendation method for knowledge distillation based on EMD according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of the knowledge distillation process between a teacher model and a student model according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In brief, the EMD-based sequence recommendation method for knowledge distillation provided by the invention comprises: training a teacher model used to guide the learning process of the student model; using the teacher model to guide the training and learning of the student model, so that the knowledge learned by the teacher model is distilled into the student model; and deploying the student model for subsequent sequence recommendation applications. Hereinafter the NextItNet model structure is described as an example, in which both the teacher model and the student model comprise an embedding input layer (Embedding layer), intermediate hidden layers stacked from dilated convolution residual blocks, and a classification output layer, and the teacher model contains more intermediate hidden layers than the student model.
Specifically, referring to FIG. 2, the sequence recommendation method for knowledge distillation based on EMD provided by this embodiment comprises the following steps.
Step S210: train the teacher model with the set loss function as the objective to obtain a pre-trained teacher model.
First, an effective teacher model must be trained in advance so that it can guide the training of the subsequent student model.
Taking the sequence recommendation model NextItNet as an example, the constructed teacher model generally consists of an embedding input layer, intermediate hidden layers stacked from N structurally identical dilated convolution residual blocks, and a final Softmax classification output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. To give the teacher model stronger feature characterization capability, more dilated convolution residual blocks are stacked. During pre-training, the user historical browsing sequences in the dataset are input into the whole network for modeling, and the user preference representation is obtained, so that the recommendation for the user at the next moment is accurate.
For example, during training of the teacher model, the input is the user's historical browsing sequence, the output is the item recommended to the user at the next moment, and the loss function is the cross entropy between the correct item and the predicted item. The total loss L_pretrain is calculated as:

L_pretrain = -∑_{i=1}^{T} ŷ_i · log(y_i)    (1)

where ŷ_i is the correct item label, y_i is the predicted item label, and T is the total number of training samples.
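Equation (1) is the standard next-item cross entropy; a minimal PyTorch sketch follows, where the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def pretrain_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross entropy of equation (1), averaged over the T training positions.
    logits:  (T, num_items) unnormalized next-item scores
    targets: (T,) indices of the correct next items"""
    return F.cross_entropy(logits, targets)
```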
Step S220: fix the parameters of the pre-trained teacher model, add the student model for collaborative training, and optimize only the parameters of the student model, so as to distill the knowledge of the pre-trained teacher model into the student model.
After the pre-training stage of step S210, the teacher model achieves high recommendation accuracy, but because the model is large in scale, has many parameters, and infers slowly, it is difficult to meet real-world requirements. It is therefore necessary to distill the knowledge learned by the teacher model into the student model, reducing the model scale and accelerating inference without reducing model accuracy, so that the recommendation service is completed better.
The student model also adopts the NextItNet model structure and generally consists of an embedding input layer, intermediate hidden layers stacked from M structurally identical dilated convolution residual blocks, and a final Softmax output layer, where every two dilated convolution layers are packaged into one dilated convolution residual block. The difference from the teacher model is that the number M of residual blocks stacked in the student model is far smaller than the number N stacked in the teacher model.
In step S220, the pre-trained teacher model is used, the student model is added for collaborative training, and knowledge distillation lets the teacher model guide the training and learning of the student model.
In one embodiment, in order to distill the knowledge of the teacher model into the student model well, the knowledge distillation process comprises three parts: distillation of the embedding input layer, of the intermediate hidden layers, and of the Softmax output layer.
For example, referring to FIG. 3, the setup generally comprises a pre-trained teacher model T and a student model S; the parameters of the teacher model T are fixed during training, the student model S is added for collaborative training, and only the parameters of the student model S are optimized. The overall knowledge distillation process comprises the following steps.
step S221, distilling the Embedding input layer
The embedding input layer converts the user's historical browsing sequence X = (x_1, x_2, ..., x_n) into an embedding matrix E = [e_1, e_2, ..., e_n], a representation of the meaning of the sequence that the network can process, where each column e_i of the matrix is the embedding vector of the corresponding item.
The purpose of embedding input layer distillation is to distill the knowledge of the teacher model's embedding matrix E^T into the student model's embedding matrix E^S.
In one embodiment, the distillation process of the embedding input layer is optimized by minimizing the mean square error (MSE); the loss L_emb is calculated as:

L_emb = MSE(E^S · W_e, E^T)    (2)

where MSE(·) denotes the mean-square-error calculation and W_e is a learnable linear mapping matrix.
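A minimal sketch of equation (2) in PyTorch, assuming the student embedding dimension may differ from the teacher's so that the learnable map W_e projects student embeddings into the teacher's space; the dimension names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_distill_loss(E_s: torch.Tensor, E_t: torch.Tensor,
                           W_e: nn.Linear) -> torch.Tensor:
    """L_emb = MSE(E^S · W_e, E^T), equation (2).
    E_s: (n, d_s) student embedding matrix; E_t: (n, d_t) teacher embedding
    matrix; W_e maps the student space (d_s) into the teacher space (d_t)."""
    return F.mse_loss(W_e(E_s), E_t)

# W_e is learned jointly with the student, e.g.:
# W_e = nn.Linear(d_s, d_t, bias=False)
```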
Step S222, distilling the intermediate hidden layers
Knowledge distillation of the intermediate hidden layers lets the student model learn the behavior of the teacher model better, but because the number N of dilated convolution residual blocks stacked in the teacher model differs from the number M stacked in the student model, how to complete a many-to-many mapping between the different layers becomes the key problem.
In one embodiment, the invention proposes to measure the difference between the teacher and student models with the Earth Mover's Distance (EMD), thereby completing the many-to-many mapping between hidden layers adaptively and avoiding the information loss and information misleading caused by manually specified layer mappings.
Specifically, let

H^T = [h_1^T, h_2^T, ..., h_N^T]

denote the outputs of the different layers of the teacher model's intermediate hidden layers, and

H^S = [h_1^S, h_2^S, ..., h_M^S]

denote the outputs of the different layers of the student model's intermediate hidden layers, where N and M are the numbers of dilated convolution residual blocks stacked in the teacher and student models respectively. h_j^T is the output vector of the j-th intermediate hidden layer of the teacher model; its weight coefficient w_j^T is initialized to 1/N and optimized during training. Likewise, h_i^S is the output vector of the i-th intermediate hidden layer of the student model; its weight coefficient w_i^S is initialized to 1/M and can be optimized during training.

Define a ground distance matrix D = [d_ij] ∈ R^{M×N}, where d_ij is the distance from the output vector h_j^T of the teacher's j-th intermediate hidden layer to the output vector h_i^S of the student's i-th intermediate hidden layer.

In one embodiment, this distance is calculated with the mean square error (MSE), so d_ij is expressed as:

d_ij = MSE(h_i^S · W_h, h_j^T)    (3)

where W_h is a learnable linear mapping matrix.

In addition, define a mapping transfer matrix F = [f_ij] ∈ R^{M×N}, where f_ij is the amount of mapping transferred from the teacher's j-th hidden layer output h_j^T to the student's i-th hidden layer output h_i^S. The overall transfer loss between the intermediate hidden layers of the teacher and student models can then be calculated as:

min_F ∑_{i=1}^{M} ∑_{j=1}^{N} f_ij · d_ij    (4)

subject to the following constraints:

f_ij ≥ 0,  1 ≤ i ≤ M, 1 ≤ j ≤ N
∑_{j=1}^{N} f_ij = w_i^S,  1 ≤ i ≤ M
∑_{i=1}^{M} f_ij = w_j^T,  1 ≤ j ≤ N
∑_{i=1}^{M} ∑_{j=1}^{N} f_ij = min( ∑_{i=1}^{M} w_i^S, ∑_{j=1}^{N} w_j^T )

This optimization problem can be solved by linear programming to obtain the optimal mapping transfer matrix F* = [f*_ij]. The Earth Mover's Distance can then be defined as:

EMD(H^S, H^T) = ( ∑_{i=1}^{M} ∑_{j=1}^{N} f*_ij · d_ij ) / ( ∑_{i=1}^{M} ∑_{j=1}^{N} f*_ij )    (5)

Finally, knowledge distillation of the intermediate hidden layers is performed by optimizing the EMD between the teacher model's intermediate hidden layer output matrix H^T and the student model's intermediate hidden layer output matrix H^S; the intermediate hidden layer distillation loss L_hidden is expressed as:

L_hidden = EMD(H^S, H^T)    (6)
In summary, by designing the mapping transfer matrix and measuring the difference between the teacher and student models with the EMD, the many-to-many mapping between the intermediate hidden layers of the teacher model and those of the student model is completed adaptively, avoiding the information loss and information misleading caused by manually specified layer mappings; combining the designed mapping transfer matrix with the EMD minimizes the loss of the knowledge distillation process, i.e., it preserves the accuracy of the student model.
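The transport problem of equations (4)-(5) is a small linear program with M·N variables, so it can be solved with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog under the assumption that the layer weights each sum to 1, which makes the total-flow constraint in (4) redundant; in training, F* would typically be treated as a constant and gradients would flow back through the distance matrix d.

```python
import numpy as np
from scipy.optimize import linprog

def emd_hidden_loss(d: np.ndarray, w_s: np.ndarray, w_t: np.ndarray) -> float:
    """Equations (4)-(6): minimize sum_ij f_ij * d_ij subject to the row and
    column marginals, then return EMD = (sum_ij f*_ij d_ij) / (sum_ij f*_ij).
    d:   (M, N) ground distances, d[i, j] = MSE(h_i^S W_h, h_j^T)
    w_s: (M,) student layer weights; w_t: (N,) teacher layer weights."""
    M, N = d.shape
    A_eq, b_eq = [], []
    for i in range(M):                       # sum_j f[i, j] = w_s[i]
        row = np.zeros(M * N)
        row[i * N:(i + 1) * N] = 1.0
        A_eq.append(row); b_eq.append(w_s[i])
    for j in range(N):                       # sum_i f[i, j] = w_t[j]
        col = np.zeros(M * N)
        col[j::N] = 1.0
        A_eq.append(col); b_eq.append(w_t[j])
    res = linprog(d.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    f_star = res.x                           # optimal mapping transfer matrix F*
    return float(f_star @ d.ravel()) / float(f_star.sum())
```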
Step S223, distilling the Softmax output layer
After the final dilated convolution residual block of the model, the output vector z is fed into the Softmax classification layer to obtain the final item prediction probability distribution and complete the item prediction task.
The purpose of Softmax output layer distillation is to bring the student model's final item prediction probability distribution close to that of the teacher model, so as to learn the prediction behavior of the teacher model.
In one embodiment, the distillation process of the Softmax output layer is optimized by minimizing the cross entropy between the final item prediction probability distributions of the student model and the teacher model; the distillation loss L_pred is expressed as:

L_pred = -softmax(z^T) · log(softmax(z^S / t))    (7)

where z^T is the output vector of the teacher model's last dilated convolution residual block, z^S is the output vector of the student model's last dilated convolution residual block, and t is a temperature controlling the range of variation of the student model's output. In practical applications, setting a suitable value of t controls the convergence speed and prediction accuracy of the student model during training.
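A minimal PyTorch sketch of equation (7); following the formula as written, only the student's logits are divided by the temperature t, and the default value of t is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def softmax_distill_loss(z_s: torch.Tensor, z_t: torch.Tensor,
                         t: float = 2.0) -> torch.Tensor:
    """L_pred = -softmax(z^T) · log(softmax(z^S / t)), equation (7).
    z_s, z_t: (batch, num_items) logits of the student / teacher;
    t controls how soft the student's distribution is pushed to be."""
    p_teacher = F.softmax(z_t, dim=-1)              # teacher distribution, fixed
    log_p_student = F.log_softmax(z_s / t, dim=-1)  # temperature-scaled student
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```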
Through knowledge distillation of the embedding input layer, the intermediate hidden layers, and the Softmax output layer, the knowledge learned by the teacher model is distilled well into the student model. The total loss L_distill of the whole knowledge distillation process is:

L_distill = L_emb + L_hidden + L_pred    (8)
the whole knowledge distillation model is trained by utilizing the training data until the model converges, and a student model with small model scale, small parameter quantity and good effect can be obtained for subsequent deployment and use.
Step S230: use the trained student model, with a given user's historical browsing sequence as input, to provide a sequence recommendation service to the user.
In practical applications, sequence recommendation corresponds to the test phase of the model, and the final trained student model provides the recommendation service. Given a user's historical browsing sequence, the trained student model finds the item the user is most likely to be interested in at the next moment, providing fast and accurate recommendation service.
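At serving time the student is a plain next-item predictor. A hedged sketch of top-k recommendation follows, assuming the trained `student` maps an item-id sequence to next-item logits:

```python
import torch

@torch.no_grad()
def recommend_next(student, history: torch.Tensor, k: int = 10) -> list:
    """Return the k items the trained student model scores highest as the
    user's next interaction, given the historical browsing sequence."""
    student.eval()
    logits = student(history.unsqueeze(0))   # (1, num_items) next-item scores
    return torch.topk(logits.squeeze(0), k).indices.tolist()
```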
The student model obtained according to the invention achieves a small model scale, few parameters, and short inference time when deployed in a sequence recommendation system, while maintaining high model accuracy; it meets user needs well and has very important practical significance and broad application prospects. For example, the invention can recommend items of potential interest according to the user's attributes (such as gender, age, education, region, and occupation) and the user's past behaviors in the system (such as browsing, clicking, searching, purchasing, and favoriting).
Further, to verify the effectiveness and advancement of the proposed method, extensive experiments were carried out on MovieLens, a public dataset in the sequence recommendation field. The experimental results show that the EMD-based sequence recommendation knowledge distillation method achieves the current best results in model parameter count, inference time, and model performance; it provides fast and accurate recommendation service to users and is well suited for deployment in sequence recommendation systems, with very important practical significance and broad application prospects.
In conclusion, the invention distills the embedding input layer, the intermediate hidden layers, and the Softmax output layer, and in the distillation of the intermediate hidden layers uses the EMD to measure the difference between the teacher and student models, thereby completing the many-to-many mapping between hidden layers adaptively and avoiding the information loss and information misleading caused by manually specified layer mappings.
It is to be understood that those skilled in the art can appropriately change or modify the above-described embodiments without departing from the spirit and scope of the present invention. For example, distillation may be performed only on the intermediate hidden layers, or only on the intermediate hidden layers and the Softmax output layer. The proposed EMD-based knowledge distillation method is also applicable to sequence recommendation models other than NextItNet. In addition, the losses involved in training can also be measured with a squared loss or an exponential loss, which the invention does not limit.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (8)

1. A sequence recommendation method for knowledge distillation based on the Earth Mover's Distance, comprising the following steps:
constructing a teacher model and a corresponding student model, wherein both the teacher model and the student model comprise an embedding input layer, intermediate hidden layers stacked from dilated convolution residual blocks, and a classification output layer, and the teacher model contains more intermediate hidden layers than the student model;
training the teacher model with the set loss function as the objective to obtain a pre-trained teacher model;
fixing the parameters of the pre-trained teacher model, adding the student model for collaborative training, and optimizing only the parameters of the student model so as to distill the knowledge of the pre-trained teacher model into the student model, wherein during knowledge distillation a many-to-many mapping between the intermediate hidden layers of the pre-trained teacher model and those of the student model is learned adaptively using the Earth Mover's Distance;
providing a sequence recommendation service for a user by using the trained student model with the user's historical browsing sequence as input;
wherein the knowledge distillation process of the intermediate hidden layers comprises:
representing the outputs of the different layers of the teacher model's intermediate hidden layers as H^T = [h_1^T, h_2^T, ..., h_N^T] and the outputs of the different layers of the student model's intermediate hidden layers as H^S = [h_1^S, h_2^S, ..., h_M^S], where N and M represent the numbers of dilated convolution residual blocks stacked in the teacher model and the student model respectively, h_j^T is the output vector of the j-th intermediate hidden layer of the teacher model with corresponding weight coefficient w_j^T, and h_i^S is the output vector of the i-th intermediate hidden layer of the student model with corresponding weight coefficient w_i^S;
defining a mapping transfer matrix F = [f_ij] ∈ R^{M×N}, where f_ij is the amount of mapping transferred from the output vector h_j^T of the teacher model's j-th intermediate hidden layer to the output vector h_i^S of the student model's i-th intermediate hidden layer;
obtaining the optimal mapping transfer matrix F* = [f*_ij] by solving for the overall transfer loss between the intermediate hidden layers of the teacher model and the student model, wherein the calculation of the overall transfer loss is represented as:

min_F ∑_{i=1}^{M} ∑_{j=1}^{N} f_ij · d_ij

subject to the following constraints:

f_ij ≥ 0,  1 ≤ i ≤ M, 1 ≤ j ≤ N
∑_{j=1}^{N} f_ij = w_i^S,  1 ≤ i ≤ M
∑_{i=1}^{M} f_ij = w_j^T,  1 ≤ j ≤ N
∑_{i=1}^{M} ∑_{j=1}^{N} f_ij = min( ∑_{i=1}^{M} w_i^S, ∑_{j=1}^{N} w_j^T );

after obtaining the optimal mapping transfer matrix F*, defining the Earth Mover's Distance as:

EMD(H^S, H^T) = ( ∑_{i=1}^{M} ∑_{j=1}^{N} f*_ij · d_ij ) / ( ∑_{i=1}^{M} ∑_{j=1}^{N} f*_ij )

where d_ij is the distance from the output vector h_j^T of the teacher model's j-th intermediate hidden layer to the output vector h_i^S of the student model's i-th intermediate hidden layer;
and performing knowledge distillation of the intermediate hidden layers by optimizing the Earth Mover's Distance between the teacher model's intermediate hidden layer output matrix H^T and the student model's intermediate hidden layer output matrix H^S.
2. The method of claim 1, wherein the knowledge distillation process of the embedding input layer comprises:
expressing the user's historical browsing sequence X = (x_1, x_2, ..., x_n) as an embedding matrix E = [e_1, e_2, ..., e_n], where each column e_i of the matrix is the embedding vector of the corresponding item;
representing the embedding input layer of the student model as an embedding matrix E^S;
and, during collaborative training, learning a linear mapping matrix between the teacher model's embedding matrix E^T and the student model's embedding matrix E^S with a set embedding input layer distillation loss as the objective.
3. The method of claim 2, wherein the embedding input layer distillation loss is represented by minimizing the mean square error:

L_emb = MSE(E^S · W_e, E^T)

where MSE(·) denotes the mean-square-error calculation and W_e is a learnable linear mapping matrix.
4. The method of claim 1, wherein the distance d_ij used in the Earth Mover's Distance is calculated with the mean square error and expressed as:

d_ij = MSE(h_i^S · W_h, h_j^T)

where W_h is a learnable linear mapping matrix.
5. The method of claim 1, wherein the classification output layer is a Softmax output layer whose distillation process is controlled by minimizing the cross entropy between the recommendation prediction probability distributions of the student model and the teacher model, with the distillation loss L_pred expressed as:

L_pred = -softmax(z^T) · log(softmax(z^S / t))

where z^T is the output vector of the teacher model's final dilated convolution residual block, z^S is the output vector of the student model's final dilated convolution residual block, and t is a temperature controlling the range of variation of the student model's output vector.
6. The method of claim 1, wherein during pre-training of the teacher model the input is the user's historical browsing sequence, the output is the item recommended to the user at the next moment, and the loss function is the cross entropy between the correct recommended item and the predicted recommended item, expressed as:

L_pretrain = -∑_{i=1}^{T} ŷ_i · log(y_i)

where ŷ_i is the correct recommended item label, y_i is the predicted recommended item label, and T is the total number of training samples.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
8. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the program.
CN202011245696.6A 2020-11-10 2020-11-10 Sequence recommendation method for knowledge distillation based on Earth Mover's Distance Active CN112507209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011245696.6A CN112507209B (en) 2020-11-10 2020-11-10 Sequence recommendation method for knowledge distillation based on Earth Mover's Distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011245696.6A CN112507209B (en) 2020-11-10 2020-11-10 Sequence recommendation method for knowledge distillation based on Earth Mover's Distance

Publications (2)

Publication Number Publication Date
CN112507209A CN112507209A (en) 2021-03-16
CN112507209B true CN112507209B (en) 2022-07-05

Family

ID=74957465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011245696.6A Active CN112507209B (en) 2020-11-10 2020-11-10 Sequence recommendation method for knowledge distillation based on Earth Mover's Distance

Country Status (1)

Country Link
CN (1) CN112507209B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257361B (en) * 2021-05-31 2021-11-23 中国科学院深圳先进技术研究院 Method, device and equipment for realizing self-adaptive protein prediction framework
CN113420123A (en) * 2021-06-24 2021-09-21 中国科学院声学研究所 Language model training method, NLP task processing method and device
CN113590958B (en) * 2021-08-02 2023-10-24 中国科学院深圳先进技术研究院 Continuous learning method of sequence recommendation model based on sample playback
CN113627545B (en) * 2021-08-16 2023-08-08 山东大学 Image classification method and system based on isomorphic multi-teacher guiding knowledge distillation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083829A1 (en) * 2015-09-18 2017-03-23 Samsung Electronics Co., Ltd. Model training method and apparatus, and data recognizing method
US10692602B1 (en) * 2017-09-18 2020-06-23 Deeptradiology, Inc. Structuring free text medical reports with forced taxonomies
CN111368995A (en) * 2020-02-14 2020-07-03 中国科学院深圳先进技术研究院 General network compression framework and compression method based on sequence recommendation system
CN111553479A (en) * 2020-05-13 2020-08-18 鼎富智能科技有限公司 Model distillation method, text retrieval method and text retrieval device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169052B (en) * 2017-04-26 2019-03-05 北京小度信息科技有限公司 Recommended method and device
CN110598120A (en) * 2019-10-16 2019-12-20 信雅达系统工程股份有限公司 Behavior data based financing recommendation method, device and equipment
CN111428130B (en) * 2020-03-06 2023-04-18 云知声智能科技股份有限公司 Method and device for enhancing text data in knowledge distillation process


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hou Weidong. Research on multi-attribute classification algorithms for human-body images for mobile applications. China Master's Theses Full-text Database, Information Science and Technology, 2020. *
Zhou Su et al. Research on drivable-area segmentation algorithms for vehicles based on knowledge distillation. Automobile Technology, 2020, no. 01: 1-5. *

Also Published As

Publication number Publication date
CN112507209A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112507209B (en) Sequence recommendation method for knowledge distillation based on Earth Mover's Distance
CN110366734B (en) Optimizing neural network architecture
US11741355B2 (en) Training of student neural network with teacher neural networks
McClure TensorFlow machine learning cookbook
US20190244604A1 (en) Model learning device, method therefor, and program
CN111339415A (en) Click rate prediction method and device based on multi-interactive attention network
CN111046294A (en) Click rate prediction method, recommendation method, model, device and equipment
CN111931054B (en) Sequence recommendation method and system based on improved residual error structure
US11740879B2 (en) Creating user interface using machine learning
CN111950593A (en) Method and device for recommending model training
CN111931057A (en) Sequence recommendation method and system for self-adaptive output
CN111259647A (en) Question and answer text matching method, device, medium and electronic equipment based on artificial intelligence
Sarang Artificial neural networks with TensorFlow 2
Habib Hands-on Q-learning with python: Practical Q-learning with openai gym, Keras, and tensorflow
Ayyadevara Neural Networks with Keras Cookbook: Over 70 recipes leveraging deep learning techniques across image, text, audio, and game bots
CN111461757B (en) Information processing method and device, computer storage medium and electronic equipment
US20210081792A1 (en) Neural network learning apparatus, neural network learning method and program
CN110505520B (en) Information recommendation method and system, medium and electronic device
CN111967941B (en) Method for constructing sequence recommendation model and sequence recommendation method
US20220027739A1 (en) Search space exploration for deep learning
CN116484868A (en) Cross-domain named entity recognition method and device based on diffusion model generation
KR102401115B1 (en) Artificial neural network Automatic design generation apparatus and method using UX-bit, skip connection structure and channel-wise concatenation structure
Fuentes Mastering Predictive Analytics with scikit-learn and TensorFlow: Implement machine learning techniques to build advanced predictive models using Python
CN117043778A (en) Generating a learned representation of a digital circuit design
Moocarme et al. The TensorFlow Workshop: A hands-on guide to building deep learning models from scratch using real-world datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240222

Address after: 519085 101, Building 5, Longyuan Smart Industrial Park, No. 2, Hagongda Road, Tangjiawan Town, High-tech Zone, Zhuhai City, Guangdong Province

Patentee after: ZHUHAI INSTITUTE OF ADVANCED TECHNOLOGY CHINESE ACADEMY OF SCIENCES Co.,Ltd.

Country or region after: China

Address before: 1068 No. 518055 Guangdong city of Shenzhen province Nanshan District Shenzhen University city academy Avenue

Patentee before: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY CHINESE ACADEMY OF SCIENCES

Country or region before: China
