CN114610500B - Edge caching method based on model distillation - Google Patents

Edge caching method based on model distillation

Info

Publication number
CN114610500B
CN114610500B
Authority
CN
China
Prior art keywords
model
student
output
user
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210286001.1A
Other languages
Chinese (zh)
Other versions
CN114610500A (en)
Inventor
吕翊
李富祥
李职杜
吴大鹏
钟艾玲
王汝言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210286001.1A priority Critical patent/CN114610500B/en
Publication of CN114610500A publication Critical patent/CN114610500A/en
Application granted granted Critical
Publication of CN114610500B publication Critical patent/CN114610500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an edge caching method based on model distillation, belonging to the field of wireless communication. User-side data collected by edge servers are first aggregated at the cloud center and preprocessed, and a teacher model is designed and trained to predict users' preferences for content. A student model is then deployed at the edge server and learns the local user preferences under the base-station coverage by sharing the parameters of the teacher model. Finally, according to the obtained user preferences and combined with the activity of the group users, a group caching strategy is formulated and the cache hit rate is optimized. The invention can save wireless-communication link resources, improve the resource utilization of the edge server, and improve users' quality of service.

Description

Edge caching method based on model distillation
Technical Field
The invention belongs to the field of wireless communication, and relates to an edge caching method based on model distillation.
Background
With the rapid development of communication technology, entertainment activities such as movies, live streaming and short videos have gradually become an integral part of people's daily lives, and mobile data traffic grows year by year. In the traditional approach, all resources are stored centrally in the cloud center, and content requested by the user side is distributed from the cloud center to users through edge base stations; however, in video-on-demand scenarios such as movies and short videos, this greatly increases network response delay and link congestion. Edge caching techniques have been developed to address these difficulties. Edge caching places part of the cloud center's content in advance on edge base stations, user equipment or capable vehicles, which reduces the traffic load on the backhaul link, lowers cost, shortens transmission delay, and improves the user experience.
Given the massive amount of content in the big-data era, selecting in advance the part of the cloud center's vast resources to cache at an edge base station is a major challenge for edge caching. Since the objects served by edge caching are users, caching according to the group characteristics of the users covered by an edge base station has become a research hotspot; however, mining the characteristics of group users is particularly difficult when the user group and user requests change constantly. The objects cached by the edge server are contents, and caching according to a series of content characteristics, such as update time, click-through rate and viewing duration, can relieve the load pressure on the cloud center to some extent. Some researchers therefore start from the content and focus on analyzing and mining the characteristics of content within a group in order to design caching strategies.
However, edge servers have less data and weaker computing power than the cloud center and cannot train on as much data or with models as complex as the cloud center can. On the other hand, user interests change constantly, and analyzing them at the cloud center takes considerable time; deploying a lightweight model at the edge base station makes it possible to rapidly analyze the personalized user preferences under the base station's coverage. Finally, how to merge personalized user preferences into group interest preferences is also a challenge.
Therefore, it is necessary to deploy a lightweight prediction model on the edge server to analyze and predict users' dynamic interests, derive group preferences from the personalized user preferences, optimize the cached content, and improve the quality of service of the edge server.
Disclosure of Invention
In view of the above, the present invention aims to provide an edge caching method based on model distillation. Aiming at the problem of a low cache hit rate caused by the weak computing power and small data volume of an edge server in the edge caching scenario of short videos, the method trains a teacher model at the cloud-center end, trains a student model at the edge-server end using a model distillation technique, and predicts user preferences, thereby improving the hit rate and quality of service of the edge server.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The edge caching method based on model distillation first aggregates the user-side data collected by the edge servers at the cloud center, preprocesses the data, and designs and trains a teacher model to predict users' preferences for content; a student model is then deployed at the edge server and learns the local user preferences under the base-station coverage by sharing the parameters of the teacher model; finally, according to the obtained user preferences and combined with the activity of the group users, a group caching strategy is formulated and the cache hit rate is optimized. The method specifically comprises the following steps:
S1: input data acquisition and pretreatment: the data collected by the user end are generally disordered, and after the data are cleaned, the data are mainly composed of two parts of characteristics, namely continuous characteristics and discrete characteristics, and the two types of characteristics are respectively encoded in an edge server and a cloud center by using different encoding modes, so that the training of a subsequent teacher model and a student model is facilitated;
S2: training a teacher model: inputting the data which is output in the step S1 and is preprocessed by the cloud center into a teacher model deployed by the cloud center for training;
s3: training a student model: inputting the data preprocessed by the edge base station output in the step S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model;
S4: group caching policy: according to personalized preferences of users, which are predicted by the student model in the step S3, combining the liveness of the group users to form interest preferences of the group users, and selecting the content of Top-k with the highest user preference according to the cache capacity of the edge server for caching;
s5: optimizing a cache result: and optimizing the hit rate of the cache according to the cache strategy.
Further, the step S1 specifically includes the following steps:
S11: the user features E U and the content features E C input from the input layer may contain a number of discrete features, such as the gender, occupation, equipment model, category of content of the user, including the personalized preferences T B of the predicted target user, etc., which may be encoded using one-hot encoding, assuming that any one of the one-hot encoded discrete features is denoted as f d.
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps ofD represents a set of categories of discrete features f d, thereby encoding a one-dimensional discrete feature into a dimension vector containing only 0 and 1; the input layer after single thermal encoding is characterized by being F d:
Fd=f(EU,EC,TB)
Wherein f (·) represents the one-hot encoding of the discrete features;
s12: for the collected continuous features, such as age, viewing duration, viewing integrity, user behavior sequence and the like, feature embedding is used for encoding, so that low-dimensional dense embedded features Y= [ Y 1,y2,y3...,yk ] are obtained;
Wherein, Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>Representing the bias vector, thereby converting the high-dimensional sparse features into a low-dimensional dense vector; the characteristic of the input layer after embedded coding is marked as F y:
Fy=g(EU,EC,TC)
where g (-) represents the embedded encoding of the continuous feature.
Further, the step S2 specifically includes the following steps:
S21: gating cycle unit GRU: as the user's interests are dynamically changing over time, the GRU is used to model the user's behavior sequence. GRUs perform better in modeling long sequences of user behavior than recurrent neural networks. The GRU model consists of an update gate and a reset gate, wherein the update gate determines how much previous information in a user behavior sequence needs to be reserved and transmitted to the next layer, and the reset gate determines how much previous information needs to be ignored; the GRU model is expressed as follows:
zt=σ(Wz(Fd+Fy)+Nzht-1+bz)
rt=σ(Wr(Fd+Fy)+Nrht-1+br)
Wherein, z t,rt is a total number, H t denotes update gate, reset gate, candidate hidden state vector, hidden state vector of current time step, σ is sigmoid activation function, W z,Wr,Wh and N z,Nr,Nh are training parameters, b z,br,bh denotes bias, and Σ denotes hadamard product;
S22: multi-head self-attention mechanism: the user's requests are diverse and in order to extract the user's primary interests from the user behavior sequence, a long sequence of users is analyzed using a multi-headed self-attention mechanism. Traditional attention mechanisms may be subject to errors in the extraction of user importance due to other noise information of the sequence when analyzing the user's important interests. The multi-head self-attention mechanism is to analyze and extract information from a target sequence for multiple times, integrate different output results, and increase the accuracy of main interest positioning of a user:
headi=Attention(QWi Q,KWi K,VWi V)
MultiHead(Q,K,V)=Concat(head1,head2,…,headh)WO
wherein q= [ h 1,h2,…,ht ] represents the output sequence of LSTM, k=v represents the key and value of the output sequence, respectively, head i represents the i-th head in multi-head self-attention, concat (·) represents the concatenation operation, W i Q,Wi K,Wi V and W O represent training parameters of the transition matrix, multiHead t (Q, K, V) represents the output of the teacher model multi-head self-attention;
S23: fully connected neural network: since the matrix of the multi-head self-attention output belongs to a high-dimensional sparse matrix, and in order to share training parameters of the teacher model with the student model, a fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model are of the same structure:
……
Wherein the method comprises the steps of Training parameters of j-th layer neural network of teacher model are expressed,/>Bias term representing j-th layer neural network of teacher model,/>Representing the output of the j-th layer neural network of the teacher model;
s24: loss function setting: for the teacher model, fitting is performed using a log-cross entropy loss function because the predicted user preference belongs to the classification problem:
Where y t is the output of the teacher model, σ represents the activation function, user preference is a classification problem, using the softmax activation function, N represents the data volume of the training set, Data representing a training set, f t (·) represents a teacher model.
Further, the step S3 specifically includes:
S31: teacher model parameter sharing: compared with the cloud center, the edge server has smaller data size and weaker computing power, and in order to accelerate training of the student model, parameters of the fully-connected neural network layer of the teacher model are shared with the student model, so that training speed of the student model can be accelerated, and the characteristics of the teacher model can be combined, so that training effect of the student model is better.
……
Wherein the method comprises the steps ofTraining parameters representing j-th layer neural network of student model,/>Bias term representing j-th layer neural network of student model MultiHead s (Q, K, V) represents multi-head self-attention output of student model,/>Representing the output of the j-th layer neural network of the student model;
S32: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weaker computing power. Model distillation mainly shortens training time of a student model through parameter sharing and loss distillation between a teacher model and the student model; the student model prediction model predicts the personalized preference of the group user, and therefore the loss function used is still a log-cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
Where y s represents the predicted output of the student model, f s (·) represents the student model, L s represents the log-cross entropy loss of the student model, and L t/s represents the loss distillation of the student model.
Further, the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under each base station's coverage according to model distillation;
S42: setting a caching criterion by combining the activity of the group users, sorting, and selecting the Top-k contents for caching.
The invention has the beneficial effects that: aiming at the problem of low cache hit rate caused by weak computing capacity and small data volume of an edge server in an edge cache scene, the invention provides the edge cache method based on model distillation, which is used for training a teacher model at a cloud center end, training a student model at the edge server end by using a model distillation technology, predicting user preference and improving the hit rate and service quality of the edge server.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a scene graph of the present invention;
FIG. 2 is a system flow diagram of the present invention;
FIG. 3 (a) is a diagram of the teacher model in model distillation according to the present invention, and FIG. 3 (b) is a diagram of the student model in model distillation according to the present invention;
FIG. 4 is a schematic diagram of a model distillation training scheme in accordance with the present invention.
Detailed Description
Other advantages and effects of the present invention will readily become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention may also be practiced or applied in other, different embodiments, and the details of this specification may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with one another provided there is no conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1 to fig. 4, the edge caching method based on model distillation according to the present invention specifically includes the following steps:
Step 1: Input data acquisition and preprocessing: the data collected at the user end are generally unordered; after cleaning, they consist mainly of two kinds of features, continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively, facilitating the subsequent training of the teacher model and the student model. Step 1 specifically includes the following steps:
Step 1.1: first, the user features E U and the content features E C input from the input layer may contain a number of discrete features, such as the gender, occupation, equipment model, category of content of the user, including the personalized preferences T B of the predicted target user, etc., which may be encoded using one-hot encoding. Assume that any one of the unithermally encoded discrete features is denoted as f d.
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps ofD represents a set of categories of discrete features f d. Thus, a one-dimensional discrete feature is encoded into a dimension vector is shown containing only 0 and 1D. The input layer after single thermal coding is marked as F d
Fd=f(EU,EC,TB)
Where f (·) represents the one-hot encoding of the discrete features.
Step 1.2: for the collected continuous features, such as age, viewing duration, viewing integrity, user behavior sequence, etc., feature embedding is used to encode the resulting low-dimensional dense embedded feature y= [ Y 1,y2,y3...,yk ].
Wherein,Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>The bias vectors are represented such that the high-dimensional sparse features are converted into low-dimensional dense vectors. The characteristic of the input layer after embedded coding is marked as F y
Fy=g(EU,EC,TC)
Where g (-) represents the embedded encoding of the continuous feature.
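By way of a non-limiting illustration of step 1, the following sketch shows how the two encodings f(·) and g(·) could be realized. The choice of PyTorch, the dimensions and all variable names are assumptions made for illustration only and are not prescribed by this description.

```python
# Illustrative sketch of step 1 (feature preprocessing); framework choice
# (PyTorch) and all names/dimensions are assumptions, not part of the patent.
import torch
import torch.nn as nn

# --- discrete features: one-hot encoding f(.) -------------------------------
def one_hot_encode(category_index: int, num_categories: int) -> torch.Tensor:
    """Encode one discrete feature value into a ||D||-dimensional 0/1 vector f_d."""
    return torch.nn.functional.one_hot(
        torch.tensor(category_index), num_classes=num_categories
    ).float()

# --- continuous/sparse features: embedding g(.) -----------------------------
class FeatureEmbedding(nn.Module):
    """Project an m-dimensional sparse feature vector X to a k-dimensional
    dense vector Y = W X + b with k < m."""
    def __init__(self, m: int, k: int):
        super().__init__()
        self.proj = nn.Linear(m, k)          # holds the W and b parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# toy usage: a 3-category discrete feature one-hot encoded,
# a 128-dimensional sparse feature embedded into 16 dimensions
f_d = one_hot_encode(category_index=1, num_categories=3)
embed = FeatureEmbedding(m=128, k=16)
F_y = embed(torch.rand(1, 128))
print(f_d.shape, F_y.shape)   # torch.Size([3]) torch.Size([1, 16])
```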
Step 2: training a teacher model: and (3) inputting the data which is preprocessed by the cloud center and is output by the step (1) into a teacher model deployed by the cloud center for training. The step 2 specifically comprises the following steps:
Step 2.1: gate cycle unit (GRU): since the user's interests are dynamically changing over time, we use the GRU to model the user's behavior sequence. The GRU performs better in modeling long sequences of user behavior than recurrent neural networks. The GRU model consists of an update gate and a reset gate. The update gate determines how much of the previous information in the user behavior sequence needs to be retained and passed on to the next layer. The reset gate determines how much of the previous information should be ignored. The GRU model can be expressed as follows:
zt=σ(Wz(Fd+Fy)+Nzht-1+bz)
rt=σ(Wr(Fd+Fy)+Nrht-1+br)
Wherein, z t,rt is a total number, H t denotes the update gate, reset gate, candidate hidden state vector, hidden state vector of the current time step, σ is the sigmoid activation function, W z,Wr,Wh and N z,Nr, nh are training parameters, b z,br,bh denotes the deviation, and the Hadamard product.
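As a non-limiting illustration of step 2.1, the sketch below feeds a batch of fused per-step features (F_d + F_y) through a GRU to obtain the hidden-state sequence [h_1, ..., h_t]; the framework (PyTorch), the tensor shapes and the hyper-parameters are assumed for illustration only.

```python
# Minimal sketch of step 2.1, assuming the fused input features (F_d + F_y)
# per time step are already available as a tensor; sizes are illustrative.
import torch
import torch.nn as nn

batch, seq_len, feat_dim, hidden_dim = 8, 20, 16, 32

# F_d + F_y for each of the last `seq_len` interactions of each user
behavior_seq = torch.rand(batch, seq_len, feat_dim)

gru = nn.GRU(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
# h_all: hidden state h_t for every time step -> Q = [h_1, ..., h_t]
# h_last: hidden state of the final time step
h_all, h_last = gru(behavior_seq)
print(h_all.shape)   # torch.Size([8, 20, 32])
```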
Step 2.2: multi-head self-attention mechanism: the user's requests are diverse and we use a multi-headed self-attention mechanism to analyze a long sequence of users in order to extract their primary interests from the sequence of user behaviors. Traditional attention mechanisms may be subject to errors in the extraction of user importance due to other noise information of the sequence when analyzing the user's important interests. The multi-head self-attention mechanism is to analyze and extract information from a target sequence for multiple times, integrate different output results, and increase the accuracy of positioning main interests of a user.
headi=Attention(QWi Q,KWi K,VWi V)
MultiHead(Q,K,V)=Concat(head1,head2,…,headh)WO
Where q= [ h 1,h2,…,ht ] represents the output sequence of LSTM, k=v represents the key and value of the output sequence, respectively, head i represents the i-th head in multi-head self-attention, concat (·) represents the concatenation operation, W i Q,Wi K,Wi V and W O represent training parameters of the transition matrix, multiHead t (Q, K, V) represents the output of the teacher model multi-head self-attention.
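The following sketch illustrates step 2.2 under the assumption that PyTorch's built-in multi-head attention layer is an acceptable stand-in for the described mechanism; embed_dim, num_heads and the tensor shapes are illustrative assumptions.

```python
# Sketch of step 2.2: multi-head self-attention over the GRU output sequence,
# with Q = K = V = [h_1, ..., h_t]; head count and dimensions are assumptions.
import torch
import torch.nn as nn

hidden_dim, num_heads = 32, 4
mha = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads,
                            batch_first=True)

h_all = torch.rand(8, 20, hidden_dim)           # GRU outputs from step 2.1
# self-attention: the projections W_i^Q, W_i^K, W_i^V and W^O live inside mha
attn_out, attn_weights = mha(query=h_all, key=h_all, value=h_all)
print(attn_out.shape)                           # torch.Size([8, 20, 32])
```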
Step 2.3: fully connected neural network: since the matrix of the multi-head self-attention output belongs to a high-dimensional sparse matrix, and in order to share training parameters of the teacher model with the student model, a fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model are of the same structure.
……
Wherein the method comprises the steps ofTraining parameters of j-th layer neural network of teacher model are expressed,/>Bias term representing j-th layer neural network of teacher model,/>And the output of the j-th layer neural network of the teacher model is represented.
Step 2.4: loss function setting: for the teacher model, the fit is made using a log-cross entropy loss function because the predicted user preference belongs to the classification problem.
Where y t is the output of the teacher model, σ represents the activation function, user preference is a classification problem, using the softmax activation function, N represents the data volume of the training set,Data representing a training set, f t (·) represents a teacher model.
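As an illustrative sketch of steps 2.3 and 2.4, the snippet below stacks fully-connected layers on the attention output and fits the teacher with a logarithmic (binary) cross-entropy loss; the layer widths, the mean pooling over the sequence and the optimizer are assumptions not fixed by this description.

```python
# Sketch of steps 2.3-2.4: fully-connected head on the attention output plus a
# log cross-entropy fit; all sizes, pooling and optimizer are assumptions.
import torch
import torch.nn as nn

class TeacherHead(nn.Module):
    """Fully-connected layers O_t^(j) followed by a preference score y_t in (0, 1)."""
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        pooled = attn_out.mean(dim=1)          # collapse the sequence dimension
        return torch.sigmoid(self.fc(pooled))  # y_t = sigma(f_t(x))

teacher = TeacherHead()
criterion = nn.BCELoss()                       # logarithmic cross-entropy L_t
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)

attn_out = torch.rand(8, 20, 32)               # from step 2.2
labels = torch.randint(0, 2, (8, 1)).float()   # did the user request the content?
optimizer.zero_grad()
loss = criterion(teacher(attn_out), labels)
loss.backward()
optimizer.step()
```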
S3: training a student model: and (3) inputting the data preprocessed by the edge base station output by the S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model. The step 3 specifically includes:
Step 3.1: teacher model parameter sharing: compared with the cloud center, the edge server has smaller data size and weaker computing power, and in order to accelerate training of the student model, the parameters of the fully-connected neural network layer of the teacher model are shared with the student model, so that the training speed of the student model can be accelerated, and the characteristics of the teacher model can be combined, so that the training effect of the student model is better.
……
Wherein the method comprises the steps ofTraining parameters representing j-th layer neural network of student model,/>Bias term representing j-th layer neural network of student model MultiHead s (Q, K, V) represents multi-head self-attention output of student model,/>Representing the output of the j-th layer neural network of the student model.
Step 3.2: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weaker computing power. Model distillation mainly shortens training time of student models through parameter sharing and loss distillation between teacher models and student models. The student model prediction model predicts the personalized preference of the group user, and therefore the loss function used is still a log-cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
Where y s represents the predicted output of the student model, f s (·) represents the student model, L s represents the log-cross entropy loss of the student model, and L t/s represents the loss distillation of the student model.
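The sketch below illustrates steps 3.1 and 3.2: the student's fully-connected layers are initialized from the teacher's, and the student is trained with its own cross-entropy loss plus a distillation term that treats the teacher's predictions as soft targets. The equal-structure heads, the mixing weight alpha and all names are assumptions rather than prescriptions of this description.

```python
# Sketch of steps 3.1-3.2: parameter sharing and loss distillation.
# The mixing weight alpha and the exact loss combination are assumptions.
import copy
import torch
import torch.nn as nn

def build_head(in_dim: int = 32) -> nn.Sequential:
    """Teacher and student use the same fully-connected structure (step 2.3)."""
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(),
                         nn.Linear(32, 1))

teacher_fc = build_head()
student_fc = build_head()
# step 3.1: parameter sharing - initialise the student layers from the teacher
student_fc.load_state_dict(copy.deepcopy(teacher_fc.state_dict()))

bce = nn.BCELoss()
alpha = 0.5                                     # assumed mixing weight

pooled = torch.rand(8, 32)                      # student-side features at the edge
labels = torch.randint(0, 2, (8, 1)).float()

y_s = torch.sigmoid(student_fc(pooled))         # student prediction
with torch.no_grad():
    y_t = torch.sigmoid(teacher_fc(pooled))     # teacher soft labels

loss_s = bce(y_s, labels)                       # L_s: student's own loss
loss_ts = bce(y_s, y_t)                         # L_t/s: loss distillation term
loss = alpha * loss_s + (1 - alpha) * loss_ts
loss.backward()
```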
Step 4: group caching policy: after the personalized preferences of the user are predicted according to the student model in the step 3, the fusion index of the group liveness and the group preferences is used as the basis of content caching and is sequenced, and then the content placement strategy beta can be solved:
wherein Ac u represents the activity of user U in the group user, U represents the group user covered by the base station, Representing the data quantity of the user u in the training set, arranging beta in a descending order, and selecting the content of Top-k for caching
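A minimal sketch of step 4 follows, assuming the activity Ac_u is computed from the per-user data quantities N_u and that the fused score is the activity-weighted sum of the student-predicted preferences; the function and variable names are illustrative only.

```python
# Sketch of step 4: fusing activity-weighted per-user preferences into a group
# score and caching the Top-k contents. The weighting shown is an assumption
# based on the definitions of Ac_u and N_u above.
from collections import defaultdict

def group_cache(user_prefs: dict, user_data_count: dict, k: int) -> list:
    """user_prefs[u][c]  : student-model preference of user u for content c
       user_data_count[u]: amount of training data N_u of user u (activity basis)
       returns the Top-k content ids to place at the edge server."""
    total = sum(user_data_count.values())
    activity = {u: n / total for u, n in user_data_count.items()}   # Ac_u

    beta = defaultdict(float)
    for u, prefs in user_prefs.items():
        for content, p in prefs.items():
            beta[content] += activity[u] * p        # fuse activity and preference

    ranked = sorted(beta, key=beta.get, reverse=True)
    return ranked[:k]

prefs = {"u1": {"v1": 0.9, "v2": 0.2}, "u2": {"v2": 0.8, "v3": 0.6}}
counts = {"u1": 300, "u2": 100}
print(group_cache(prefs, counts, k=2))   # ['v1', 'v2']
```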
Step 5: optimizing a cache result: and optimizing the hit rate of the cache according to the cache strategy.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (3)

1. An edge caching method based on model distillation is characterized in that: the method comprises the following steps:
S1: input data acquisition and pretreatment: the method comprises the steps that data continuous features and discrete features collected by a user side are encoded in different encoding modes in an edge server and a cloud center respectively;
S2: training a teacher model: inputting the data which is output in the step S1 and is preprocessed by the cloud center into a teacher model deployed by the cloud center for training;
s3: training a student model: inputting the data preprocessed by the edge base station output in the step S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model;
S4: group caching policy: according to personalized preferences of users, which are predicted by the student model in the step S3, combining the liveness of the group users to form interest preferences of the group users, and selecting the content of Top-k with the highest user preference according to the cache capacity of the edge server for caching;
S5: optimizing a cache result: optimizing the hit rate of the cache according to the cache strategy;
The step S2 specifically includes the following steps:
s21: using GRU to model the user behavior sequence, wherein the GRU model consists of an update gate and a reset gate, the update gate determines how much previous information in the user behavior sequence needs to be reserved and transferred to the next layer, and the reset gate determines how much previous information should be ignored; the GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where z_t, r_t, h̃_t and h_t denote the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step, σ is the sigmoid activation function, W_z, W_r, W_h and N_z, N_r, N_h are training parameters, b_z, b_r, b_h denote biases, and ⊙ denotes the Hadamard product;
S22: analyzing a long sequence of a user by using a multi-head self-attention mechanism, analyzing and extracting information for multiple times for a target sequence, integrating different output results, and increasing the accuracy of main interest positioning for the user:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O
where Q = [h_1, h_2, ..., h_t] denotes the output sequence of the GRU, K = V denote the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O denote training parameters of the transformation matrices, and MultiHead_t(Q, K, V) denotes the output of the teacher model's multi-head self-attention;
S23: A fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural networks of the teacher model and the student model have the same structure:
O_t^{(0)} = MultiHead_t(Q, K, V)
O_t^{(j)} = σ(W_t^{(j)} O_t^{(j-1)} + b_t^{(j)})
where W_t^{(j)} denotes the training parameters of the j-th layer of the teacher model's neural network, b_t^{(j)} denotes the bias term of the j-th layer of the teacher model, and O_t^{(j)} denotes the output of the j-th layer of the teacher model;
S24: Fitting is performed using a logarithmic cross-entropy loss function:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ŷ_i log y_t + (1 - ŷ_i) log(1 - y_t)]
where y_t is the output of the teacher model, σ denotes the activation function (the softmax activation is used, user preference being a classification problem), N denotes the data volume of the training set, x_i denotes a training-set sample with label ŷ_i, and f_t(·) denotes the teacher model;
The step S3 specifically includes:
S31: teacher model parameter sharing: sharing parameters of a fully connected neural network layer of the teacher model with the student model:
O_s^{(0)} = MultiHead_s(Q, K, V)
O_s^{(j)} = σ(W_s^{(j)} O_s^{(j-1)} + b_s^{(j)}), with W_s^{(j)} ← W_t^{(j)}
where W_s^{(j)} denotes the training parameters of the j-th layer of the student model's neural network (initialized from the shared teacher parameters), b_s^{(j)} denotes the bias term of the j-th layer of the student model, MultiHead_s(Q, K, V) denotes the multi-head self-attention output of the student model, and O_s^{(j)} denotes the output of the j-th layer of the student model;
S32: Model distillation: through parameter sharing and loss distillation between the teacher model and the student model, the loss distillation function of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N_s) Σ_{i=1}^{N_s} [ŷ_i log y_s + (1 - ŷ_i) log(1 - y_s)]
L_{t/s} = -(1/N_s) Σ_{i=1}^{N_s} [y_t log y_s + (1 - y_t) log(1 - y_s)]
where y_s denotes the predicted output of the student model, f_s(·) denotes the student model, N_s denotes the data volume of the edge training set, L_s denotes the logarithmic cross-entropy loss of the student model, and L_{t/s} denotes the loss distillation of the student model.
2. The model distillation based edge caching method according to claim 1, wherein: the step S1 specifically comprises the following steps:
S11: discrete features in the user feature E U and the content feature E C input at the input layer are encoded using one-hot encoding, assuming that any one of the one-hot encoded discrete features is denoted as f d:
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps of D represents a set of categories of discrete features f d, thereby encoding a one-dimensional discrete feature into a dimension vector containing only 0 and 1; the input layer after single thermal encoding is characterized by being F d:
Fd=f(EU,EC,TB)
Wherein f (·) represents the one-hot encoding of the discrete features;
S12: encoding the collected continuous features by using feature embedding to obtain low-dimensional dense embedded features Y= [ Y 1,y2,y3...,yk ]:
Wherein, Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>Representing the bias vector, thereby converting the high-dimensional sparse features into a low-dimensional dense vector; the characteristic of the input layer after embedded coding is marked as F y:
Fy=g(EU,EC,TC)
where g (-) represents the embedded encoding of the continuous feature.
3. The model distillation based edge caching method according to claim 1, wherein: the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under each base station's coverage according to model distillation;
S42: setting a caching criterion by combining the activity of the group users, sorting, and selecting the Top-k contents for caching.
CN202210286001.1A 2022-03-22 2022-03-22 Edge caching method based on model distillation Active CN114610500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Publications (2)

Publication Number Publication Date
CN114610500A CN114610500A (en) 2022-06-10
CN114610500B true CN114610500B (en) 2024-04-30

Family

ID=81865137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286001.1A Active CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Country Status (1)

Country Link
CN (1) CN114610500B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580329B2 (en) * 2018-09-18 2023-02-14 Microsoft Technology Licensing, Llc Machine-learning training service for synthetic data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Head Network Distillation : Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems";Yoshitomo Matsubara;《IEEE Access》;20201120;第8卷;第212177-212193页 *
"基于高精度深度学习的模型简化的研究";朱倩倩;《中国优秀硕士学位论文全文数据库 信息科技辑》;20220115(2022年第01期);第I140-533页 *
"推荐系统使能的边缘缓存策略研究";李富祥;《中国优秀硕士学位论文全文数据库 信息科技辑》;20231015(2023年第10期);第I138-388页 *

Also Published As

Publication number Publication date
CN114610500A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
Li et al. A deep learning method based on an attention mechanism for wireless network traffic prediction
Liu et al. A deep reinforcement learning approach to proactive content pushing and recommendation for mobile users
CN114143891A (en) FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network
CN113242469A (en) Self-adaptive video transmission configuration method and system
CN113115340B (en) Popularity prediction-based cache optimization method in cellular network
CN110210378A (en) A kind of embedded video method for analyzing image and device based on edge calculations
Hou et al. No-reference video quality evaluation by a deep transfer CNN architecture
CN107277159B (en) Ultra-dense network small station caching method based on machine learning
CN110944349A (en) Heterogeneous wireless network selection method based on intuitive fuzzy number and TOPSIS
Hao et al. Knowledge-centric proactive edge caching over mobile content distribution network
Wan et al. Deep Reinforcement Learning‐Based Collaborative Video Caching and Transcoding in Clustered and Intelligent Edge B5G Networks
CN115587266A (en) Air-space-ground integrated internet intelligent edge caching method
Nguyen et al. User-preference-based proactive caching in edge networks
CN114885388A (en) Multi-service type self-adaptive switching judgment method combined with RSS prediction
CN114610500B (en) Edge caching method based on model distillation
Lin et al. Meta-networking: Beyond the shannon limit with multi-faceted information
Wang et al. Automatic learning-based data optimization method for autonomous driving
Ma et al. APRank: Joint mobility and preference-based mobile video prefetching
Wang et al. Edge video analytics with adaptive information gathering: a deep reinforcement learning approach
Kim et al. HTTP adaptive streaming scheme based on reinforcement learning with edge computing assistance
CN117216255A (en) Classification model training method and related equipment
Wu et al. MANSY: Generalizing Neural Adaptive Immersive Video Streaming With Ensemble and Representation Learning
Zhang et al. Deep reinforcement learning based adaptive 360-degree video streaming with field of view joint prediction
CN116828052A (en) Intelligent data collaborative caching method based on edge calculation
Wang et al. Heterogeneous Edge Caching Based on Actor-Critic Learning With Attention Mechanism Aiding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant