CN114610500B - Edge caching method based on model distillation - Google Patents

Edge caching method based on model distillation

Info

Publication number
CN114610500B
CN114610500B
Authority
CN
China
Prior art keywords
model
student
output
user
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210286001.1A
Other languages
Chinese (zh)
Other versions
CN114610500A (en)
Inventor
吕翊
李富祥
李职杜
吴大鹏
钟艾玲
王汝言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210286001.1A priority Critical patent/CN114610500B/en
Publication of CN114610500A publication Critical patent/CN114610500A/en
Application granted granted Critical
Publication of CN114610500B publication Critical patent/CN114610500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an edge caching method based on model distillation, belonging to the field of wireless communication. User-side data collected by edge servers are first aggregated at the cloud center and preprocessed, and a teacher model is designed and trained to predict users' preferences for content. A student model is then deployed at the edge server and learns the local user preferences under the base-station coverage by sharing the parameters of the teacher model. Finally, according to the obtained user preferences and combined with the activity of the group users, a group caching strategy is formulated and the cache hit rate is optimized. The invention can save wireless-communication link resources, improve the resource utilization of the edge server, and improve users' quality of service.

Description

Edge caching method based on model distillation
Technical Field
The invention belongs to the field of wireless communication, and relates to an edge caching method based on model distillation.
Background
With the rapid development of communication technology, entertainment activities such as movies, live streaming and short videos have gradually become an integral part of people's daily lives, and mobile data traffic grows year by year. In the traditional approach, all resources are stored centrally in the cloud center, and content requested by the user side is distributed from the cloud center to users through edge base stations; however, in video-on-demand scenarios such as movies and short videos, this greatly increases network response delay and link congestion. Edge caching techniques have been developed to address these difficulties. Edge caching places part of the cloud center's content in advance on edge base stations, user equipment or capable vehicles, which reduces the traffic load on the backhaul link, lowers cost, shortens transmission delay, and improves the user experience.
Given the massive amount of content in the big-data era, selecting in advance the part of the cloud center's vast resources to cache at an edge base station is a major challenge for edge caching. Since the objects served by edge caching are users, caching according to the group characteristics of the users covered by an edge base station has become a research hotspot; however, mining the characteristics of group users is particularly difficult when the user group and user requests change constantly. The objects cached by the edge server are contents, and caching according to a series of content characteristics, such as update time, click-through rate and viewing duration, can relieve the load pressure on the cloud center to some extent. Some researchers therefore start from the content and focus on analyzing and mining the characteristics of content within a group in order to design caching strategies.
However, edge servers have less data and weaker computing power than the cloud center and cannot train on as much data or with models as complex as the cloud center can. On the other hand, user interests change constantly, and analyzing them at the cloud center takes considerable time; deploying a lightweight model at the edge base station makes it possible to rapidly analyze the personalized user preferences under the base station's coverage. Finally, how to merge personalized user preferences into group interest preferences is also a challenge.
Therefore, it is necessary to deploy a lightweight prediction model on the edge server to analyze and predict users' dynamic interests, derive group preferences from the personalized user preferences, optimize the cached content, and improve the quality of service of the edge server.
Disclosure of Invention
In view of the above, the present invention aims to provide an edge caching method based on model distillation. Aiming at the problem of a low cache hit rate caused by the weak computing power and small data volume of an edge server in the edge caching scenario of short videos, the method trains a teacher model at the cloud-center end, trains a student model at the edge-server end using a model distillation technique, and predicts user preferences, thereby improving the hit rate and quality of service of the edge server.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The edge caching method based on model distillation first aggregates the user-side data collected by the edge servers at the cloud center, preprocesses the data, and designs and trains a teacher model to predict users' preferences for content; a student model is then deployed at the edge server and learns the local user preferences under the base-station coverage by sharing the parameters of the teacher model; finally, according to the obtained user preferences and combined with the activity of the group users, a group caching strategy is formulated and the cache hit rate is optimized. The method specifically comprises the following steps:
S1: input data acquisition and pretreatment: the data collected by the user end are generally disordered, and after the data are cleaned, the data are mainly composed of two parts of characteristics, namely continuous characteristics and discrete characteristics, and the two types of characteristics are respectively encoded in an edge server and a cloud center by using different encoding modes, so that the training of a subsequent teacher model and a student model is facilitated;
S2: training a teacher model: inputting the data which is output in the step S1 and is preprocessed by the cloud center into a teacher model deployed by the cloud center for training;
s3: training a student model: inputting the data preprocessed by the edge base station output in the step S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model;
S4: group caching policy: according to personalized preferences of users, which are predicted by the student model in the step S3, combining the liveness of the group users to form interest preferences of the group users, and selecting the content of Top-k with the highest user preference according to the cache capacity of the edge server for caching;
s5: optimizing a cache result: and optimizing the hit rate of the cache according to the cache strategy.
Further, the step S1 specifically includes the following steps:
S11: the user features E U and the content features E C input from the input layer may contain a number of discrete features, such as the gender, occupation, equipment model, category of content of the user, including the personalized preferences T B of the predicted target user, etc., which may be encoded using one-hot encoding, assuming that any one of the one-hot encoded discrete features is denoted as f d.
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps ofD represents a set of categories of discrete features f d, thereby encoding a one-dimensional discrete feature into a dimension vector containing only 0 and 1; the input layer after single thermal encoding is characterized by being F d:
Fd=f(EU,EC,TB)
Wherein f (·) represents the one-hot encoding of the discrete features;
s12: for the collected continuous features, such as age, viewing duration, viewing integrity, user behavior sequence and the like, feature embedding is used for encoding, so that low-dimensional dense embedded features Y= [ Y 1,y2,y3...,yk ] are obtained;
Wherein, Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>Representing the bias vector, thereby converting the high-dimensional sparse features into a low-dimensional dense vector; the characteristic of the input layer after embedded coding is marked as F y:
Fy=g(EU,EC,TC)
where g (-) represents the embedded encoding of the continuous feature.
Further, the step S2 specifically includes the following steps:
S21: gating cycle unit GRU: as the user's interests are dynamically changing over time, the GRU is used to model the user's behavior sequence. GRUs perform better in modeling long sequences of user behavior than recurrent neural networks. The GRU model consists of an update gate and a reset gate, wherein the update gate determines how much previous information in a user behavior sequence needs to be reserved and transmitted to the next layer, and the reset gate determines how much previous information needs to be ignored; the GRU model is expressed as follows:
zt=σ(Wz(Fd+Fy)+Nzht-1+bz)
rt=σ(Wr(Fd+Fy)+Nrht-1+br)
Wherein, z t,rt is a total number, H t denotes update gate, reset gate, candidate hidden state vector, hidden state vector of current time step, σ is sigmoid activation function, W z,Wr,Wh and N z,Nr,Nh are training parameters, b z,br,bh denotes bias, and Σ denotes hadamard product;
S22: multi-head self-attention mechanism: the user's requests are diverse and in order to extract the user's primary interests from the user behavior sequence, a long sequence of users is analyzed using a multi-headed self-attention mechanism. Traditional attention mechanisms may be subject to errors in the extraction of user importance due to other noise information of the sequence when analyzing the user's important interests. The multi-head self-attention mechanism is to analyze and extract information from a target sequence for multiple times, integrate different output results, and increase the accuracy of main interest positioning of a user:
headi=Attention(QWi Q,KWi K,VWi V)
MultiHead(Q,K,V)=Concat(head1,head2,…,headh)WO
wherein q= [ h 1,h2,…,ht ] represents the output sequence of LSTM, k=v represents the key and value of the output sequence, respectively, head i represents the i-th head in multi-head self-attention, concat (·) represents the concatenation operation, W i Q,Wi K,Wi V and W O represent training parameters of the transition matrix, multiHead t (Q, K, V) represents the output of the teacher model multi-head self-attention;
S23: fully connected neural network: since the matrix of the multi-head self-attention output belongs to a high-dimensional sparse matrix, and in order to share training parameters of the teacher model with the student model, a fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model are of the same structure:
……
Wherein the method comprises the steps of Training parameters of j-th layer neural network of teacher model are expressed,/>Bias term representing j-th layer neural network of teacher model,/>Representing the output of the j-th layer neural network of the teacher model;
s24: loss function setting: for the teacher model, fitting is performed using a log-cross entropy loss function because the predicted user preference belongs to the classification problem:
Where y t is the output of the teacher model, σ represents the activation function, user preference is a classification problem, using the softmax activation function, N represents the data volume of the training set, Data representing a training set, f t (·) represents a teacher model.
Further, the step S3 specifically includes:
S31: teacher model parameter sharing: compared with the cloud center, the edge server has smaller data size and weaker computing power, and in order to accelerate training of the student model, parameters of the fully-connected neural network layer of the teacher model are shared with the student model, so that training speed of the student model can be accelerated, and the characteristics of the teacher model can be combined, so that training effect of the student model is better.
……
Wherein the method comprises the steps ofTraining parameters representing j-th layer neural network of student model,/>Bias term representing j-th layer neural network of student model MultiHead s (Q, K, V) represents multi-head self-attention output of student model,/>Representing the output of the j-th layer neural network of the student model;
S32: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weaker computing power. Model distillation mainly shortens training time of a student model through parameter sharing and loss distillation between a teacher model and the student model; the student model prediction model predicts the personalized preference of the group user, and therefore the loss function used is still a log-cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
Where y s represents the predicted output of the student model, f s (·) represents the student model, L s represents the log-cross entropy loss of the student model, and L t/s represents the loss distillation of the student model.
Further, the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under each base station's coverage according to model distillation;
S42: setting a caching criterion by combining the activity of the group users, sorting, and selecting the Top-k contents for caching.
The invention has the beneficial effects that: aiming at the problem of low cache hit rate caused by weak computing capacity and small data volume of an edge server in an edge cache scene, the invention provides the edge cache method based on model distillation, which is used for training a teacher model at a cloud center end, training a student model at the edge server end by using a model distillation technology, predicting user preference and improving the hit rate and service quality of the edge server.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a scene graph of the present invention;
FIG. 2 is a system flow diagram of the present invention;
FIG. 3 (a) is a diagram of the teacher model in model distillation according to the present invention, and FIG. 3 (b) is a diagram of the student model in model distillation according to the present invention;
FIG. 4 is a schematic diagram of a model distillation training scheme in accordance with the present invention.
Detailed Description
Other advantages and effects of the present invention will readily become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention may also be practiced or applied in other, different embodiments, and the details of this specification may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with one another provided there is no conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1 to fig. 4, the edge caching method based on model distillation according to the present invention specifically includes the following steps:
Step 1: Input data acquisition and preprocessing: the data collected at the user end are generally unordered; after cleaning, they consist mainly of two kinds of features, continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively, facilitating the subsequent training of the teacher model and the student model. Step 1 specifically includes the following steps:
Step 1.1: first, the user features E U and the content features E C input from the input layer may contain a number of discrete features, such as the gender, occupation, equipment model, category of content of the user, including the personalized preferences T B of the predicted target user, etc., which may be encoded using one-hot encoding. Assume that any one of the unithermally encoded discrete features is denoted as f d.
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps ofD represents a set of categories of discrete features f d. Thus, a one-dimensional discrete feature is encoded into a dimension vector is shown containing only 0 and 1D. The input layer after single thermal coding is marked as F d
Fd=f(EU,EC,TB)
Where f (·) represents the one-hot encoding of the discrete features.
Step 1.2: for the collected continuous features, such as age, viewing duration, viewing integrity, user behavior sequence, etc., feature embedding is used to encode the resulting low-dimensional dense embedded feature y= [ Y 1,y2,y3...,yk ].
Wherein,Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>The bias vectors are represented such that the high-dimensional sparse features are converted into low-dimensional dense vectors. The characteristic of the input layer after embedded coding is marked as F y
Fy=g(EU,EC,TC)
Where g (-) represents the embedded encoding of the continuous feature.
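By way of a non-limiting illustration of step 1, the following sketch shows how the two encodings f(·) and g(·) could be realized. The choice of PyTorch, the dimensions and all variable names are assumptions made for illustration only and are not prescribed by this description.

```python
# Illustrative sketch of step 1 (feature preprocessing); framework choice
# (PyTorch) and all names/dimensions are assumptions, not part of the patent.
import torch
import torch.nn as nn

# --- discrete features: one-hot encoding f(.) -------------------------------
def one_hot_encode(category_index: int, num_categories: int) -> torch.Tensor:
    """Encode one discrete feature value into a ||D||-dimensional 0/1 vector f_d."""
    return torch.nn.functional.one_hot(
        torch.tensor(category_index), num_classes=num_categories
    ).float()

# --- continuous/sparse features: embedding g(.) -----------------------------
class FeatureEmbedding(nn.Module):
    """Project an m-dimensional sparse feature vector X to a k-dimensional
    dense vector Y = W X + b with k < m."""
    def __init__(self, m: int, k: int):
        super().__init__()
        self.proj = nn.Linear(m, k)          # holds the W and b parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# toy usage: a 3-category discrete feature one-hot encoded,
# a 128-dimensional sparse feature embedded into 16 dimensions
f_d = one_hot_encode(category_index=1, num_categories=3)
embed = FeatureEmbedding(m=128, k=16)
F_y = embed(torch.rand(1, 128))
print(f_d.shape, F_y.shape)   # torch.Size([3]) torch.Size([1, 16])
```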
Step 2: training a teacher model: and (3) inputting the data which is preprocessed by the cloud center and is output by the step (1) into a teacher model deployed by the cloud center for training. The step 2 specifically comprises the following steps:
Step 2.1: gate cycle unit (GRU): since the user's interests are dynamically changing over time, we use the GRU to model the user's behavior sequence. The GRU performs better in modeling long sequences of user behavior than recurrent neural networks. The GRU model consists of an update gate and a reset gate. The update gate determines how much of the previous information in the user behavior sequence needs to be retained and passed on to the next layer. The reset gate determines how much of the previous information should be ignored. The GRU model can be expressed as follows:
zt=σ(Wz(Fd+Fy)+Nzht-1+bz)
rt=σ(Wr(Fd+Fy)+Nrht-1+br)
Wherein, z t,rt is a total number, H t denotes the update gate, reset gate, candidate hidden state vector, hidden state vector of the current time step, σ is the sigmoid activation function, W z,Wr,Wh and N z,Nr, nh are training parameters, b z,br,bh denotes the deviation, and the Hadamard product.
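As a non-limiting illustration of step 2.1, the sketch below feeds a batch of fused per-step features (F_d + F_y) through a GRU to obtain the hidden-state sequence [h_1, ..., h_t]; the framework (PyTorch), the tensor shapes and the hyper-parameters are assumed for illustration only.

```python
# Minimal sketch of step 2.1, assuming the fused input features (F_d + F_y)
# per time step are already available as a tensor; sizes are illustrative.
import torch
import torch.nn as nn

batch, seq_len, feat_dim, hidden_dim = 8, 20, 16, 32

# F_d + F_y for each of the last `seq_len` interactions of each user
behavior_seq = torch.rand(batch, seq_len, feat_dim)

gru = nn.GRU(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
# h_all: hidden state h_t for every time step -> Q = [h_1, ..., h_t]
# h_last: hidden state of the final time step
h_all, h_last = gru(behavior_seq)
print(h_all.shape)   # torch.Size([8, 20, 32])
```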
Step 2.2: multi-head self-attention mechanism: the user's requests are diverse and we use a multi-headed self-attention mechanism to analyze a long sequence of users in order to extract their primary interests from the sequence of user behaviors. Traditional attention mechanisms may be subject to errors in the extraction of user importance due to other noise information of the sequence when analyzing the user's important interests. The multi-head self-attention mechanism is to analyze and extract information from a target sequence for multiple times, integrate different output results, and increase the accuracy of positioning main interests of a user.
headi=Attention(QWi Q,KWi K,VWi V)
MultiHead(Q,K,V)=Concat(head1,head2,…,headh)WO
Where q= [ h 1,h2,…,ht ] represents the output sequence of LSTM, k=v represents the key and value of the output sequence, respectively, head i represents the i-th head in multi-head self-attention, concat (·) represents the concatenation operation, W i Q,Wi K,Wi V and W O represent training parameters of the transition matrix, multiHead t (Q, K, V) represents the output of the teacher model multi-head self-attention.
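The following sketch illustrates step 2.2 under the assumption that PyTorch's built-in multi-head attention layer is an acceptable stand-in for the described mechanism; embed_dim, num_heads and the tensor shapes are illustrative assumptions.

```python
# Sketch of step 2.2: multi-head self-attention over the GRU output sequence,
# with Q = K = V = [h_1, ..., h_t]; head count and dimensions are assumptions.
import torch
import torch.nn as nn

hidden_dim, num_heads = 32, 4
mha = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads,
                            batch_first=True)

h_all = torch.rand(8, 20, hidden_dim)           # GRU outputs from step 2.1
# self-attention: the projections W_i^Q, W_i^K, W_i^V and W^O live inside mha
attn_out, attn_weights = mha(query=h_all, key=h_all, value=h_all)
print(attn_out.shape)                           # torch.Size([8, 20, 32])
```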
Step 2.3: fully connected neural network: since the matrix of the multi-head self-attention output belongs to a high-dimensional sparse matrix, and in order to share training parameters of the teacher model with the student model, a fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model are of the same structure.
……
Wherein the method comprises the steps ofTraining parameters of j-th layer neural network of teacher model are expressed,/>Bias term representing j-th layer neural network of teacher model,/>And the output of the j-th layer neural network of the teacher model is represented.
Step 2.4: loss function setting: for the teacher model, the fit is made using a log-cross entropy loss function because the predicted user preference belongs to the classification problem.
Where y t is the output of the teacher model, σ represents the activation function, user preference is a classification problem, using the softmax activation function, N represents the data volume of the training set,Data representing a training set, f t (·) represents a teacher model.
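As an illustrative sketch of steps 2.3 and 2.4, the snippet below stacks fully-connected layers on the attention output and fits the teacher with a logarithmic (binary) cross-entropy loss; the layer widths, the mean pooling over the sequence and the optimizer are assumptions not fixed by this description.

```python
# Sketch of steps 2.3-2.4: fully-connected head on the attention output plus a
# log cross-entropy fit; all sizes, pooling and optimizer are assumptions.
import torch
import torch.nn as nn

class TeacherHead(nn.Module):
    """Fully-connected layers O_t^(j) followed by a preference score y_t in (0, 1)."""
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        pooled = attn_out.mean(dim=1)          # collapse the sequence dimension
        return torch.sigmoid(self.fc(pooled))  # y_t = sigma(f_t(x))

teacher = TeacherHead()
criterion = nn.BCELoss()                       # logarithmic cross-entropy L_t
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)

attn_out = torch.rand(8, 20, 32)               # from step 2.2
labels = torch.randint(0, 2, (8, 1)).float()   # did the user request the content?
optimizer.zero_grad()
loss = criterion(teacher(attn_out), labels)
loss.backward()
optimizer.step()
```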
S3: training a student model: and (3) inputting the data preprocessed by the edge base station output by the S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model. The step 3 specifically includes:
Step 3.1: teacher model parameter sharing: compared with the cloud center, the edge server has smaller data size and weaker computing power, and in order to accelerate training of the student model, the parameters of the fully-connected neural network layer of the teacher model are shared with the student model, so that the training speed of the student model can be accelerated, and the characteristics of the teacher model can be combined, so that the training effect of the student model is better.
……
Wherein the method comprises the steps ofTraining parameters representing j-th layer neural network of student model,/>Bias term representing j-th layer neural network of student model MultiHead s (Q, K, V) represents multi-head self-attention output of student model,/>Representing the output of the j-th layer neural network of the student model.
Step 3.2: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weaker computing power. Model distillation mainly shortens training time of student models through parameter sharing and loss distillation between teacher models and student models. The student model prediction model predicts the personalized preference of the group user, and therefore the loss function used is still a log-cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
Where y s represents the predicted output of the student model, f s (·) represents the student model, L s represents the log-cross entropy loss of the student model, and L t/s represents the loss distillation of the student model.
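The sketch below illustrates steps 3.1 and 3.2: the student's fully-connected layers are initialized from the teacher's, and the student is trained with its own cross-entropy loss plus a distillation term that treats the teacher's predictions as soft targets. The equal-structure heads, the mixing weight alpha and all names are assumptions rather than prescriptions of this description.

```python
# Sketch of steps 3.1-3.2: parameter sharing and loss distillation.
# The mixing weight alpha and the exact loss combination are assumptions.
import copy
import torch
import torch.nn as nn

def build_head(in_dim: int = 32) -> nn.Sequential:
    """Teacher and student use the same fully-connected structure (step 2.3)."""
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(),
                         nn.Linear(32, 1))

teacher_fc = build_head()
student_fc = build_head()
# step 3.1: parameter sharing - initialise the student layers from the teacher
student_fc.load_state_dict(copy.deepcopy(teacher_fc.state_dict()))

bce = nn.BCELoss()
alpha = 0.5                                     # assumed mixing weight

pooled = torch.rand(8, 32)                      # student-side features at the edge
labels = torch.randint(0, 2, (8, 1)).float()

y_s = torch.sigmoid(student_fc(pooled))         # student prediction
with torch.no_grad():
    y_t = torch.sigmoid(teacher_fc(pooled))     # teacher soft labels

loss_s = bce(y_s, labels)                       # L_s: student's own loss
loss_ts = bce(y_s, y_t)                         # L_t/s: loss distillation term
loss = alpha * loss_s + (1 - alpha) * loss_ts
loss.backward()
```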
Step 4: group caching policy: after the personalized preferences of the user are predicted according to the student model in the step 3, the fusion index of the group liveness and the group preferences is used as the basis of content caching and is sequenced, and then the content placement strategy beta can be solved:
wherein Ac u represents the activity of user U in the group user, U represents the group user covered by the base station, Representing the data quantity of the user u in the training set, arranging beta in a descending order, and selecting the content of Top-k for caching
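A minimal sketch of step 4 follows, assuming the activity Ac_u is computed from the per-user data quantities N_u and that the fused score is the activity-weighted sum of the student-predicted preferences; the function and variable names are illustrative only.

```python
# Sketch of step 4: fusing activity-weighted per-user preferences into a group
# score and caching the Top-k contents. The weighting shown is an assumption
# based on the definitions of Ac_u and N_u above.
from collections import defaultdict

def group_cache(user_prefs: dict, user_data_count: dict, k: int) -> list:
    """user_prefs[u][c]  : student-model preference of user u for content c
       user_data_count[u]: amount of training data N_u of user u (activity basis)
       returns the Top-k content ids to place at the edge server."""
    total = sum(user_data_count.values())
    activity = {u: n / total for u, n in user_data_count.items()}   # Ac_u

    beta = defaultdict(float)
    for u, prefs in user_prefs.items():
        for content, p in prefs.items():
            beta[content] += activity[u] * p        # fuse activity and preference

    ranked = sorted(beta, key=beta.get, reverse=True)
    return ranked[:k]

prefs = {"u1": {"v1": 0.9, "v2": 0.2}, "u2": {"v2": 0.8, "v3": 0.6}}
counts = {"u1": 300, "u2": 100}
print(group_cache(prefs, counts, k=2))   # ['v1', 'v2']
```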
Step 5: optimizing a cache result: and optimizing the hit rate of the cache according to the cache strategy.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (3)

1. An edge caching method based on model distillation is characterized in that: the method comprises the following steps:
S1: input data acquisition and pretreatment: the method comprises the steps that data continuous features and discrete features collected by a user side are encoded in different encoding modes in an edge server and a cloud center respectively;
S2: training a teacher model: inputting the data which is output in the step S1 and is preprocessed by the cloud center into a teacher model deployed by the cloud center for training;
s3: training a student model: inputting the data preprocessed by the edge base station output in the step S1 into a student model for training, combining a fully-connected network layer shared by a teacher model to accelerate training speed, and accelerating convergence of the student model by using distillation loss of the teacher model and the student model;
S4: group caching policy: according to personalized preferences of users, which are predicted by the student model in the step S3, combining the liveness of the group users to form interest preferences of the group users, and selecting the content of Top-k with the highest user preference according to the cache capacity of the edge server for caching;
S5: optimizing a cache result: optimizing the hit rate of the cache according to the cache strategy;
The step S2 specifically includes the following steps:
s21: using GRU to model the user behavior sequence, wherein the GRU model consists of an update gate and a reset gate, the update gate determines how much previous information in the user behavior sequence needs to be reserved and transferred to the next layer, and the reset gate determines how much previous information should be ignored; the GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where z_t, r_t, h̃_t and h_t denote the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step, σ is the sigmoid activation function, W_z, W_r, W_h and N_z, N_r, N_h are training parameters, b_z, b_r, b_h denote biases, and ⊙ denotes the Hadamard product;
S22: analyzing a long sequence of a user by using a multi-head self-attention mechanism, analyzing and extracting information for multiple times for a target sequence, integrating different output results, and increasing the accuracy of main interest positioning for the user:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O
where Q = [h_1, h_2, ..., h_t] denotes the output sequence of the GRU, K = V denote the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O denote training parameters of the transformation matrices, and MultiHead_t(Q, K, V) denotes the output of the teacher model's multi-head self-attention;
S23: A fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural networks of the teacher model and the student model have the same structure:
O_t^{(0)} = MultiHead_t(Q, K, V)
O_t^{(j)} = σ(W_t^{(j)} O_t^{(j-1)} + b_t^{(j)})
where W_t^{(j)} denotes the training parameters of the j-th layer of the teacher model's neural network, b_t^{(j)} denotes the bias term of the j-th layer of the teacher model, and O_t^{(j)} denotes the output of the j-th layer of the teacher model;
S24: Fitting is performed using a logarithmic cross-entropy loss function:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ŷ_i log y_t + (1 - ŷ_i) log(1 - y_t)]
where y_t is the output of the teacher model, σ denotes the activation function (the softmax activation is used, user preference being a classification problem), N denotes the data volume of the training set, x_i denotes a training-set sample with label ŷ_i, and f_t(·) denotes the teacher model;
The step S3 specifically includes:
S31: teacher model parameter sharing: sharing parameters of a fully connected neural network layer of the teacher model with the student model:
O_s^{(0)} = MultiHead_s(Q, K, V)
O_s^{(j)} = σ(W_s^{(j)} O_s^{(j-1)} + b_s^{(j)}), with W_s^{(j)} ← W_t^{(j)}
where W_s^{(j)} denotes the training parameters of the j-th layer of the student model's neural network (initialized from the shared teacher parameters), b_s^{(j)} denotes the bias term of the j-th layer of the student model, MultiHead_s(Q, K, V) denotes the multi-head self-attention output of the student model, and O_s^{(j)} denotes the output of the j-th layer of the student model;
S32: Model distillation: through parameter sharing and loss distillation between the teacher model and the student model, the loss distillation function of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N_s) Σ_{i=1}^{N_s} [ŷ_i log y_s + (1 - ŷ_i) log(1 - y_s)]
L_{t/s} = -(1/N_s) Σ_{i=1}^{N_s} [y_t log y_s + (1 - y_t) log(1 - y_s)]
where y_s denotes the predicted output of the student model, f_s(·) denotes the student model, N_s denotes the data volume of the edge training set, L_s denotes the logarithmic cross-entropy loss of the student model, and L_{t/s} denotes the loss distillation of the student model.
2. The model distillation based edge caching method according to claim 1, wherein: the step S1 specifically comprises the following steps:
S11: discrete features in the user feature E U and the content feature E C input at the input layer are encoded using one-hot encoding, assuming that any one of the one-hot encoded discrete features is denoted as f d:
fd=[d1,d2,...,di,d||D||]
Wherein the method comprises the steps of D represents a set of categories of discrete features f d, thereby encoding a one-dimensional discrete feature into a dimension vector containing only 0 and 1; the input layer after single thermal encoding is characterized by being F d:
Fd=f(EU,EC,TB)
Wherein f (·) represents the one-hot encoding of the discrete features;
S12: encoding the collected continuous features by using feature embedding to obtain low-dimensional dense embedded features Y= [ Y 1,y2,y3...,yk ]:
Wherein, Parameters representing an overmatrix,/>Represents sparse features of the input, k and m represent the size of the parameter matrix, and k < m,/>Representing the bias vector, thereby converting the high-dimensional sparse features into a low-dimensional dense vector; the characteristic of the input layer after embedded coding is marked as F y:
Fy=g(EU,EC,TC)
where g (-) represents the embedded encoding of the continuous feature.
3. The model distillation based edge caching method according to claim 1, wherein: the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under each base station's coverage according to model distillation;
S42: setting a caching criterion by combining the activity of the group users, sorting, and selecting the Top-k contents for caching.
CN202210286001.1A 2022-03-22 2022-03-22 Edge caching method based on model distillation Active CN114610500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Publications (2)

Publication Number Publication Date
CN114610500A CN114610500A (en) 2022-06-10
CN114610500B true CN114610500B (en) 2024-04-30

Family

ID=81865137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286001.1A Active CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Country Status (1)

Country Link
CN (1) CN114610500B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580329B2 (en) * 2018-09-18 2023-02-14 Microsoft Technology Licensing, Llc Machine-learning training service for synthetic data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Head Network Distillation : Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems";Yoshitomo Matsubara;《IEEE Access》;20201120;第8卷;第212177-212193页 *
"基于高精度深度学习的模型简化的研究";朱倩倩;《中国优秀硕士学位论文全文数据库 信息科技辑》;20220115(2022年第01期);第I140-533页 *
"推荐系统使能的边缘缓存策略研究";李富祥;《中国优秀硕士学位论文全文数据库 信息科技辑》;20231015(2023年第10期);第I138-388页 *

Also Published As

Publication number Publication date
CN114610500A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
Li et al. A deep learning method based on an attention mechanism for wireless network traffic prediction
Liu et al. A deep reinforcement learning approach to proactive content pushing and recommendation for mobile users
CN114143891A (en) FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network
CN113242469A (en) Self-adaptive video transmission configuration method and system
CN113115340B (en) Popularity prediction-based cache optimization method in cellular network
CN110210378A (en) A kind of embedded video method for analyzing image and device based on edge calculations
Hou et al. No-reference video quality evaluation by a deep transfer CNN architecture
CN107277159B (en) Ultra-dense network small station caching method based on machine learning
CN110944349A (en) Heterogeneous wireless network selection method based on intuitive fuzzy number and TOPSIS
Hao et al. Knowledge-centric proactive edge caching over mobile content distribution network
Wan et al. Deep Reinforcement Learning‐Based Collaborative Video Caching and Transcoding in Clustered and Intelligent Edge B5G Networks
CN115587266A (en) Air-space-ground integrated internet intelligent edge caching method
Nguyen et al. User-preference-based proactive caching in edge networks
CN114885388A (en) Multi-service type self-adaptive switching judgment method combined with RSS prediction
CN114610500B (en) Edge caching method based on model distillation
Lin et al. Meta-networking: Beyond the shannon limit with multi-faceted information
Wang et al. Automatic learning-based data optimization method for autonomous driving
Ma et al. APRank: Joint mobility and preference-based mobile video prefetching
Wang et al. Edge video analytics with adaptive information gathering: a deep reinforcement learning approach
Kim et al. HTTP adaptive streaming scheme based on reinforcement learning with edge computing assistance
CN117216255A (en) Classification model training method and related equipment
Wu et al. MANSY: Generalizing Neural Adaptive Immersive Video Streaming With Ensemble and Representation Learning
Zhang et al. Deep reinforcement learning based adaptive 360-degree video streaming with field of view joint prediction
CN116828052A (en) Intelligent data collaborative caching method based on edge calculation
Wang et al. Heterogeneous Edge Caching Based on Actor-Critic Learning With Attention Mechanism Aiding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant