CN114610500A - Edge caching method based on model distillation - Google Patents

Edge caching method based on model distillation

Info

Publication number
CN114610500A
Authority
CN
China
Prior art keywords
model
user
representing
training
distillation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210286001.1A
Other languages
Chinese (zh)
Other versions
CN114610500B (en)
Inventor
吕翊
李富祥
李职杜
吴大鹏
钟艾玲
王汝言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202210286001.1A
Publication of CN114610500A
Application granted
Publication of CN114610500B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention relates to an edge caching method based on model distillation and belongs to the field of wireless communication. User-side data collected by edge servers are first aggregated at a cloud center and preprocessed, and a teacher model is designed there to train on and predict users' preferences for content. A student model is then deployed at the edge server and learns the local user preferences under the coverage of the base station by sharing the parameters of the teacher model. Finally, a group caching strategy is formulated from the obtained user preferences combined with the activity of the group users, and the cache hit rate is optimized. The invention saves wireless link resources, improves the resource utilization of the edge server and improves the quality of service for users.

Description

Edge caching method based on model distillation
Technical Field
The invention belongs to the field of wireless communication, and relates to an edge caching method based on model distillation.
Background
With the rapid development of communication technology, entertainment such as movies, live streaming and short videos has gradually become an indispensable part of daily life, and mobile data traffic is therefore increasing year by year. In the traditional approach, all resources are stored centrally in a cloud center, and content requested by a user is delivered from the cloud center to the user through an edge base station; in on-demand scenarios such as movies and short videos this greatly increases network response delay and link congestion. Edge caching techniques were developed to address these difficulties. Edge caching places part of the cloud center's content in advance on edge base stations, user equipment or vehicles with storage capacity, which reduces the traffic load and cost of the backhaul link while also reducing transmission delay and improving the user experience.
Given the massive volume of content, screening out a subset of content from the cloud center's resources in advance and caching it at an edge base station is a major challenge for edge caching. Because the objects served by edge caching are users, caching according to the group characteristics of the users under the coverage of an edge base station has become a research hotspot. However, mining the characteristics of group users is very difficult when the user groups and their requests change constantly. The objects cached by the edge server are contents, and caching according to a series of content characteristics, such as update time, click-through rate and viewing duration, can relieve the load on the cloud center to a certain extent. Many researchers have therefore focused on analyzing and mining the characteristics of the content consumed within a group and designing caching strategies from the content side.
However, compared with the cloud center, the edge server has a smaller data volume and weaker computing capability and cannot train on large amounts of data or train complex models as the cloud center can. On the other hand, user interests change constantly; analyzing them at the cloud center takes considerable time, whereas deploying a lightweight model at the edge base station allows the personalized user preferences under the coverage of the base station to be analyzed quickly. Finally, how to merge personalized user preferences into the interest preferences of a group is also a challenge.
Therefore, it is necessary to deploy lightweight prediction models at the edge server to analyze and predict users' dynamic interests, derive the group preference from the individual user preferences, optimize the cached content and improve the quality of service of the edge server.
Disclosure of Invention
In view of this, the present invention addresses the low cache hit rate caused by the weak computing capability and small data volume of edge servers in short-video edge caching scenarios. It provides an edge caching method based on model distillation that trains a teacher model at the cloud center, trains a student model at the edge server using model distillation, predicts user preferences, and thereby improves the hit rate and quality of service of the edge server.
In order to achieve the purpose, the invention provides the following technical scheme:
An edge caching method based on model distillation: first, the user-side data collected by the edge servers are aggregated at a cloud center and preprocessed, and a teacher model is designed there to train on and predict users' preferences for content; then a student model is deployed at the edge server and learns the local user preferences under the coverage of the base station by sharing the parameters of the teacher model; finally, a group caching strategy is formulated from the obtained user preferences combined with the activity of the group users, and the cache hit rate is optimized. The method specifically comprises the following steps:
S1: input data acquisition and preprocessing: the data collected at the user side are unordered; after cleaning, they mainly consist of two kinds of features, namely continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively to facilitate the subsequent training of the teacher model and the student model;
S2: training a teacher model: inputting the data output by step S1 after preprocessing at the cloud center into the teacher model deployed at the cloud center for training;
S3: training a student model: inputting the data output by step S1 after preprocessing at the edge base station into the student model for training, using the fully connected network layers shared by the teacher model to speed up training, and using the distillation loss between the teacher model and the student model to accelerate the convergence of the student model;
S4: group caching strategy: according to the personalized user preferences predicted by the student model in step S3, fusing them into the interest preferences of the group users by combining the activity of the group users, and selecting the Top-k contents with the highest user preference for caching according to the cache capacity of the edge server;
S5: optimizing the caching result: optimizing the cache hit rate according to the caching strategy.
Further, the step S1 specifically includes the following steps:
S11: the user features E_U and content features E_C fed to the input layer contain many discrete features, such as the user's gender, occupation and device model and the category of the content, as well as the prediction target T_B for the user's personalized preferences; these discrete features are encoded using one-hot encoding. Let any one-hot encoded discrete feature be denoted f_d:
f_d = [d_1, d_2, ..., d_i, ..., d_||D||]
d_i = 1 if the feature takes its i-th category, and d_i = 0 otherwise
wherein D represents the category set of the discrete feature f_d, so that a one-dimensional discrete feature is encoded into a vector of dimension ||D|| containing only 0 and 1; the one-hot encoded features of the input layer are denoted F_d:
F_d = f(E_U, E_C, T_B)
wherein f(·) represents the one-hot encoding of the discrete features;
S12: the collected continuous features, such as age, viewing duration, viewing completeness and the user behavior sequence, are encoded by feature embedding to obtain the low-dimensional dense embedded feature Y = [y_1, y_2, y_3, ..., y_k]:
Y = W x + b
wherein W ∈ R^(k×m) represents the parameter matrix, x ∈ R^m represents the input sparse features, k and m represent the dimensions of the parameter matrix with k < m, and b ∈ R^k represents the bias vector, so that the high-dimensional sparse features are converted into low-dimensional dense vectors; the embedded features of the input layer are denoted F_y:
F_y = g(E_U, E_C, T_C)
where g(·) represents the embedded encoding of the continuous features.
Further, the step S2 specifically includes the following steps:
S21: gated recurrent unit (GRU): since the user's interests change dynamically over time, a GRU is used to model the user's behavior sequence. The GRU performs better than a vanilla recurrent neural network when modeling long user behavior sequences. The GRU consists of an update gate and a reset gate: the update gate determines how much previous information in the user behavior sequence is retained and passed to the next layer, and the reset gate determines how much previous information is ignored. The GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product;
S22: multi-head self-attention mechanism: user requests are diverse, so a multi-head self-attention mechanism is used to analyze the user's long behavior sequence and extract the user's main interests from it. When analyzing a user's important interests, a traditional attention mechanism may make errors because of other noisy information in the sequence. The multi-head self-attention mechanism analyzes the target sequence and extracts information multiple times and then integrates the different outputs, which increases the accuracy of locating the user's main interests:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model;
S23: fully connected neural network: because the multi-head self-attention output is a high-dimensional sparse matrix, and in order to share the teacher model's training parameters with the student model, a fully connected neural network layer is added after the multi-head self-attention output; the fully connected neural network of the teacher model and that of the student model have the same structure:
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
o_t^(2) = φ(W_t^(2) · o_t^(1) + b_t^(2))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network;
S24: setting the loss function: for the teacher model, predicting user preferences is a classification problem, so a logarithmic cross-entropy loss function is used for fitting:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
Further, the step S3 specifically includes:
S31: sharing the parameters of the teacher model: compared with the cloud center, the edge server has less data and weaker computing power. To accelerate the training of the student model, the parameters of the teacher model's fully connected neural network layers are shared with the student model; this speeds up training and, by incorporating what the teacher model has learned, improves the training effect of the student model.
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) denotes the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network;
S32: model distillation: compared with the complex teacher model, the lightweight student model is better suited to deployment at an edge base station with weak computing power. Model distillation shortens the training time of the student model mainly through parameter sharing and loss distillation between the teacher model and the student model. The student model still predicts the individual preferences of the group users, so its loss function remains the logarithmic cross-entropy loss. The distillation loss of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
Further, the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under the coverage of each base station according to model distillation;
S42: establishing a caching criterion according to the activity of the group users, ranking the contents, and selecting the Top-k contents for caching.
The invention has the following beneficial effects: aiming at the low cache hit rate caused by the weak computing capability and small data volume of edge servers in edge caching scenarios, the invention provides an edge caching method based on model distillation, which saves wireless link resources, improves the resource utilization of the edge server and improves the quality of service for users.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of a scenario of the present invention;
FIG. 2 is a system flow diagram of the present invention;
FIG. 3 is (a) a diagram showing a teacher model in model distillation according to the present invention and (b) a diagram showing a student model in model distillation according to the present invention;
FIG. 4 is a flow chart of the model distillation training of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only to illustrate the invention and are not intended to limit it. For a better explanation of the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left", "right", "front" and "rear", are based on the orientation or positional relationship shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and are not to be construed as limiting the present invention; their specific meaning can be understood by those skilled in the art according to the specific situation.
Referring to fig. 1 to 4, the edge caching method based on model distillation according to the present invention specifically includes the following steps:
Step 1: input data acquisition and preprocessing: the data collected at the user side are unordered; after cleaning, they mainly consist of two kinds of features, namely continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively to facilitate the subsequent training of the teacher model and the student model. Step 1 specifically includes the following steps:
Step 1.1: the user features E_U and content features E_C fed to the input layer contain many discrete features, such as the user's gender, occupation and device model and the category of the content, as well as the prediction target T_B for the user's personalized preferences; these discrete features are encoded using one-hot encoding. Let any one-hot encoded discrete feature be denoted f_d:
f_d = [d_1, d_2, ..., d_i, ..., d_||D||]
d_i = 1 if the feature takes its i-th category, and d_i = 0 otherwise
wherein D represents the category set of the discrete feature f_d. Thus, a one-dimensional discrete feature is encoded into a vector of dimension ||D|| containing only 0 and 1. The one-hot encoded features of the input layer are denoted F_d:
F_d = f(E_U, E_C, T_B)
where f(·) represents the one-hot encoding of the discrete features.
Step 1.2: for collected continuous features such as age, viewing duration, viewing integrity, user behavior sequence and the like, feature embedding is used for encoding, and obtained low-dimensional dense embedded features Y are [ Y ═ Y1,y2,y3...,yk]。
Figure BDA0003558262880000073
Wherein the content of the first and second substances,
Figure BDA0003558262880000074
the parameters of the over-matrix are represented,
Figure BDA0003558262880000075
representing the sparse features of the input, k and m representing the size of the parameter matrix, and k < m,
Figure BDA0003558262880000076
the bias vectors are represented such that the high-dimensional sparse features are converted into low-dimensional dense vectors. The characteristic of the input layer after embedded coding is marked as Fy
Fy=g(EU,EC,TC)
Where g (-) represents the embedded encoding of the continuous feature.
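As an illustration of step 1, the following Python sketch shows one possible realization of the one-hot encoding F_d and the embedding F_y; the feature dimensions, batch size and tensor contents are assumptions made for the sketch and are not taken from the description above.

# Sketch of step 1 (assumed dimensions): one-hot encoding of a discrete feature
# and a learned low-dimensional embedding of high-dimensional sparse features.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Discrete feature (e.g. content category): one-hot encoding, F_d = f(E_U, E_C, T_B).
num_categories = 8                              # ||D||, illustrative
category_ids = torch.tensor([2, 5, 0])          # raw category index per sample
F_d = F.one_hot(category_ids, num_classes=num_categories).float()   # (batch, ||D||)

# Continuous / sparse features (age, viewing duration, ...): embedding, F_y = g(E_U, E_C, T_C).
m, k = 64, 16                                   # sparse dimension m, dense dimension k, with k < m
x_sparse = torch.rand(3, m)                     # illustrative sparse input
embedding = nn.Linear(m, k)                     # Y = W x + b, with W in R^(k x m) and b in R^k
F_y = embedding(x_sparse)                       # (batch, k) dense embedded features

print(F_d.shape, F_y.shape)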
Step 2: training a teacher model: and (3) inputting the data output by the step (1) after the cloud center preprocessing into a teacher model deployed in the cloud center for training. The step 2 specifically comprises the following steps:
Step 2.1: gated recurrent unit (GRU): since a user's interests change dynamically over time, we use a GRU to model the user's behavior sequence. The GRU performs better than a vanilla recurrent neural network when modeling long user behavior sequences. The GRU consists of an update gate and a reset gate. The update gate determines how much of the previous information in the user behavior sequence is retained and passed to the next layer. The reset gate determines how much of the previous information is ignored. The GRU model can be expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product.
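A minimal PyTorch sketch of step 2.1, using the library's built-in GRU in place of the gate equations above; the batch size, sequence length and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

batch, seq_len, feat_dim, hidden_dim = 32, 20, 24, 64    # illustrative sizes

# Per-step input corresponding to F_d + F_y in the notation above.
behaviour_seq = torch.rand(batch, seq_len, feat_dim)

gru = nn.GRU(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
h_seq, h_last = gru(behaviour_seq)       # h_seq = [h_1, ..., h_t], h_last = h_t
print(h_seq.shape)                       # (32, 20, 64); h_seq feeds the attention layer as Q = K = V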
Step 2.2: the multi-head self-attention mechanism: the requests of users are various, and in order to extract the main interest of the users from the user behavior sequences, a multi-head self-attention mechanism is used for analyzing the long sequences of the users. The traditional attention mechanism may generate errors in the user importance extraction due to other noise information of the sequence when analyzing the user important interest. The multi-head self-attention mechanism is to analyze the target sequence for multiple times and extract information, and then integrate different output results, so as to increase the accuracy of the main interest positioning of the user.
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model.
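Step 2.2 can be sketched with PyTorch's multi-head attention module as below; the number of heads and the dimensions are assumptions, and the GRU output sequence plays the role of Q, K and V.

import torch
import torch.nn as nn

hidden_dim, num_heads = 64, 4            # illustrative sizes
attention = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)

h_seq = torch.rand(32, 20, hidden_dim)   # GRU output sequence [h_1, ..., h_t]
# Self-attention over the behaviour sequence: Q = K = V = h_seq.
attn_out, attn_weights = attention(h_seq, h_seq, h_seq)
print(attn_out.shape)                    # (32, 20, 64), fed to the fully connected layers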
Step 2.3: fully connecting the neural networks: because the matrix of the multi-head self-attention output belongs to the high-dimensional sparse matrix and the training parameters of the teacher model and the student model are shared, the fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model have the same structure.
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network.
Step 2.4: setting a loss function: for the teacher model, the fitting is done using a logarithmic cross entropy loss function, since the predicted user preferences belong to the classification problem.
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
Step 3: training the student model: the data output by step 1 after preprocessing at the edge base station are input into the student model for training; the fully connected network layers shared by the teacher model are used to speed up training, and the distillation loss between the teacher model and the student model accelerates the convergence of the student model. Step 3 specifically includes:
Step 3.1: sharing the parameters of the teacher model: compared with the cloud center, the edge server has a smaller data volume and weaker computing capability. To accelerate the training of the student model, the parameters of the teacher model's fully connected neural network layers are shared with the student model; this speeds up training and, by incorporating what the teacher model has learned, improves the training effect of the student model.
W_s^(j) = W_t^(j),  b_s^(j) = b_t^(j)
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) represents the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network.
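A sketch of the parameter sharing in step 3.1: because the teacher's and student's fully connected networks have the same structure, the teacher's weights can simply be copied into the student before edge-side training. The helper make_head and its sizes are assumptions for illustration.

import torch
import torch.nn as nn

def make_head(in_dim=64, hidden=32):
    # Same fully connected structure for teacher and student, as required above.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )

teacher_head = make_head()    # trained at the cloud center
student_head = make_head()    # lightweight model deployed at the edge server

# Share the teacher's fully connected parameters with the student:
# W_s^(j) <- W_t^(j), b_s^(j) <- b_t^(j).
student_head.load_state_dict(teacher_head.state_dict())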
Step 3.2: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weak computing power. Model distillation shortens the training time of the student model mainly through parameter sharing and loss distillation between the teacher model and the student model. The individual preference of the group users is still predicted by the student model prediction model, so the loss function used is still logarithmic cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
Step 4: group caching strategy: after the individual preferences of the users are obtained from the student model prediction in step 3, a fusion index combining the group activity and the group preferences is used as the basis for content caching and is ranked, from which the content placement strategy β is solved:
Ac_u = N_u / Σ_{v∈U} N_v
β = Σ_{u∈U} Ac_u · y_s^u
wherein Ac_u represents the activity of user u in the group, U represents the group of users under the coverage of the base station, N_u represents the data volume of user u in the training set, and y_s^u represents the preferences predicted by the student model for user u; β is sorted in descending order and the Top-k contents are selected for caching.
Step 5: optimizing the caching result: the cache hit rate is optimized according to the caching strategy.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. An edge caching method based on model distillation is characterized in that: the method comprises the following steps:
S1: input data acquisition and preprocessing: the continuous features and discrete features of the data collected at the user side are encoded with different encoding schemes at the edge server and the cloud center respectively;
S2: training a teacher model: inputting the data output by step S1 after preprocessing at the cloud center into the teacher model deployed at the cloud center for training;
S3: training a student model: inputting the data output by step S1 after preprocessing at the edge base station into the student model for training, using the fully connected network layers shared by the teacher model to speed up training, and using the distillation loss between the teacher model and the student model to accelerate the convergence of the student model;
S4: group caching strategy: according to the personalized user preferences predicted by the student model in step S3, fusing them into the interest preferences of the group users by combining the activity of the group users, and selecting the Top-k contents with the highest user preference for caching according to the cache capacity of the edge server;
S5: optimizing the caching result: optimizing the cache hit rate according to the caching strategy.
2. The model distillation based edge caching method according to claim 1, wherein: the step S1 specifically includes the following steps:
s11: user features E entered into the input layer using one-hot encodingUAnd content characteristics ECThe discrete features in (1) are encoded, and any one discrete feature subjected to one-hot encoding is assumed to be denoted as fd
fd=[d1,d2,...,di,d||D||]
Figure FDA0003558262870000011
Wherein
Figure FDA0003558262870000012
D represents a discrete feature fdSo as to encode a one-dimensional discrete feature into a vector of dimension | D | containing only 0 and 1; the characteristic of the input layer after single-hot coding is marked as Fd
Fd=f(EU,EC,TB)
Wherein f (-) represents a one-hot encoding of discrete features;
S12: the collected continuous features are encoded using feature embedding to obtain the low-dimensional dense embedded feature Y = [y_1, y_2, y_3, ..., y_k]:
Y = W x + b
wherein W ∈ R^(k×m) represents the parameter matrix, x ∈ R^m represents the input sparse features, k and m represent the dimensions of the parameter matrix with k < m, and b ∈ R^k represents the bias vector, so that the high-dimensional sparse features are converted into low-dimensional dense vectors; the embedded features of the input layer are denoted F_y:
F_y = g(E_U, E_C, T_C)
where g(·) represents the embedded encoding of the continuous features.
3. The model distillation based edge caching method according to claim 1, wherein: the step S2 specifically includes the following steps:
S21: a GRU is used to model the user's behavior sequence; the GRU model consists of an update gate and a reset gate, where the update gate determines how much previous information in the user behavior sequence is retained and passed to the next layer, and the reset gate determines how much previous information is ignored; the GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product;
S22: a multi-head self-attention mechanism is used to analyze the user's long behavior sequence: the target sequence is analyzed and its information extracted multiple times, and the different outputs are integrated, which increases the accuracy of locating the user's main interests:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model;
S23: a fully connected neural network layer is added after the multi-head self-attention output, and the fully connected neural network of the teacher model and that of the student model have the same structure:
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
o_t^(2) = φ(W_t^(2) · o_t^(1) + b_t^(2))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network;
S24: fitting is performed using a logarithmic cross-entropy loss function:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
4. The model distillation based edge caching method according to claim 1, wherein: the step S3 specifically includes:
S31: sharing the parameters of the teacher model: the parameters of the teacher model's fully connected neural network layers are shared with the student model:
W_s^(j) = W_t^(j),  b_s^(j) = b_t^(j)
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) represents the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network;
S32: model distillation: through parameter sharing and loss distillation between the teacher model and the student model, the distillation loss of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
5. The model distillation based edge caching method according to claim 1, wherein: the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under the coverage of each base station according to model distillation;
S42: establishing a caching criterion according to the activity of the group users, ranking the contents, and selecting the Top-k contents for caching.
CN202210286001.1A 2022-03-22 2022-03-22 Edge caching method based on model distillation Active CN114610500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Publications (2)

Publication Number Publication Date
CN114610500A true CN114610500A (en) 2022-06-10
CN114610500B CN114610500B (en) 2024-04-30

Family

ID=81865137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286001.1A Active CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Country Status (1)

Country Link
CN (1) CN114610500B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090000A1 (en) * 2018-09-18 2020-03-19 Microsoft Technology Licensing, Llc Progress Portal for Synthetic Data Tasks
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AIWALKER-HAPPY: "IAKD | Adding interactive wings to your knowledge distillation", retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/158186335> *
YOSHITOMO MATSUBARA: "Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems", IEEE ACCESS, vol. 8, 20 November 2020 (2020-11-20), pages 212177-212193, XP011824945, DOI: 10.1109/ACCESS.2020.3039714 *
朱倩倩: "Research on model simplification based on high-precision deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 2022, 15 January 2022 (2022-01-15), pages 140-533 *
李富祥: "Research on edge caching strategies enabled by recommender systems", China Master's Theses Full-text Database, Information Science and Technology, no. 2023, 15 October 2023 (2023-10-15), pages 138-388 *

Also Published As

Publication number Publication date
CN114610500B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
Li et al. A deep learning method based on an attention mechanism for wireless network traffic prediction
CN113242469B (en) Self-adaptive video transmission configuration method and system
CN111339433A (en) Information recommendation method and device based on artificial intelligence and electronic equipment
CN111611488B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
US20230353828A1 (en) Model-based data processing method and apparatus
CN112598438A (en) Outdoor advertisement recommendation system and method based on large-scale user portrait
CN113297936B (en) Volleyball group behavior identification method based on local graph convolution network
CN111563770A (en) Click rate estimation method based on feature differentiation learning
Hou et al. No-reference video quality evaluation by a deep transfer CNN architecture
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN107547898A (en) A kind of controllable two-parameter distribution system of sensor of energy consumption precision
Hao et al. Knowledge-centric proactive edge caching over mobile content distribution network
CN114039871B (en) Method, system, device and medium for cellular traffic prediction
Saiyad et al. Exploring determinants of feeder mode choice behavior using Artificial Neural Network: Evidences from Delhi metro
CN112817563A (en) Target attribute configuration information determination method, computer device, and storage medium
CN114610500B (en) Edge caching method based on model distillation
CN113822954A (en) Deep learning image coding method for man-machine cooperation scene under resource constraint
CN113159371B (en) Unknown target feature modeling and demand prediction method based on cross-modal data fusion
CN117216255A (en) Classification model training method and related equipment
CN115577797B (en) Federal learning optimization method and system based on local noise perception
CN114490447A (en) Intelligent caching method for multitask optimization
Rahman et al. Dynamic error-bounded lossy compression to reduce the bandwidth requirement for real-time vision-based pedestrian safety applications
Wang et al. Automatic learning-based data optimization method for autonomous driving
CN115587266A (en) Air-space-ground integrated internet intelligent edge caching method
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant