CN114610500A - Edge caching method based on model distillation - Google Patents

Edge caching method based on model distillation

Info

Publication number
CN114610500A
Authority
CN
China
Prior art keywords
model
user
representing
training
distillation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210286001.1A
Other languages
Chinese (zh)
Other versions
CN114610500B (en)
Inventor
吕翊
李富祥
李职杜
吴大鹏
钟艾玲
王汝言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202210286001.1A
Publication of CN114610500A
Application granted
Publication of CN114610500B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention relates to an edge caching method based on model distillation and belongs to the field of wireless communication. User-side data collected by edge servers are first aggregated at a cloud center and preprocessed, and a teacher model is designed there to train on and predict users' preferences for content. A student model is then deployed at the edge server and learns the local user preferences under the coverage of the base station by sharing the parameters of the teacher model. Finally, a group caching strategy is formulated from the obtained user preferences combined with the activity of the group users, and the cache hit rate is optimized. The invention saves wireless link resources, improves the resource utilization of the edge server and improves the quality of service for users.

Description

Edge caching method based on model distillation
Technical Field
The invention belongs to the field of wireless communication, and relates to an edge caching method based on model distillation.
Background
With the rapid development of communication technology, entertainment such as movies, live streaming and short videos has gradually become an indispensable part of daily life, and mobile data traffic is therefore increasing year by year. In the traditional approach, all resources are stored centrally in a cloud center, and content requested by a user is delivered from the cloud center to the user through an edge base station; in on-demand scenarios such as movies and short videos this greatly increases network response delay and link congestion. Edge caching techniques were developed to address these difficulties. Edge caching places part of the cloud center's content in advance on edge base stations, user equipment or vehicles with storage capacity, which reduces the traffic load and cost of the backhaul link while also reducing transmission delay and improving the user experience.
Given the massive volume of content, screening out a subset of content from the cloud center's resources in advance and caching it at an edge base station is a major challenge for edge caching. Because the objects served by edge caching are users, caching according to the group characteristics of the users under the coverage of an edge base station has become a research hotspot. However, mining the characteristics of group users is very difficult when the user groups and their requests change constantly. The objects cached by the edge server are contents, and caching according to a series of content characteristics, such as update time, click-through rate and viewing duration, can relieve the load on the cloud center to a certain extent. Many researchers have therefore focused on analyzing and mining the characteristics of the content consumed within a group and designing caching strategies from the content side.
However, compared with the cloud center, the edge server has a smaller data volume and weaker computing capability and cannot train on large amounts of data or train complex models as the cloud center can. On the other hand, user interests change constantly; analyzing them at the cloud center takes considerable time, whereas deploying a lightweight model at the edge base station allows the personalized user preferences under the coverage of the base station to be analyzed quickly. Finally, how to merge personalized user preferences into the interest preferences of a group is also a challenge.
Therefore, it is necessary to deploy lightweight prediction models at the edge server to analyze and predict users' dynamic interests, derive the group preference from the individual user preferences, optimize the cached content and improve the quality of service of the edge server.
Disclosure of Invention
In view of this, the present invention addresses the low cache hit rate caused by the weak computing capability and small data volume of edge servers in short-video edge caching scenarios. It provides an edge caching method based on model distillation that trains a teacher model at the cloud center, trains a student model at the edge server using model distillation, predicts user preferences, and thereby improves the hit rate and quality of service of the edge server.
In order to achieve the purpose, the invention provides the following technical scheme:
An edge caching method based on model distillation: first, the user-side data collected by the edge servers are aggregated at a cloud center and preprocessed, and a teacher model is designed there to train on and predict users' preferences for content; then a student model is deployed at the edge server and learns the local user preferences under the coverage of the base station by sharing the parameters of the teacher model; finally, a group caching strategy is formulated from the obtained user preferences combined with the activity of the group users, and the cache hit rate is optimized. The method specifically comprises the following steps:
S1: input data acquisition and preprocessing: the data collected at the user side are unordered; after cleaning, they mainly consist of two kinds of features, namely continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively to facilitate the subsequent training of the teacher model and the student model;
S2: training a teacher model: inputting the data output by step S1 after preprocessing at the cloud center into the teacher model deployed at the cloud center for training;
S3: training a student model: inputting the data output by step S1 after preprocessing at the edge base station into the student model for training, using the fully connected network layers shared by the teacher model to speed up training, and using the distillation loss between the teacher model and the student model to accelerate the convergence of the student model;
S4: group caching strategy: according to the personalized user preferences predicted by the student model in step S3, fusing them into the interest preferences of the group users by combining the activity of the group users, and selecting the Top-k contents with the highest user preference for caching according to the cache capacity of the edge server;
S5: optimizing the caching result: optimizing the cache hit rate according to the caching strategy.
Further, the step S1 specifically includes the following steps:
S11: the user features E_U and content features E_C fed to the input layer contain many discrete features, such as the user's gender, occupation and device model and the category of the content, as well as the prediction target T_B for the user's personalized preferences; these discrete features are encoded using one-hot encoding. Let any one-hot encoded discrete feature be denoted f_d:
f_d = [d_1, d_2, ..., d_i, ..., d_||D||]
d_i = 1 if the feature takes its i-th category, and d_i = 0 otherwise
wherein D represents the category set of the discrete feature f_d, so that a one-dimensional discrete feature is encoded into a vector of dimension ||D|| containing only 0 and 1; the one-hot encoded features of the input layer are denoted F_d:
F_d = f(E_U, E_C, T_B)
wherein f(·) represents the one-hot encoding of the discrete features;
S12: the collected continuous features, such as age, viewing duration, viewing completeness and the user behavior sequence, are encoded by feature embedding to obtain the low-dimensional dense embedded feature Y = [y_1, y_2, y_3, ..., y_k]:
Y = W x + b
wherein W ∈ R^(k×m) represents the parameter matrix, x ∈ R^m represents the input sparse features, k and m represent the dimensions of the parameter matrix with k < m, and b ∈ R^k represents the bias vector, so that the high-dimensional sparse features are converted into low-dimensional dense vectors; the embedded features of the input layer are denoted F_y:
F_y = g(E_U, E_C, T_C)
where g(·) represents the embedded encoding of the continuous features.
Further, the step S2 specifically includes the following steps:
S21: gated recurrent unit (GRU): since the user's interests change dynamically over time, a GRU is used to model the user's behavior sequence. The GRU performs better than a vanilla recurrent neural network when modeling long user behavior sequences. The GRU consists of an update gate and a reset gate: the update gate determines how much previous information in the user behavior sequence is retained and passed to the next layer, and the reset gate determines how much previous information is ignored. The GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product;
S22: multi-head self-attention mechanism: user requests are diverse, so a multi-head self-attention mechanism is used to analyze the user's long behavior sequence and extract the user's main interests from it. When analyzing a user's important interests, a traditional attention mechanism may make errors because of other noisy information in the sequence. The multi-head self-attention mechanism analyzes the target sequence and extracts information multiple times and then integrates the different outputs, which increases the accuracy of locating the user's main interests:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model;
S23: fully connected neural network: because the multi-head self-attention output is a high-dimensional sparse matrix, and in order to share the teacher model's training parameters with the student model, a fully connected neural network layer is added after the multi-head self-attention output; the fully connected neural network of the teacher model and that of the student model have the same structure:
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
o_t^(2) = φ(W_t^(2) · o_t^(1) + b_t^(2))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network;
S24: setting the loss function: for the teacher model, predicting user preferences is a classification problem, so a logarithmic cross-entropy loss function is used for fitting:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
Further, the step S3 specifically includes:
S31: sharing the parameters of the teacher model: compared with the cloud center, the edge server has less data and weaker computing power. To accelerate the training of the student model, the parameters of the teacher model's fully connected neural network layers are shared with the student model; this speeds up training and, by incorporating what the teacher model has learned, improves the training effect of the student model.
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) denotes the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network;
S32: model distillation: compared with the complex teacher model, the lightweight student model is better suited to deployment at an edge base station with weak computing power. Model distillation shortens the training time of the student model mainly through parameter sharing and loss distillation between the teacher model and the student model. The student model still predicts the individual preferences of the group users, so its loss function remains the logarithmic cross-entropy loss. The distillation loss of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
Further, the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under the coverage of each base station according to model distillation;
S42: establishing a caching criterion according to the activity of the group users, ranking the contents, and selecting the Top-k contents for caching.
The invention has the following beneficial effects: aiming at the low cache hit rate caused by the weak computing capability and small data volume of edge servers in edge caching scenarios, the invention provides an edge caching method based on model distillation, which saves wireless link resources, improves the resource utilization of the edge server and improves the quality of service for users.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of a scenario of the present invention;
FIG. 2 is a system flow diagram of the present invention;
FIG. 3 is (a) a diagram showing a teacher model in model distillation according to the present invention and (b) a diagram showing a student model in model distillation according to the present invention;
FIG. 4 is a flow chart of the model distillation training of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only to illustrate the invention and are not intended to limit it. For a better explanation of the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left", "right", "front" and "rear", are based on the orientation or positional relationship shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and are not to be construed as limiting the present invention; their specific meaning can be understood by those skilled in the art according to the specific situation.
Referring to fig. 1 to 4, the edge caching method based on model distillation according to the present invention specifically includes the following steps:
Step 1: input data acquisition and preprocessing: the data collected at the user side are unordered; after cleaning, they mainly consist of two kinds of features, namely continuous features and discrete features, which are encoded with different encoding schemes at the edge server and the cloud center respectively to facilitate the subsequent training of the teacher model and the student model. Step 1 specifically includes the following steps:
Step 1.1: the user features E_U and content features E_C fed to the input layer contain many discrete features, such as the user's gender, occupation and device model and the category of the content, as well as the prediction target T_B for the user's personalized preferences; these discrete features are encoded using one-hot encoding. Let any one-hot encoded discrete feature be denoted f_d:
f_d = [d_1, d_2, ..., d_i, ..., d_||D||]
d_i = 1 if the feature takes its i-th category, and d_i = 0 otherwise
wherein D represents the category set of the discrete feature f_d. Thus, a one-dimensional discrete feature is encoded into a vector of dimension ||D|| containing only 0 and 1. The one-hot encoded features of the input layer are denoted F_d:
F_d = f(E_U, E_C, T_B)
where f(·) represents the one-hot encoding of the discrete features.
Step 1.2: for collected continuous features such as age, viewing duration, viewing integrity, user behavior sequence and the like, feature embedding is used for encoding, and obtained low-dimensional dense embedded features Y are [ Y ═ Y1,y2,y3...,yk]。
Figure BDA0003558262880000073
Wherein the content of the first and second substances,
Figure BDA0003558262880000074
the parameters of the over-matrix are represented,
Figure BDA0003558262880000075
representing the sparse features of the input, k and m representing the size of the parameter matrix, and k < m,
Figure BDA0003558262880000076
the bias vectors are represented such that the high-dimensional sparse features are converted into low-dimensional dense vectors. The characteristic of the input layer after embedded coding is marked as Fy
Fy=g(EU,EC,TC)
Where g (-) represents the embedded encoding of the continuous feature.
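As an illustration of step 1, the following Python sketch shows one possible realization of the one-hot encoding F_d and the embedding F_y; the feature dimensions, batch size and tensor contents are assumptions made for the sketch and are not taken from the description above.

# Sketch of step 1 (assumed dimensions): one-hot encoding of a discrete feature
# and a learned low-dimensional embedding of high-dimensional sparse features.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Discrete feature (e.g. content category): one-hot encoding, F_d = f(E_U, E_C, T_B).
num_categories = 8                              # ||D||, illustrative
category_ids = torch.tensor([2, 5, 0])          # raw category index per sample
F_d = F.one_hot(category_ids, num_classes=num_categories).float()   # (batch, ||D||)

# Continuous / sparse features (age, viewing duration, ...): embedding, F_y = g(E_U, E_C, T_C).
m, k = 64, 16                                   # sparse dimension m, dense dimension k, with k < m
x_sparse = torch.rand(3, m)                     # illustrative sparse input
embedding = nn.Linear(m, k)                     # Y = W x + b, with W in R^(k x m) and b in R^k
F_y = embedding(x_sparse)                       # (batch, k) dense embedded features

print(F_d.shape, F_y.shape)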
Step 2: training a teacher model: and (3) inputting the data output by the step (1) after the cloud center preprocessing into a teacher model deployed in the cloud center for training. The step 2 specifically comprises the following steps:
Step 2.1: gated recurrent unit (GRU): since a user's interests change dynamically over time, we use a GRU to model the user's behavior sequence. The GRU performs better than a vanilla recurrent neural network when modeling long user behavior sequences. The GRU consists of an update gate and a reset gate. The update gate determines how much of the previous information in the user behavior sequence is retained and passed to the next layer. The reset gate determines how much of the previous information is ignored. The GRU model can be expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product.
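A minimal PyTorch sketch of step 2.1, using the library's built-in GRU in place of the gate equations above; the batch size, sequence length and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

batch, seq_len, feat_dim, hidden_dim = 32, 20, 24, 64    # illustrative sizes

# Per-step input corresponding to F_d + F_y in the notation above.
behaviour_seq = torch.rand(batch, seq_len, feat_dim)

gru = nn.GRU(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
h_seq, h_last = gru(behaviour_seq)       # h_seq = [h_1, ..., h_t], h_last = h_t
print(h_seq.shape)                       # (32, 20, 64); h_seq feeds the attention layer as Q = K = V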
Step 2.2: the multi-head self-attention mechanism: the requests of users are various, and in order to extract the main interest of the users from the user behavior sequences, a multi-head self-attention mechanism is used for analyzing the long sequences of the users. The traditional attention mechanism may generate errors in the user importance extraction due to other noise information of the sequence when analyzing the user important interest. The multi-head self-attention mechanism is to analyze the target sequence for multiple times and extract information, and then integrate different output results, so as to increase the accuracy of the main interest positioning of the user.
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model.
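Step 2.2 can be sketched with PyTorch's multi-head attention module as below; the number of heads and the dimensions are assumptions, and the GRU output sequence plays the role of Q, K and V.

import torch
import torch.nn as nn

hidden_dim, num_heads = 64, 4            # illustrative sizes
attention = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)

h_seq = torch.rand(32, 20, hidden_dim)   # GRU output sequence [h_1, ..., h_t]
# Self-attention over the behaviour sequence: Q = K = V = h_seq.
attn_out, attn_weights = attention(h_seq, h_seq, h_seq)
print(attn_out.shape)                    # (32, 20, 64), fed to the fully connected layers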
Step 2.3: fully connecting the neural networks: because the matrix of the multi-head self-attention output belongs to the high-dimensional sparse matrix and the training parameters of the teacher model and the student model are shared, the fully-connected neural network layer is added after the multi-head self-attention output, and the fully-connected neural network of the teacher model and the fully-connected neural network of the student model have the same structure.
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network.
Step 2.4: setting a loss function: for the teacher model, the fitting is done using a logarithmic cross entropy loss function, since the predicted user preferences belong to the classification problem.
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
Step 3: training the student model: the data output by step 1 after preprocessing at the edge base station are input into the student model for training; the fully connected network layers shared by the teacher model are used to speed up training, and the distillation loss between the teacher model and the student model accelerates the convergence of the student model. Step 3 specifically includes:
Step 3.1: sharing the parameters of the teacher model: compared with the cloud center, the edge server has a smaller data volume and weaker computing capability. To accelerate the training of the student model, the parameters of the teacher model's fully connected neural network layers are shared with the student model; this speeds up training and, by incorporating what the teacher model has learned, improves the training effect of the student model.
W_s^(j) = W_t^(j),  b_s^(j) = b_t^(j)
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) represents the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network.
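A sketch of the parameter sharing in step 3.1: because the teacher's and student's fully connected networks have the same structure, the teacher's weights can simply be copied into the student before edge-side training. The helper make_head and its sizes are assumptions for illustration.

import torch
import torch.nn as nn

def make_head(in_dim=64, hidden=32):
    # Same fully connected structure for teacher and student, as required above.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )

teacher_head = make_head()    # trained at the cloud center
student_head = make_head()    # lightweight model deployed at the edge server

# Share the teacher's fully connected parameters with the student:
# W_s^(j) <- W_t^(j), b_s^(j) <- b_t^(j).
student_head.load_state_dict(teacher_head.state_dict())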
Step 3.2: model distillation: compared with a complex teacher model, the lightweight student model is more suitable for being deployed at an edge base station with weak computing power. Model distillation shortens the training time of the student model mainly through parameter sharing and loss distillation between the teacher model and the student model. The individual preference of the group users is still predicted by the student model prediction model, so the loss function used is still logarithmic cross entropy loss. Model distillation the loss distillation function in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
Step 4: group caching strategy: after the individual preferences of the users are obtained from the student model prediction in step 3, a fusion index combining the group activity and the group preferences is used as the basis for content caching and is ranked, from which the content placement strategy β is solved:
Ac_u = N_u / Σ_{v∈U} N_v
β = Σ_{u∈U} Ac_u · y_s^u
wherein Ac_u represents the activity of user u in the group, U represents the group of users under the coverage of the base station, N_u represents the data volume of user u in the training set, and y_s^u represents the preferences predicted by the student model for user u; β is sorted in descending order and the Top-k contents are selected for caching.
Step 5: optimizing the caching result: the cache hit rate is optimized according to the caching strategy.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. An edge caching method based on model distillation is characterized in that: the method comprises the following steps:
S1: input data acquisition and preprocessing: the continuous features and discrete features of the data collected at the user side are encoded with different encoding schemes at the edge server and the cloud center respectively;
S2: training a teacher model: inputting the data output by step S1 after preprocessing at the cloud center into the teacher model deployed at the cloud center for training;
S3: training a student model: inputting the data output by step S1 after preprocessing at the edge base station into the student model for training, using the fully connected network layers shared by the teacher model to speed up training, and using the distillation loss between the teacher model and the student model to accelerate the convergence of the student model;
S4: group caching strategy: according to the personalized user preferences predicted by the student model in step S3, fusing them into the interest preferences of the group users by combining the activity of the group users, and selecting the Top-k contents with the highest user preference for caching according to the cache capacity of the edge server;
S5: optimizing the caching result: optimizing the cache hit rate according to the caching strategy.
2. The model distillation based edge caching method according to claim 1, wherein: the step S1 specifically includes the following steps:
s11: user features E entered into the input layer using one-hot encodingUAnd content characteristics ECThe discrete features in (1) are encoded, and any one discrete feature subjected to one-hot encoding is assumed to be denoted as fd
fd=[d1,d2,...,di,d||D||]
Figure FDA0003558262870000011
Wherein
Figure FDA0003558262870000012
D represents a discrete feature fdSo as to encode a one-dimensional discrete feature into a vector of dimension | D | containing only 0 and 1; the characteristic of the input layer after single-hot coding is marked as Fd
Fd=f(EU,EC,TB)
Wherein f (-) represents a one-hot encoding of discrete features;
S12: the collected continuous features are encoded using feature embedding to obtain the low-dimensional dense embedded feature Y = [y_1, y_2, y_3, ..., y_k]:
Y = W x + b
wherein W ∈ R^(k×m) represents the parameter matrix, x ∈ R^m represents the input sparse features, k and m represent the dimensions of the parameter matrix with k < m, and b ∈ R^k represents the bias vector, so that the high-dimensional sparse features are converted into low-dimensional dense vectors; the embedded features of the input layer are denoted F_y:
F_y = g(E_U, E_C, T_C)
where g(·) represents the embedded encoding of the continuous features.
3. The model distillation based edge caching method according to claim 1, wherein: the step S2 specifically includes the following steps:
S21: a GRU is used to model the user's behavior sequence; the GRU model consists of an update gate and a reset gate, where the update gate determines how much previous information in the user behavior sequence is retained and passed to the next layer, and the reset gate determines how much previous information is ignored; the GRU model is expressed as follows:
z_t = σ(W_z(F_d + F_y) + N_z h_{t-1} + b_z)
r_t = σ(W_r(F_d + F_y) + N_r h_{t-1} + b_r)
h̃_t = tanh(W_h(F_d + F_y) + N_h(r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein z_t, r_t, h̃_t and h_t respectively represent the update gate, the reset gate, the candidate hidden state vector and the hidden state vector of the current time step; σ is the sigmoid activation function; W_z, W_r, W_h and N_z, N_r, N_h are training parameters; b_z, b_r, b_h represent biases; and ⊙ denotes the Hadamard product;
S22: a multi-head self-attention mechanism is used to analyze the user's long behavior sequence: the target sequence is analyzed and its information extracted multiple times, and the different outputs are integrated, which increases the accuracy of locating the user's main interests:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead_t(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
wherein Q = [h_1, h_2, …, h_t] denotes the output sequence of the GRU, K = V denotes the keys and values of the output sequence, head_i denotes the i-th head of the multi-head self-attention, Concat(·) denotes the concatenation operation, W_i^Q, W_i^K, W_i^V and W^O are training parameter matrices, and MultiHead_t(Q, K, V) represents the multi-head self-attention output of the teacher model;
S23: a fully connected neural network layer is added after the multi-head self-attention output, and the fully connected neural network of the teacher model and that of the student model have the same structure:
o_t^(1) = φ(W_t^(1) · MultiHead_t(Q, K, V) + b_t^(1))
o_t^(2) = φ(W_t^(2) · o_t^(1) + b_t^(2))
……
o_t^(j) = φ(W_t^(j) · o_t^(j-1) + b_t^(j))
wherein φ(·) denotes the activation function of the fully connected layers, W_t^(j) represents the training parameters of the j-th layer of the teacher model's neural network, b_t^(j) represents the bias term of the j-th layer of the teacher model's neural network, and o_t^(j) represents the output of the j-th layer of the teacher model's neural network;
S24: fitting is performed using a logarithmic cross-entropy loss function:
y_t = σ(f_t(x_i))
L_t = -(1/N) Σ_{i=1}^{N} [ y_i · log y_t + (1 - y_i) · log(1 - y_t) ]
wherein y_t is the output of the teacher model, σ represents the activation function (the softmax activation is used because predicting user preferences is a classification problem), N represents the number of samples in the training set, x_i represents the training-set data, and f_t(·) represents the teacher model.
4. The model distillation based edge caching method according to claim 1, wherein: the step S3 specifically includes:
S31: sharing the parameters of the teacher model: the parameters of the teacher model's fully connected neural network layers are shared with the student model:
W_s^(j) = W_t^(j),  b_s^(j) = b_t^(j)
o_s^(1) = φ(W_s^(1) · MultiHead_s(Q, K, V) + b_s^(1))
……
o_s^(j) = φ(W_s^(j) · o_s^(j-1) + b_s^(j))
wherein W_s^(j) represents the training parameters of the j-th layer of the student model's neural network, b_s^(j) represents the bias term of the j-th layer of the student model's neural network, MultiHead_s(Q, K, V) represents the multi-head self-attention output of the student model, and o_s^(j) represents the output of the j-th layer of the student model's neural network;
S32: model distillation: through parameter sharing and loss distillation between the teacher model and the student model, the distillation loss of model distillation in the student model is as follows:
y_s = σ(f_s(x_i))
L_s = -(1/N) Σ_{i=1}^{N} [ y_i · log y_s + (1 - y_i) · log(1 - y_s) ]
L_{t/s} = -(1/N) Σ_{i=1}^{N} [ y_t · log y_s + (1 - y_t) · log(1 - y_s) ]
wherein y_s represents the predicted output of the student model, f_s(·) represents the student model, L_s represents the logarithmic cross-entropy loss of the student model, and L_{t/s} represents the distillation loss of the student model.
5. The model distillation based edge caching method according to claim 1, wherein: the step S4 specifically includes:
S41: obtaining the user preferences output by the student model under the coverage of each base station according to model distillation;
S42: establishing a caching criterion according to the activity of the group users, ranking the contents, and selecting the Top-k contents for caching.
CN202210286001.1A 2022-03-22 2022-03-22 Edge caching method based on model distillation Active CN114610500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210286001.1A CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Publications (2)

Publication Number Publication Date
CN114610500A true CN114610500A (en) 2022-06-10
CN114610500B CN114610500B (en) 2024-04-30

Family

ID=81865137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286001.1A Active CN114610500B (en) 2022-03-22 2022-03-22 Edge caching method based on model distillation

Country Status (1)

Country Link
CN (1) CN114610500B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090000A1 (en) * 2018-09-18 2020-03-19 Microsoft Technology Licensing, Llc Progress Portal for Synthetic Data Tasks
WO2022022274A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Model training method and apparatus
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113850362A (en) * 2021-08-20 2021-12-28 华为技术有限公司 Model distillation method and related equipment
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113988263A (en) * 2021-10-29 2022-01-28 内蒙古大学 Knowledge distillation-based space-time prediction method in industrial Internet of things edge equipment
CN114490447A (en) * 2022-01-24 2022-05-13 重庆邮电大学 Intelligent caching method for multitask optimization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AIWALKER-HAPPY: "IAKD | Adding interactive wings to your knowledge distillation", retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/158186335> *
YOSHITOMO MATSUBARA: "Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems", IEEE ACCESS, vol. 8, 20 November 2020 (2020-11-20), pages 212177-212193, XP011824945, DOI: 10.1109/ACCESS.2020.3039714 *
朱倩倩: "Research on model simplification based on high-precision deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 2022, 15 January 2022 (2022-01-15), pages 140-533 *
李富祥: "Research on edge caching strategies enabled by recommender systems", China Master's Theses Full-text Database, Information Science and Technology, no. 2023, 15 October 2023 (2023-10-15), pages 138-388 *

Also Published As

Publication number Publication date
CN114610500B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
Li et al. A deep learning method based on an attention mechanism for wireless network traffic prediction
CN113242469B (en) Self-adaptive video transmission configuration method and system
CN111339433A (en) Information recommendation method and device based on artificial intelligence and electronic equipment
CN111611488B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
US20230353828A1 (en) Model-based data processing method and apparatus
CN112598438A (en) Outdoor advertisement recommendation system and method based on large-scale user portrait
CN113297936B (en) Volleyball group behavior identification method based on local graph convolution network
CN111563770A (en) Click rate estimation method based on feature differentiation learning
Hou et al. No-reference video quality evaluation by a deep transfer CNN architecture
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN107547898A (en) A kind of controllable two-parameter distribution system of sensor of energy consumption precision
Hao et al. Knowledge-centric proactive edge caching over mobile content distribution network
CN114039871B (en) Method, system, device and medium for cellular traffic prediction
Saiyad et al. Exploring determinants of feeder mode choice behavior using Artificial Neural Network: Evidences from Delhi metro
CN112817563A (en) Target attribute configuration information determination method, computer device, and storage medium
CN114610500B (en) Edge caching method based on model distillation
CN113822954A (en) Deep learning image coding method for man-machine cooperation scene under resource constraint
CN113159371B (en) Unknown target feature modeling and demand prediction method based on cross-modal data fusion
CN117216255A (en) Classification model training method and related equipment
CN115577797B (en) Federal learning optimization method and system based on local noise perception
CN114490447A (en) Intelligent caching method for multitask optimization
Rahman et al. Dynamic error-bounded lossy compression to reduce the bandwidth requirement for real-time vision-based pedestrian safety applications
Wang et al. Automatic learning-based data optimization method for autonomous driving
CN115587266A (en) Air-space-ground integrated internet intelligent edge caching method
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant