WO2021253835A1

WO2021253835A1 - Heterogeneous network cache decision-making method based on user preference prediction

Info

Publication number: WO2021253835A1
Application number: PCT/CN2021/074167
Authority: WO
Inventors: 朱琦; 单冠捷
Original assignee: 南京邮电大学
Priority date: 2020-06-17
Filing date: 2021-01-28
Publication date: 2021-12-23
Also published as: CN111860595A

Abstract

The present invention discloses a heterogeneous network cache decision-making method based on user preference prediction. In the method, a macro base station, small base stations and D2D communication coexist, and the influence of user mobility and social relationships is considered. First, when user preferences are unknown, a machine learning method predicts them from the users' request history. Then the average system cost is calculated from the users' mobility, physical position relationships and social relationships, and an optimization problem that minimizes the average system cost is constructed, with the cache strategies of the small base stations and the important users as variables and cache capacity as the constraint; the cache decision is made by solving this problem. The optimization problem is solved as the minimization of a supermodular function over a partition matroid, which greatly reduces the computational complexity of the cache decision while guaranteeing the performance of the suboptimal solution, so that caching at small base stations and important users greatly reduces the system cost.

Description

A heterogeneous network cache decision-making method based on user preference prediction

Technical field

The invention belongs to wireless communication technology, and specifically relates to a heterogeneous network cache decision-making method based on user preference prediction.

Background art

With the development of the mobile Internet, the rapid growth in wireless mobile devices has generated a large amount of data traffic, which poses challenges for mobile communications. Caching popular files locally is one solution to these challenges. 5G heterogeneous networks deploy small base stations to offload traffic from macro base stations, but the backhaul links of the small base stations have become a bottleneck for system performance. Caching technology stores popular files in advance at some users and small base stations. When users need these files, they can obtain them from small base stations or through D2D communication without occupying the backhaul link of the small base station or the bandwidth of the macro base station. This avoids network congestion during peak traffic periods and also reduces delay, thereby improving QoS.

However, because of cache cost limits, the capacity of the cache devices deployed at small base stations is limited, and the storage capacity of mobile devices is smaller still; both are far smaller than the Internet content library. Making correct cache decisions about which files to place in the cache is therefore very important for improving the cache hit rate.

Summary of the invention

Objective of the invention: In order to overcome the deficiencies in the prior art, the present invention provides a heterogeneous network cache decision-making method based on user preference prediction.

In order to achieve the above objectives, the technical solutions provided by the present invention are as follows:

A heterogeneous network cache decision-making method based on user preference prediction. In the method, macro base station, small base station, and D2D communication modes coexist, and the method includes the following steps:

(S1) First, when the probability distribution of the user requesting different files is unknown, use machine learning to predict user preferences based on their request history;

(S2) Derive the expression of the average system cost based on the users' mobility, physical location relationships and social relationships. Under the cache capacity constraint, construct an optimization problem that minimizes the average system cost, with the cache strategies of the small base stations and important users as variables; the cache decision is made by solving this problem;

(S3) Solve the optimization problem of minimizing the average system cost with a suboptimal algorithm based on the greedy algorithm, and determine the files to be cached according to the solution vector.

Further, the processing steps of the method of the present invention are specifically as follows:

(1) Let S={1,...,S}, U={1,2,...,U}, C={1,...,C} and F={1,...,C·F_c} denote the set of small base stations, the set of users, the set of file categories and the set of files, where S, U, C and F_c are the number of small base stations, the number of users, the number of file categories and the number of files in each category, respectively. Let t_min and t_min′ denote the minimum communication time required to download a file through D2D and through a small base station, respectively. The macro base station holds all the files in the content library;

(2) Divide time into equal-length time slots, where t ∈ N denotes the t-th time slot, τ_t is its starting time, and every time slot has length T. At the start of each time slot the initial D2D connection state of the users in that slot is known: for each pair of users i and j, an indicator function equals 1 if they can conduct D2D communication at the start of time slot t and 0 otherwise. Each user then randomly requests a file according to its preferences, which forms a file request vector whose i-th entry is the file requested by user i in time slot t;

(3) Use an indicator variable to represent the physical relationship between users: it equals 1 if user i and user j have a physical relationship at time t, and 0 otherwise. Use μ_{i,j} to denote the parameter of the exponential distribution of the connection duration between user i and user j, and λ_{i,j} to denote the parameter of the exponential distribution of the interval duration between them. From the connection state of user i and user j at time t_0, calculate the probability that they are connected at time t_c;

(4) Use μ′_{u,s} and λ′_{u,s} to denote the parameters of the exponential distributions of the connection duration and the interval duration between user u and small base station s, respectively, and use an indicator variable to represent the physical relationship between user u and small base station s. From their connection state at time t_0, calculate the probability that user u and small base station s are connected at time t_c;

(5) Use S_{i,j} to denote the social relationship between user i and user j, and S_T to denote the social relationship threshold. Calculate the social relationship indicator s_{i,j} between users from S_{i,j} and S_T. Use θ_u to denote the social importance of user u, which measures how important the user is socially, and calculate each user's social importance as θ_u = α·V_u + β·B_u, where V_u and B_u denote the device capacity and the betweenness centrality of user u, and α and β are weight coefficients satisfying α+β=1. Select important users to cache files according to social importance;

(6) Use H={H_1, H_2,..., H_U} to denote the historical file requests over the T_b time slots before the decision time, where H_u is the request history of user u and its t_b-th entry is the file requested in the t_b-th of the previous T_b time slots. From the historical file requests H, calculate each user's count-based empirical probability distribution over the file categories, and use these distributions as the data set of the K-means algorithm;

(7) For each candidate value of K, calculate the sum D_K of the distances from all data points X to their cluster centers as the performance metric of the current K-means model, where the distance is the Euclidean distance;

(8) Calculate Gap(K) = E(log D_K) - log D_K as the Gap Statistic, where E(log D_K) is the expectation of log D_K, and select the value of K that maximizes Gap(K) as optK, the number of user classes;

(9) For each class of users, take its cluster center as the empirical probability distribution with which users of that class request each file category. Sort the entries of the cluster center in descending order and obtain the corresponding index vector. Take the logarithms of the top five values and of their ranks as the y and x data, respectively, and perform linear regression to obtain the Zipf distribution parameter s;

(10) Calculate the probability that users of this class request each file category from the fitted Zipf distribution. Assuming that preferences for the files within each category are uniformly distributed, calculate the user file preference (the probability distribution over all files), whose f-th entry is the probability that user u requests the f-th file;

(11) Repeat steps (9) to (10) until the file preferences of all optK user classes have been calculated, which yields the set of file preferences of all users;

(12) Let the cost of obtaining a file from oneself or from an important user through D2D communication be ξ_1, the cost of obtaining a file from a small base station be ξ_2, and the cost of obtaining a file from the macro base station be ξ_3. A user first tries to obtain the requested file from its own storage or from important users; failing that, from a small base station; and failing that, from the macro base station;

(13) Let N denote the number of important users and let x denote the cache placement strategy variables of all important users and small base stations. Derive the expression of the average system overhead f(x), initialize i=N+1, and let x_subopt be an all-zero vector of length (N+S)F;

(14) Let j=1 and let the set F_left={1,...,F};

(15) Let f_opt be the file in F_left selected by the greedy step, then set the ((i-1)F+f_opt)-th element of x_subopt to 1, remove f_opt from the set F_left, and finally set j=j+1;

(16) Repeat step (15) until j>V'_i.
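The greedy loop of steps (13) to (16) can be sketched as follows. This is a minimal illustration, not the patented procedure itself: the function name `greedy_placement`, the 0-based indexing, the order in which cache nodes are visited, the per-node `capacities` list and the toy cost function are all assumptions introduced for the example, and the selection rule (pick the file that gives the lowest cost) is the usual greedy criterion rather than a formula taken from the text.

```python
from typing import Callable, List, Sequence


def greedy_placement(cost: Callable[[List[int]], float],
                     capacities: Sequence[int],
                     num_files: int) -> List[int]:
    """Fill each cache node greedily, one file at a time.

    cost       : maps a 0-1 placement vector x to the average cost f(x)
    capacities : capacities[i] is the number of files node i may cache
    num_files  : F, the size of the file library
    Entry i * num_files + f of the result is 1 when node i caches file f
    (0-based indices, unlike the 1-based indices used in the text).
    """
    x = [0] * (len(capacities) * num_files)       # x_subopt, all zeros
    for i, cap in enumerate(capacities):
        left = set(range(num_files))              # F_left
        for _ in range(min(cap, num_files)):
            def cost_with(f: int) -> float:
                trial = x.copy()
                trial[i * num_files + f] = 1
                return cost(trial)
            # Greedy step: the remaining file that gives the lowest cost.
            f_opt = min(sorted(left), key=cost_with)
            x[i * num_files + f_opt] = 1
            left.remove(f_opt)
    return x


# Hypothetical toy cost: total demand of files cached nowhere.
demand = [0.5, 0.3, 0.2]


def toy_cost(x: List[int]) -> float:
    cached = {k % len(demand) for k, v in enumerate(x) if v}
    return sum(p for f, p in enumerate(demand) if f not in cached)


placement = greedy_placement(toy_cost, capacities=[1, 1], num_files=3)
```

With the toy cost, the first node caches the most requested file and the second node caches the next one, because duplicating a file does not lower the cost any further.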

Furthermore, the connection probability in step (3) is calculated as follows:

The connection probability in step (4) is calculated as:

Furthermore, the social relationship S_{i,j} in step (5) is calculated as follows:

$$S_{i,j} = \sum_{k \in A_i \cap A_j} \frac{1}{\log(\mathrm{frequency}(k))}$$

where A_i denotes the social attributes of user i (social attributes are the user's interest tags, groups, etc. on social networks) and frequency(k) is the number of users who share social attribute k. The rarer the social attributes shared by user i and user j, the closer their social relationship. The social relationship indicator s_{i,j} is determined as follows:

When S_{i,j}≥S_T, user i and user j are considered to have a social connection and s_{i,j}=1; otherwise there is none and s_{i,j}=0.

Furthermore, the betweenness centrality B_u is calculated as follows:

$$B_u = \sum_{i \neq u \neq j} \frac{b_{i,j}(g_u)}{b_{i,j}}$$

where b_{i,j} is the number of shortest paths between vertex i∈VU and vertex j∈VU in graph G_s, and b_{i,j}(g_u) is the number of those shortest paths that pass through the vertex of user u;

Furthermore, the empirical probability distribution in step (6) is calculated as:

where 1_A(x) is the indicator function, whose value is 1 if the condition x is true and 0 otherwise;

Furthermore, in the step (10)

The calculation formula is:

Where c _f is the file category to which file f belongs and satisfies

Represents the probability, obtained by fitting the Zipf distribution, that user u requests a file in category c_i.

Furthermore, the calculation formula of the average system overhead f(x) in step (13) is:

in

Beneficial effects: Compared with the prior art, the present invention optimizes the cache placement strategy of small base stations and important users with the goal of minimizing the average system overhead. It first predicts user preferences from the request history, then takes user mobility and social relationships into account and uses a pseudo-Boolean optimization method to optimize the cache decision. Its notable effects include the following aspects:

1. On the assumption that users with similar interests have basically the same file preferences, K-means is used to classify users into different types according to their historical file requests, and the empirical probability distribution with which each type requests the different categories of files is obtained. Because this distribution is inaccurate when historical data is limited, a Zipf distribution is fitted to the data to provide more accurate predictions of user file preferences.

2. From the user file preferences, user mobility, user social relationships and the cache contents of important users and small base stations, the probability that a user obtains the requested file from important users, small base stations or the macro base station in the next time slot is derived. The average system overhead is then derived, which yields a nonlinear integer programming problem that minimizes the average system overhead.

3. The nonlinear integer programming problem is NP-complete, and solving it is very complex. To reduce the complexity, the objective function of the problem is proved to be a monotonic supermodular function and the constraint to be a partition matroid, after which a polynomial-time greedy algorithm obtains the suboptimal cache decision.

Description of the drawings

Figure 1 is a schematic flow diagram of the method of the present invention;

Figure 2 is a schematic diagram of a system model of the method of the present invention;

Figure 3 is a comparison diagram of a caching strategy based on popularity, a random caching strategy and the proposed caching strategy;

Figure 4 is a comparison diagram of a cache strategy that does not consider mobility, a cache strategy that does not consider social relationships, and the proposed cache strategy;

Figure 5 is a comparison diagram of the suboptimal value and the optimal value in the embodiment.

Detailed description

In order to explain in detail the technical solutions disclosed in the present invention, further explanations are given below in conjunction with the drawings and specific embodiments of the specification.

In the heterogeneous network cache decision-making method based on user preference prediction provided by the present invention, macro base stations, small base stations and D2D communication coexist, and users are mobile and influenced by social relationships. First, when user preferences (the probability distribution with which users request different files) are unknown, machine learning is used to predict them from the users' request history. The expression of the average system cost is then derived by jointly considering user mobility and social relationships: the mobility of a user relative to other users and relative to the small base stations is described by equations (1) and (2) below, respectively, and the social relationship between users is described by equation (3) below. Under the cache capacity constraint, an optimization problem that minimizes the average system cost is constructed with the cache strategies of the small base stations and important users as variables, and the cache decision is made by solving this problem. Because the computational complexity is high when the number of important users is large, the objective function is first proved to be supermodular, and the optimization problem is then solved by a suboptimal algorithm based on the greedy algorithm, which reduces the complexity of the cache decision. The present invention proves that the resulting optimization problem is the minimization of a supermodular function over a partition matroid. The computational complexity of the cache decision is thereby greatly reduced while the performance of the suboptimal solution is guaranteed, and caching at small base stations and important users greatly reduces the system cost.

Specifically, the overall flow chart of the method of the present invention is shown in Fig. 1, and includes the following steps:

Step 1: Predict user preferences

As shown in Figure 2, there are multiple small base stations and multiple users in the coverage area of a macro base station. Suppose there are S small base stations in total in the coverage area of the macro base station, and every small base station s∈S={1,...,S} has the same cache capacity V_SBS; the coverage areas of the small base stations may overlap. There are U users in total in the coverage area of the macro base station, and the device capacity of each user u∈U={1,2,...,U} is V_u. The file library consists of C categories of files, where each category c∈C={1,...,C} contains F_c files, so the whole library has F=C·F_c files in total. Every file f∈F={1,...,C·F_c} is assumed to have the same size, and the minimum communication time required to download a file through D2D and through a small base station is t_min and t_min′, respectively. The macro base station is assumed to hold all the files in the content library. D2D communication can be carried out between users.

Time is divided into time slots of equal length, where t ∈ N denotes the t-th time slot, τ_t is its starting time, and every time slot has length T. At the beginning of each time slot, the macro base station can determine whether the distance between users meets the requirements of D2D communication, that is, the initial D2D connection state of the users in the current time slot: for each pair of users i and j, an indicator function equals 1 if they can perform D2D communication at the beginning of time slot t and 0 otherwise. Each user then randomly requests a file according to its preferences, which forms a file request vector whose i-th entry is the file requested by user i in time slot t. To simplify the model, each user is assumed to request a file at the beginning of the time slot.

There are three ways for users to obtain files.

The first is to obtain the file from the cache of a nearby important user through D2D communication, at system cost ξ_1;

The second is to obtain it from the cache of a small base station, at system cost ξ_2;

The third is to obtain it from the macro base station, at system cost ξ_3, where ξ_1<ξ_2<ξ_3. D2D communication is assumed to be one-to-many, that is, a user can send files to multiple users or receive files from multiple users at the same time; likewise, a user within the service range of several small base stations can communicate with all of them at the same time.

In the current time slot, the macro base station first estimates the users' initial D2D connection state in the next time slot from their initial D2D connection state in the current one, then jointly considers user mobility, social relationships and other factors to arrive at the optimal caching strategy for the next time slot, and finally places the files to be cached in advance.

Whether D2D communication can be established between two users depends not only on the physical relationship between them but also on their social relationship. The physical relationship between users is their physical distance. Because users are mobile, the physical distance between them changes constantly, and one user may move closer to or farther from another. D2D communication can only be established within a certain physical distance, so whether two users can establish D2D communication, that is, whether a physical relationship exists between them, can be treated as a probabilistic question. When the physical distance between two users is less than the maximum D2D communication distance, they are connected. The time for which they stay connected is called the connection duration; the time between two successive connections is called the interval duration. To model user mobility, both the connection duration and the interval duration are assumed to obey exponential distributions. A user can communicate with a small base station only while within its coverage, and although the location of the small base station is fixed, user mobility changes the relative distance between the two. As with D2D, exponential distributions are therefore also used to model the connection duration and interval duration between users and small base stations.

Define an indicator variable for the physical relationship between users: it equals 1 if user i and user j have a physical relationship at time t, and 0 otherwise. Define μ_{i,j} as the parameter of the exponential distribution obeyed by the connection duration between user i and user j, and λ_{i,j} as the parameter of the exponential distribution obeyed by the interval duration between them. Given the connection state of user i and user j at time t_0, the probability that they are connected at time t_c is given by equation (1).

Similarly, the connection duration and the interval duration between user u and small base station s are assumed to obey exponential distributions with parameters μ′_{u,s} and λ′_{u,s}, respectively, and an indicator variable represents the physical relationship between user u and small base station s. Given their connection state at time t_0, the probability that user u and small base station s are connected at time t_c is given by equation (2).
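To make the mobility model concrete, the sketch below computes a connection probability for a link whose connection and interval durations are exponential. It treats the link as a standard two-state (on/off) continuous-time Markov chain and uses the textbook transition probability of that chain; this is an assumption made for illustration and is not claimed to be the patent's equations (1) and (2), which are not reproduced in this text. The function name and argument names are hypothetical.

```python
import math


def prob_connected(connected_at_t0: bool, mu: float, lam: float,
                   dt: float) -> float:
    """P(link is up at t0 + dt | link state at t0) for an on/off link.

    mu  : rate of the exponential connection duration (up -> down)
    lam : rate of the exponential interval duration (down -> up)
    dt  : elapsed time t_c - t_0, must be non-negative
    Assumption: the link is a two-state continuous-time Markov chain.
    """
    total = mu + lam
    steady = lam / total                 # long-run fraction of time up
    decay = math.exp(-total * dt)        # memory of the initial state
    if connected_at_t0:
        return steady + (mu / total) * decay
    return steady * (1.0 - decay)


# A pair that is connected now is more likely to be connected shortly
# afterwards than a pair that is currently disconnected.
p_up = prob_connected(True, mu=1.0, lam=1.0, dt=0.5)
p_down = prob_connected(False, mu=1.0, lam=1.0, dt=0.5)
```

The same function applies to a user and a small base station by passing μ′_{u,s} and λ′_{u,s} as `mu` and `lam`.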

For security reasons, successfully establishing D2D communication also depends on social relationships: only users with a close social relationship are willing to establish D2D communication. Let S_{i,j} denote the social relationship between user i and user j. Using the Adamic/Adar method, it is calculated from the users' social attributes as:

$$S_{i,j} = \sum_{k \in A_i \cap A_j} \frac{1}{\log(\mathrm{frequency}(k))} \quad (3)$$

where A_i denotes the social attributes of user i (the user's interest tags, groups, etc. on social networks), and frequency(k) is the number of users who share social attribute k. The rarer the social attributes shared by user i and user j, the closer their social relationship, because rare attributes better reflect the characteristics and preferences of users. Define S_T as the social relationship threshold. Users i and j are considered to have a social connection only when S_{i,j}≥S_T, in which case s_{i,j}=1; otherwise s_{i,j}=0. The graph G_s(VU, E_s) describes the social connections between users, where VU is the set of users, E_s is the set of social connections, and an edge between two users represents a social connection between them.
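The Adamic/Adar score and the thresholding step can be sketched as below. The sketch assumes the usual form of the Adamic/Adar index (a sum of 1/log(frequency) over shared attributes) with the natural logarithm; the function names, the attribute data and the threshold value are hypothetical and only serve the example.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Set


def adamic_adar(attrs: Dict[Hashable, Set[str]], i: Hashable,
                j: Hashable) -> float:
    """Adamic/Adar score S_ij: rare shared attributes weigh more."""
    # frequency(k): number of users that have social attribute k
    freq = Counter(a for user_attrs in attrs.values() for a in user_attrs)
    shared = attrs[i] & attrs[j]
    # A shared attribute belongs to at least two users, so log(freq) > 0.
    return sum(1.0 / math.log(freq[k]) for k in shared)


def social_link(attrs: Dict[Hashable, Set[str]], i: Hashable, j: Hashable,
                threshold: float) -> int:
    """Indicator s_ij: 1 if S_ij >= S_T, otherwise 0."""
    return 1 if adamic_adar(attrs, i, j) >= threshold else 0


# Hypothetical interest tags of three users.
attrs = {"a": {"jazz", "chess"}, "b": {"jazz", "chess"}, "c": {"jazz"}}
s_ab = social_link(attrs, "a", "b", threshold=1.0)
s_ac = social_link(attrs, "a", "c", threshold=1.0)
```

Users a and b share the rarer tag "chess" as well as "jazz", so their score exceeds the threshold, while a and c share only the common tag "jazz" and fall below it.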

Caching files on a user's terminal device occupies the device's storage space. Because users are selfish, ordinary users are unwilling to cache files, and only important users hired by the operator act as cache nodes. The concept of social importance is introduced to measure how important a user is socially, and the operator selects the users with the greatest social importance as important users. Social importance is defined as:

θ_u = α·V_u + β·B_u, u=1,...,U (4)

where V_u and B_u denote the device capacity and the betweenness centrality of user u, respectively, and α and β are weighting coefficients satisfying α+β=1. Betweenness centrality is a concept commonly used in social network analysis to express how central a vertex of a social network is within the whole network. It is defined as:

$$B_u = \sum_{i \neq u \neq j} \frac{b_{i,j}(g_u)}{b_{i,j}} \quad (5)$$

where b_{i,j} is the number of shortest paths between vertex i∈VU and vertex j∈VU in graph G_s, and b_{i,j}(g_u) is the number of those shortest paths that pass through the vertex of user u.

It can be seen that the larger the capacity of a user's device and the more users who have social connections with it, the greater its social importance. Operators select important users according to the social importance of users, and rank users in descending order of social importance. Generally, operators select the top N users as important users.
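The ranking step can be illustrated with a short sketch that applies equation (4) and keeps the top N users. The betweenness values are taken as given inputs here; the function name, the tie-breaking by user id and the sample numbers are assumptions for the example, and the text does not say whether V_u and B_u are normalized before being combined.

```python
from typing import Dict, List, Tuple


def select_important_users(capacity: Dict[int, float],
                           betweenness: Dict[int, float],
                           alpha: float,
                           n_important: int
                           ) -> Tuple[List[int], Dict[int, float]]:
    """Rank users by theta_u = alpha * V_u + beta * B_u, keep the top N."""
    beta = 1.0 - alpha                       # weights satisfy alpha + beta = 1
    theta = {u: alpha * capacity[u] + beta * betweenness[u]
             for u in capacity}
    # Descending social importance; ties broken by user id (assumption).
    ranked = sorted(theta, key=lambda u: (-theta[u], u))
    return ranked[:n_important], theta


# Hypothetical device capacities V_u and betweenness values B_u.
capacity = {1: 4.0, 2: 2.0, 3: 3.0}
betweenness = {1: 0.0, 2: 3.0, 3: 0.5}
important, theta = select_important_users(capacity, betweenness,
                                          alpha=0.5, n_important=2)
```

In this example user 2 has the smallest device but the highest betweenness, which is enough to rank it first when both weights are 0.5.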

The users' file preferences play a crucial role in determining the cache placement strategy, but the file preference of each user is unknown. The macro base station only has the historical file requests H={H_1, H_2,..., H_U} of the users over the T_b time slots before the decision time, where H_u is the request history of user u and its t_b-th entry is the file requested in the t_b-th of the previous T_b time slots. From the historical file requests H, the count-based empirical probability distribution of each user over the file categories can be calculated using the indicator function 1_A(x), whose value is 1 if the condition x is true and 0 otherwise. Each entry of this distribution is the empirical probability, calculated from the number of requests, that user u requests a file of category c_i.
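A count-based empirical distribution of this kind can be sketched as follows. The sketch assumes that files are numbered 1 to C·F_c consecutively by category, so that file f belongs to category (f-1) // F_c (0-based); this numbering, the function name and the sample history are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence


def empirical_category_distribution(history: Sequence[int],
                                    files_per_category: int,
                                    num_categories: int) -> List[float]:
    """Fraction of a user's past requests that fall in each category.

    history            : requested file ids (1-based), one per time slot
    files_per_category : F_c
    num_categories     : C
    Assumption: files are numbered consecutively by category.
    """
    counts = Counter((f - 1) // files_per_category for f in history)
    return [counts[c] / len(history) for c in range(num_categories)]


# Hypothetical history of T_b = 6 requests with C = 3 and F_c = 4.
dist = empirical_category_distribution([1, 2, 5, 6, 6, 9],
                                       files_per_category=4,
                                       num_categories=3)
```

One such vector per user forms the data set that is passed to the clustering step.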

Because the request history is observed over only a small number of time slots, little data is available, and this empirical probability cannot accurately describe the true probability with which the user requests each category of file. The true probability distribution therefore has to be predicted from the empirical probability.

In real life, users fall into different types. For example, some users like science fiction movies best, while others prefer comedy shows. Users of the same type can be considered to have basically the same probability distribution. If the number of user types and the users belonging to each type can be determined accurately, not only is the number of probability distributions to be estimated reduced, but, because users of the same type are equivalent to a single user, the historical request data available for each user is effectively increased. This makes the empirical probability distribution more accurate, which helps in further predicting user file preferences.

The K-means method is used to classify users into types, with the value of K determined by the Gap Statistic method. The cluster centers obtained by K-means with this K are taken as the empirical probability distributions with which each type of user requests each category of file. These distributions are then used to further predict the users' file preferences.
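The two quantities used to choose K can be sketched as below: the within-cluster distance D_K and the Gap value. The sketch assumes that E(log D_K) is estimated by averaging log D_K over reference data sets, as in the usual Gap Statistic procedure; the source only states that it is an expectation. The function names and sample numbers are hypothetical, and the K-means fitting itself is left out.

```python
import math
from typing import Dict, Sequence


def within_cluster_distance(points: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            centers: Sequence[Sequence[float]]) -> float:
    """D_K: sum of Euclidean distances from each point to its center."""
    return sum(math.dist(p, centers[k]) for p, k in zip(points, labels))


def gap_value(reference_log_d: Sequence[float], d_k: float) -> float:
    """Gap(K) = E(log D_K) - log D_K, with E estimated by a sample mean."""
    return sum(reference_log_d) / len(reference_log_d) - math.log(d_k)


def choose_k(gap_by_k: Dict[int, float]) -> int:
    """optK: the candidate K with the largest Gap value."""
    return max(gap_by_k, key=gap_by_k.get)


# Hypothetical clustering of three points into two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
labels = [0, 0, 1]
centers = [(0.0, 1.0), (4.0, 0.0)]
d_2 = within_cluster_distance(points, labels, centers)
gap_2 = gap_value([math.log(4.0), math.log(4.0)], d_2)
```

In practice the Gap value is computed for every candidate K and the results are passed to `choose_k`.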

The Zipf distribution is widely used in mobile network caching research and is considered a good description of users' file preferences or file popularity (the probability distribution with which each file is requested by users). It is therefore used to fit the empirical probability distribution with which users request each category of file. The Zipf probability distribution is:

$$P_c = \frac{\mathrm{rank}(c)^{-s}}{\sum_{i=1}^{C} i^{-s}} \quad (7)$$

where P_c is the probability that the user requests a file of category c, rank(c)∈{1,...,C} is the popularity rank of category c, s is the Zipf distribution parameter, which describes the skewness of the user's preference, and C is the total number of categories.

The Zipf distribution is determined by the parameter s, so the fitting only needs to determine the value of s. Taking the logarithm of both sides of equation (7) and rearranging gives:

$$\log P_c = -s \log \mathrm{rank}(c) - \log \sum_{i=1}^{C} i^{-s}$$

The logarithm of the probability with which each category is requested is therefore linear in the logarithm of the category rank, with slope -s and intercept $-\log \sum_{i=1}^{C} i^{-s}$.

In the Zipf distribution the top-ranked categories account for the vast majority of requests. Therefore only the request probabilities of the top 5 categories in the empirical probability distribution of each type of user are considered: linear regression on the logarithms of these probabilities and of their ranks yields the Zipf distribution parameter s, and the request probability of every category is then calculated from its rank in the empirical probability distribution. Finally, the user's preferences for the files within each category are assumed to be uniformly distributed, which gives the predicted file preference of user u as a probability vector whose f-th entry is the probability that user u requests the f-th file. This entry is the fitted request probability of category c_f spread evenly over the files of that category, where c_f is the file category to which file f belongs.
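The fitting procedure described above can be sketched in three small functions: a least-squares fit of s on the log-log data of the top five categories, the Zipf probabilities assigned by rank, and the uniform split within each category. This is a minimal sketch under stated assumptions: the function names are hypothetical, zero probabilities are skipped before taking logarithms, and every category is assumed to contain the same number of files F_c.

```python
import math
from typing import List, Sequence


def fit_zipf_exponent(center: Sequence[float], top: int = 5) -> float:
    """Least-squares slope of log(probability) on log(rank), negated."""
    ranked = [p for p in sorted(center, reverse=True)[:top] if p > 0]
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(p) for p in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope                      # the regression slope equals -s


def zipf_category_probs(center: Sequence[float], s: float) -> List[float]:
    """Zipf probability of each category, ranked by the cluster center."""
    num = len(center)
    norm = sum(i ** -s for i in range(1, num + 1))
    order = sorted(range(num), key=lambda c: -center[c])
    probs = [0.0] * num
    for rank, c in enumerate(order, start=1):
        probs[c] = rank ** -s / norm
    return probs


def file_preferences(category_probs: Sequence[float],
                     files_per_category: int) -> List[float]:
    """Split each category's probability evenly over its F_c files."""
    return [p / files_per_category
            for p in category_probs for _ in range(files_per_category)]


# Hypothetical cluster center that follows a Zipf law with s = 1.
raw = [1 / 3, 1.0, 1 / 5, 1 / 2, 1 / 6, 1 / 4]
center = [r / sum(raw) for r in raw]
s_hat = fit_zipf_exponent(center)
prefs = file_preferences(zipf_category_probs(center, s_hat), 2)
```

Because the sample center is an exact Zipf distribution, the fitted exponent recovers s = 1 up to rounding, and the per-file preferences sum to one.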

Step 2: Construct the optimization problem that minimizes the average system overhead

Define the system cost for all users in time slot t to obtain the requested file as:

where ξ_u(t) is the cost for user u to obtain the requested file in time slot t, given by:

where case 1 means that user u can obtain the file in time slot t from itself or from an important user through D2D communication, case 2 means that user u can obtain it from a small base station, and case 3 means that user u obtains it from the macro base station.

In time slot t the cache placement strategy cannot be changed and the users' file requests are already determined, so the system cost ξ(t) is fixed. The task is to determine the cache placement strategy for time slot t+1 from the current D2D connections between users and the users' file preferences, so as to minimize the average system cost E(ξ(t+1)). For convenience, the time index is omitted below, and all quantities refer to time slot t+1 unless stated otherwise. The average system cost is expressed as:

Using the law of total probability, the average system cost can be expanded over the users and the files they may request, where the probability that user u requests file f in time slot t+1 is obtained from the user file preference prediction described in Step 1.

Suppose that N important users in total are selected in time slot t+1. The cache placement strategy of the important users consists of one placement vector per important user for time slot t+1; each entry of the n-th vector is a 0-1 variable that equals 1 when the n-th important user caches file f in time slot t+1 and 0 otherwise. The probability that the user obtains the requested file from itself or through D2D communication is:

The event A_{u,f,n} means that user u can obtain the requested file f from the n-th important user. The first equality holds because a user can establish D2D communication with several important users at the same time: as long as one of them can transfer file f completely, that is, the D2D communication time between the two is not less than t_min, the user can obtain the requested file from itself or through D2D communication, so case 1 is satisfied. The third equality holds because the events of obtaining the file from different important users are independent of one another. The probability of the event A_{u,f,n} is derived below:

The first equality holds because the average system cost is calculated during the cache placement stage of time slot t, when it is already known whether D2D communication is possible between users in time slot t, and because the event A_{u,f,n} is equivalent to the following conditions holding together: user u and important user IU_n are D2D-connected, the D2D communication duration t_d2d between them is not less than t_min, there is a social connection between them, and IU_n has cached the file f requested by the user. The second equality holds because one of the events has no effect on the probability of the preceding events and the conditioning information affects only a single event. The third equality holds because the social connection indicator and the cache policy variable are not random variables but deterministic values. In the fourth equality, the connection probability can be obtained from formula (1). Introducing a shorthand for this term to simplify the notation and substituting it into equation (15) gives:

Similarly to the important-user cache strategy, suppose the cache placement strategy of the small base stations is

where

is the cache placement strategy of small base station s in time slot t+1. Following the derivation of

we obtain:

where

Note that, because communication with small base stations does not take social relations into account,

and

differ by a multiplied indicator variable that represents the social relation.

Since users can always communicate with the macro base station, which holds all the files in the content library, we have:

Substituting equations (17) to (19) into equation (13), we obtain:

Based on this average system cost, the optimization problem can be formulated as:

The first constraint is the cache capacity limit of the small base stations, the second is the device capacity limit of the important users, and the third requires the cache placement variables of the small base stations and important users to be 0-1 variables.
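As a minimal sketch of how such an average cost could be evaluated, the function below assumes a three-tier product form that is consistent with the derivation above (self or D2D first, then small base station, then macro base station) but is not quoted from the formulas of this document. The names average_cost, pref, a, b and xi are hypothetical, a[u][n] and b[u][s] stand for the combined probabilities that user u can actually fetch a file from important user n or small base station s, and the cost values are purely illustrative.

```python
from math import prod


def average_cost(x_iu, x_sbs, pref, a, b, xi=(1.0, 5.0, 20.0)):
    """Expected cost summed over users, under an ASSUMED three-tier model.

    x_iu[n][f], x_sbs[s][f]: 0-1 cache placement variables.
    pref[u][f]: predicted probability that user u requests file f.
    a[u][n]: assumed probability that user u can fetch from important user n
             (D2D contact long enough AND social connection present).
    b[u][s]: assumed probability that user u can fetch from small base station s.
    xi: illustrative costs (xi1, xi2, xi3) with xi1 < xi2 < xi3.
    """
    xi1, xi2, xi3 = xi
    total = 0.0
    for u, prefs_u in enumerate(pref):
        for f, p in enumerate(prefs_u):
            # Probability that no important user can deliver file f to user u.
            miss_iu = prod(1 - a[u][n] * x_iu[n][f] for n in range(len(x_iu)))
            # Probability that no small base station can deliver it either.
            miss_sbs = prod(1 - b[u][s] * x_sbs[s][f] for s in range(len(x_sbs)))
            # xi1*P(case1) + xi2*P(case2) + xi3*P(case3), rearranged.
            total += p * (xi1 + (xi2 - xi1) * miss_iu
                          + (xi3 - xi2) * miss_iu * miss_sbs)
    return total


if __name__ == "__main__":
    pref = [[0.7, 0.3], [0.2, 0.8]]   # two users, two files
    a = [[1.0], [0.5]]                # user 0 is the important user itself
    b = [[0.9], [0.9]]                # one small base station
    print(average_cost([[0, 0]], [[0, 0]], pref, a, b))  # nothing cached
    print(average_cost([[1, 0]], [[0, 1]], pref, a, b))  # one file per node
```

With empty caches every request falls through to the macro base station, so the cost reduces to xi3 times the total request probability; any cached file can only lower it.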

Step3. Prove that the optimization problem is a problem of minimizing a monotonically decreasing supermodular function over a partition matroid:

Let

represent all cache placement variables of the important users and small base stations. The objective function of problem (21) can then be regarded as a function f(x) of x, namely:

where

takes values in [0,1] by definition. For the proof,

are uniformly expressed as

When 1≤k≤N,

represents

and when N+1≤k≤N+S,

represents

Likewise,

are uniformly expressed as

When 1≤k≤N,

represents

and when N+1≤k≤N+S,

represents

Formula (22) is then further simplified to:

The following proves that f(x) is a monotonically decreasing function of x.

Take any variable

and compute the first derivative of f(x) with respect to it. (i) When 1≤k≤N, the first derivative is:

Because ξ₁ < ξ₂ < ξ₃, we have ξ₂ - ξ₁ > 0 and ξ₃ - ξ₂ > 0; because all

satisfy

it follows that

and therefore, in this case,

(ii) When N+1≤k≤N+S, the first derivative is:

From the analysis of case (i), ξ₃ - ξ₂ > 0 and

so in this case we also have

Combining cases (i) and (ii), for any

we have

That is, f(x) is a monotonically decreasing function of x.
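The sign argument can be illustrated with a worked derivative, offered only as a sketch. It assumes that the simplified objective has the product form written first below, which is a reconstruction from the surrounding definitions and not a quotation of formula (22). The symbols p_{u,f} (request probability) and a_{u,k} (probability that user u can fetch from node k, with nodes 1..N the important users and N+1..N+S the small base stations) are hypothetical names, and every factor (1 - a_{u,k} x_{k,f}) lies in [0,1].

```latex
% Assumed product form of the objective (hypothetical notation):
f(x) = \sum_{u}\sum_{f} p_{u,f}\Big[\xi_1
     + (\xi_2-\xi_1)\prod_{n=1}^{N}\big(1-a_{u,n}x_{n,f}\big)
     + (\xi_3-\xi_2)\prod_{j=1}^{N+S}\big(1-a_{u,j}x_{j,f}\big)\Big]

% Case (i), 1 \le k \le N: x_{k,f} appears in both products.
\frac{\partial f}{\partial x_{k,f}} = -\sum_{u} p_{u,f}\,a_{u,k}\Big[
   (\xi_2-\xi_1)\prod_{n\le N,\;n\ne k}\big(1-a_{u,n}x_{n,f}\big)
 + (\xi_3-\xi_2)\prod_{j\le N+S,\;j\ne k}\big(1-a_{u,j}x_{j,f}\big)\Big] \le 0

% Case (ii), N+1 \le k \le N+S: x_{k,f} appears only in the second product.
\frac{\partial f}{\partial x_{k,f}} = -\sum_{u} p_{u,f}\,a_{u,k}\,
   (\xi_3-\xi_2)\prod_{j\le N+S,\;j\ne k}\big(1-a_{u,j}x_{j,f}\big) \le 0
```

Under this assumed form, each bracketed term is a product of non-negative factors multiplied by a positive cost difference, so both derivatives are non-positive, which matches the conclusion of cases (i) and (ii).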

The following proves that f(x) is a supermodular function.

Take any two variables

and compute the second derivative of f(x) with respect to them. (i) When f1≠f2, inspection of the expression of f(x) shows that no monomial in its polynomial expansion contains the factor

so the second derivative in this case is

(ii) When f1=f2=f and k1, k2 satisfy k1∈{1,...,N}, k2∈{1,...,N}, the second derivative is:

From the monotonicity analysis, we know that in formula (26), ξ₂ - ξ₁ > 0, ξ₃ - ξ₂ > 0, and

so in this case

(iii) When f1=f2=f and one of k1, k2 belongs to {N+1,...,N+S}, the second derivative is:

where ξ₃ - ξ₂ > 0 and

so in this case

Combining cases (i), (ii), and (iii), f(x) satisfies, for any two variables

the second-derivative condition

which always holds. By Proposition 1, the function f(x) is therefore a supermodular function.

Therefore, f(x) is a monotonically decreasing supermodular function.
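The conclusion can be spot-checked numerically on a tiny instance. The sketch below is not part of the proof: it brute-forces the set-function definitions (adding an element never raises the cost, and the marginal change for a smaller set is never larger than for a superset) for a cost of the assumed product form. The names cost, reach and is_monotone_supermodular, the instance data and the cost values are hypothetical.

```python
from itertools import combinations
from math import prod


def cost(cached, pref, reach, n_iu, xi=(1.0, 5.0, 20.0)):
    """Assumed cost model. cached is a set of (node, file) pairs; nodes with
    index < n_iu are important users, the remaining nodes are small base stations.
    reach[u][k] is the assumed probability that user u can fetch from node k."""
    xi1, xi2, xi3 = xi
    total = 0.0
    for u, prefs_u in enumerate(pref):
        for f, p in enumerate(prefs_u):
            miss = [(1 - reach[u][k]) if (k, f) in cached else 1.0
                    for k in range(len(reach[u]))]
            total += p * (xi1 + (xi2 - xi1) * prod(miss[:n_iu])
                          + (xi3 - xi2) * prod(miss))
    return total


def is_monotone_supermodular(ground, func, tol=1e-9):
    """Exhaustively check, for all A subset of B and e not in B, that
    func(A+e) - func(A) <= 0 and func(A+e) - func(A) <= func(B+e) - func(B)."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for small in subsets:
        for big in subsets:
            if not small <= big:
                continue
            for e in ground:
                if e in big:
                    continue
                gain_small = func(small | {e}) - func(small)
                gain_big = func(big | {e}) - func(big)
                if gain_small > tol or gain_small > gain_big + tol:
                    return False
    return True


if __name__ == "__main__":
    pref = [[0.6, 0.4], [0.3, 0.7]]
    reach = [[1.0, 0.8], [0.4, 0.8]]   # columns: one important user, one SBS
    ground = [(k, f) for k in range(2) for f in range(2)]
    print(is_monotone_supermodular(ground, lambda s: cost(s, pref, reach, n_iu=1)))
```

The exhaustive check grows exponentially with the number of (node, file) pairs, so it is only practical as a sanity test on very small instances.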

Define

When i∈{1,...,N}, EF_i = {1,...,F} is the ground set of the i-th important user and represents the files it may choose to cache; when i∈{N+1,...,N+S}, EF_i = {1,...,F} is the ground set of the (i-N)-th small base station and represents the files it may choose to cache. Clearly, each important user or small base station may choose to cache any file in F={1,...,F}. Define:

where V′_i denotes the cache capacity limit of an important user or small base station, that is, when i ∈ {1,...,N},

and when i∈{N+1,...,N+S},

Then the element

of LF physically represents a cache placement strategy of the important users or small base stations that satisfies the constraints of problem (21); in other words, LF is the set of all cache placement strategies of the important users and small base stations that satisfy the constraints of problem (21). Therefore, the constraints of problem (21) are equivalent to the partition matroid (EF, LF).

In summary, optimization problem (21) is a problem of minimizing a monotonically decreasing supermodular function over a partition matroid.
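The partition-matroid constraint amounts to a per-node counting rule, which the following minimal sketch makes concrete. The function name is_independent and the representation of a placement as a set of (node, file) pairs are assumptions made for illustration; capacity[i] plays the role of V′_i.

```python
from collections import Counter


def is_independent(cached, capacity):
    """Return True if the placement respects every node's capacity.

    cached: iterable of (node, file) pairs, node indexing into capacity.
    capacity: list where capacity[i] is the cache size limit of node i.
    A placement is independent in the partition matroid exactly when each
    node holds at most capacity[i] distinct files.
    """
    load = Counter(node for node, _file in set(cached))
    return all(load[node] <= capacity[node] for node in load)


if __name__ == "__main__":
    capacity = [1, 2]                                        # node 0: one file, node 1: two files
    print(is_independent({(0, 0), (1, 0), (1, 1)}, capacity))  # within limits
    print(is_independent({(0, 0), (0, 1)}, capacity))          # node 0 over its limit
```

Because the rule only counts files per node, every subset of a feasible placement is also feasible, which is the hereditary property a matroid requires.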

Step4. Solving the optimization problem:

Based on the greedy algorithm, a local greedy cache algorithm for solving the cache placement strategy is designed. The specific steps are as follows:

1): Let N represent the number of important users, and let

represent all cache placement variables of the important users and small base stations; derive the expression of the average system cost f(x); initialize i=N+1 and let x_subopt be an all-zero vector of length (N+S)F;

2): Let j = 1, let the set F _left = {1,...,F};

3): Let

4): Repeat step (3) until j>V′ _i ;

5): Assign i the values N+2,...,N+S,1,...,N in that order, and execute steps (2) to (4) after each assignment.
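The steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the selection rule in step 3) picks, among the files not yet cached at the current node, the one whose caching gives the lowest average cost, and it uses an assumed product-form cost. The names average_cost, local_greedy, reach and capacity, the zero-based node indexing, and the cost values are hypothetical.

```python
from math import prod


def average_cost(x, pref, reach, n_iu, xi=(1.0, 5.0, 20.0)):
    """Assumed cost model; x[k][f] in {0, 1}, nodes k < n_iu are important
    users and the remaining nodes are small base stations."""
    xi1, xi2, xi3 = xi
    total = 0.0
    for u, prefs_u in enumerate(pref):
        for f, p in enumerate(prefs_u):
            miss = [1 - reach[u][k] * x[k][f] for k in range(len(x))]
            total += p * (xi1 + (xi2 - xi1) * prod(miss[:n_iu])
                          + (xi3 - xi2) * prod(miss))
    return total


def local_greedy(pref, reach, n_iu, capacity, xi=(1.0, 5.0, 20.0)):
    """Fill each node's cache one file at a time, small base stations first."""
    n_nodes, n_files = len(capacity), len(pref[0])
    x = [[0] * n_files for _ in range(n_nodes)]

    def cost_with(node, f):
        # Cost if file f were additionally cached at this node.
        x[node][f] = 1
        value = average_cost(x, pref, reach, n_iu, xi)
        x[node][f] = 0
        return value

    # Node order of step 5): small base stations, then important users.
    order = list(range(n_iu, n_nodes)) + list(range(n_iu))
    for node in order:
        files_left = set(range(n_files))            # step 2)
        for _ in range(min(capacity[node], n_files)):  # step 4)
            f_opt = min(sorted(files_left), key=lambda f: cost_with(node, f))
            x[node][f_opt] = 1                       # step 3)
            files_left.remove(f_opt)
    return x


if __name__ == "__main__":
    pref = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
    reach = [[1.0, 0.8], [0.2, 0.8]]   # columns: one important user, one SBS
    print(local_greedy(pref, reach, n_iu=1, capacity=[1, 1]))
```

Each node needs at most V′_i passes over the remaining files, so the number of cost evaluations grows only linearly in the number of nodes, in contrast to an exhaustive search over all feasible placements.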

Figure 3 compares the system costs obtained with three different methods. From top to bottom, the first curve corresponds to the system cost of random caching, a strategy that places files at random into the caches of the IUs and SBSs until they are full. The second curve shows the system cost of popularity-based caching, a widely used strategy that caches the most popular files at each cache node. To implement it, after predicting the preferences of all users, we take the average of all user preferences as the global file popularity, and every IU and SBS places the most popular files in its cache until the cache is full. The bottom curve shows the system cost of the proposed suboptimal caching strategy. Random caching performs worst because it ignores user preferences and caches files at random; its system cost is far greater than that of the popularity-based strategy and of the proposed strategy. Popularity-based caching performs much better than random caching, but because it does not jointly optimize the different IUs and SBSs, its system cost is higher than that of the proposed strategy, and this gap grows as the number of IUs increases.
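The popularity-based baseline described above can be sketched in a few lines. The function name popularity_caching and the data layout are assumptions for illustration; pref[u][f] is the predicted preference of user u for file f and capacity[i] is the cache size of node i.

```python
def popularity_caching(pref, capacity):
    """Every node caches the globally most popular files until it is full.

    Global popularity is the average of all users' predicted preferences.
    Returns x with x[i][f] = 1 if node i caches file f.
    """
    n_users, n_files = len(pref), len(pref[0])
    popularity = [sum(p[f] for p in pref) / n_users for f in range(n_files)]
    # Files sorted from most to least popular.
    ranked = sorted(range(n_files), key=lambda f: -popularity[f])
    return [[1 if f in ranked[:cap] else 0 for f in range(n_files)]
            for cap in capacity]


if __name__ == "__main__":
    pref = [[0.6, 0.3, 0.1], [0.3, 0.2, 0.5]]
    print(popularity_caching(pref, capacity=[1, 2]))
```

Every node ends up with the same top-ranked files regardless of which users it can reach, which is the lack of joint optimization that the comparison attributes to this baseline.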

Figure 4 demonstrates the necessity of considering mobility and sociality in caching strategies. The figure contains three curves. The top curve shows the system cost of the caching strategy obtained without considering mobility. It is obtained as follows: first, mobility is removed from the scenario, that is, if a user can communicate with another user or an SBS in the current time slot, it is assumed that they can also communicate in the next time slot. The local greedy caching algorithm is then applied to this modified scenario to obtain a caching strategy that does not consider mobility, and that strategy is applied to the scenario with mobility to obtain the corresponding system cost. Because this strategy ignores mobility and takes the connections of the current time slot as those of the next time slot, its system cost is greater than that of the proposed caching strategy. The middle curve shows the system cost of a caching strategy that does not consider sociality. It is obtained as follows: first, sociality is removed from the scenario, that is, if two users physically meet the requirements of D2D communication, they can establish D2D communication regardless of whether they have a social relationship. The local greedy caching algorithm is then applied to this scenario to obtain a caching strategy that does not consider sociality, and that strategy is applied to the scenario with sociality to obtain the corresponding system cost. Although the system cost of this strategy is basically the same as that of the proposed strategy when there are few important users, it becomes larger than that of the proposed strategy as the number of important users increases, and the gap keeps widening. This is because the strategy ignores the fact that some users cannot communicate with each other because their social relationship is not close enough, so some files placed at users are useless: without a social relationship, the people around a user may be unwilling to communicate with that user.

Figure 5 compares the system cost of the proposed suboptimal caching strategy with that of the optimal caching strategy. The optimal caching strategy is obtained by variable substitution: the nonlinear integer programming problem is transformed into a linear integer programming problem, which is then solved with standard linear integer programming tools. Because the optimization problem is NP-complete, the comparison scenario contains only one SBS and between 1 and 4 important users to keep the computational complexity manageable. The suboptimal value is obtained with the proposed method. The gap between the optimal value and the suboptimal value is very small.

Claims

A heterogeneous network cache decision-making method based on user preference prediction, characterized in that macro base station, small base station, and D2D communication modes coexist, and the method comprises the following steps:

(S1) First, when the probability distribution of the user requesting different files is unknown, predict user preferences based on user request history through machine learning;

(S2) Derive the expression of the average system cost based on the users' mobility, physical location relationships, and social relationships; under the cache capacity constraints, construct an optimization problem that minimizes the average system cost with the cache strategies of the small base stations and important users as variables; the cache decision is made by solving this problem;

(S3) Solve the optimization problem of minimizing the average system cost with a suboptimal algorithm based on the greedy algorithm, and determine the files to be cached according to the solution vector.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 1, wherein the processing of the method is specifically as follows:

(1) Use S={1,...,S}, U={1,2,...,U}, C={1,...,C} and F={1,...,C*F_c} to represent the small base station set, the user set, the file category set and the file set, where S, U, C and F_c denote the number of small base stations, the number of users, the number of file categories, and the number of files in each category, respectively; t_min and t_min′ denote the minimum communication time required to download a file through D2D and through a small base station, respectively; the macro base station holds all the files in the content library;

(2) Divide time into equal-length time slots, where t ∈ N denotes the t-th time slot, its starting time is τ_t, and all time slots have length T. At the start of each time slot, the initial D2D connections of the users in the current time slot are given by
where the indicator function
indicates whether user i and user j can conduct D2D communication at the start of time slot t, taking the value "1" or "0"; each user then randomly requests a file according to its preferences, forming the file request vector R^t = {r_i^t : i=1,...,U}, where r_i^t ∈ F is the file requested by user i in time slot t;

(3) Use the indicator variable
to represent the physical relationship between users: if user i and user j have a physical relationship at time t, then
and if not, then
Define μ_{i,j} as the parameter of the exponential distribution obeyed by the connection time between user i and user j, and λ_{i,j} as the parameter of the exponential distribution obeyed by the interval time between user i and user j. According to the connection between user i and user j at time t_0,
calculate the probability that user i and user j are connected at time t_c

(4) Define μ′_{u,s} and λ′_{u,s} as the parameters of the exponential distributions obeyed by the connection time and the interval time between user u and small base station s, respectively. The indicator variable
represents the physical relationship between user u and small base station s. Based on the connection at time t_0,
calculate the probability that user u and small base station s are connected at time t_c

(5) Define S_{i,j} as the social relationship between user i and user j and S_T as the social relationship threshold; from S_{i,j} and S_T, calculate the social connection s_{i,j} between the users. Let θ_u denote the social importance of user u; for each user, calculate θ_u = α·V_u + β·B_u, where V_u and B_u denote the device capacity and the betweenness centrality of user u, and α and β are weight coefficients satisfying α+β=1; select important users to cache files according to social importance;

(6) Construct H={H_1, H_2,..., H_U} to represent the historical file requests of the T_b time slots before the decision time, where
represents the request history of user u and
is the file requested in the t_b-th of the previous T_b time slots; from the historical file requests H, calculate each user's empirical probability distribution over the file categories by counting requests, and use
to represent the probability of user u requesting a file of category c_i, which serves as the data set of the K-means algorithm;

(7) Calculate the sum of the distances from all data points to their clustering center points under different K values as a measure of the performance of the current K-means model. The calculation expression is as follows:

where x is a data point vector, M_i denotes the cluster center of class i, and the distance used is the Euclidean distance;

(8) Calculate Gap(K) = E(log D_K) - log D_K as the Gap Statistic, where E(log D_K) is the expectation of log D_K; select the value optK of K that maximizes Gap(K) as the number of user classes;

(9) For each class of users, take its cluster center as the empirical probability distribution with which users of that class request each category of files; sort the components of the cluster center in descending order to obtain the corresponding index vector; take the logarithms of the sorted values and of their ranks as the y and x data, and perform linear regression to obtain the Zipf distribution parameter s;

(10) Calculate the probability of this type of user requesting each type of file, and the calculation expression is as follows:

where c represents the user category and rank(c) represents the rank of the number of requests for the c-th category of files. The probability of requesting each individual file is then calculated as follows:

where
represents the probability of user u requesting the f-th file;

(11) Repeat steps (9) to (10) until the file preferences of all optK classes of users have been calculated, obtaining the set of file preferences of all users

(12) Let the cost of obtaining a file from the user itself or from an important user through D2D communication be ξ_1, the cost of obtaining a file from a small base station be ξ_2, and the cost of obtaining a file from the macro base station be ξ_3; a user first tries to obtain the requested file from its own storage or from important users, then from a small base station, and, if neither can provide it, from the macro base station;

(13) Let N represent the number of important users, and let
represent all cache placement variables of the important users and small base stations, where the Boolean variable
indicates whether important user n caches file f and the Boolean variable
indicates whether small base station s caches file f; derive the expression of the average system cost f(x); initialize i=N+1 and let x_subopt be an all-zero vector of length (N+S)F;

(14) Let j = 1 and let the set F_left = {1,...,F};

(15) Let
then set the ((i-1)F+f_opt)-th element of x_subopt to 1, remove the element f_opt from the set F_left, and finally set j=j+1;

(16) Repeat step (15) until j>V′_i.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: in step (3), the probability that user i and user j are connected at time t_c
is calculated as follows:

where
represents the physical relationship between user i and user j,
indicates that user i and user j have a physical relationship at time t,
indicates that user i and user j do not have a physical relationship at time t, μ_{i,j} is the parameter of the exponential distribution obeyed by the connection time between user i and user j, λ_{i,j} is the parameter of the exponential distribution obeyed by the interval time between user i and user j, and
indicates the connection between user i and user j at time t_0.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: in step (4), the probability that user u and small base station s are connected at time t_c
is calculated as follows:
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: the social relationship S_{i,j} in step (5) is calculated as follows:

where A_i represents the social attributes of user i and Frequency(k) represents the number of users who have social attribute k; the rarer the social attributes shared by user i and user j, the closer their social relationship. The social connection s_{i,j} is determined as follows:

When S_{i,j} ≥ S_T, it is determined that there is a social connection between user i and user j, and s_{i,j}=1; otherwise, it is determined that there is no social connection between user i and user j, and s_{i,j}=0;

The betweenness centrality B_u is calculated as follows:

where b_{i,j} represents the number of shortest paths between vertex i ∈ VU and vertex j ∈ VU in graph G_s, and b_{i,j}(g_u) represents the number of those shortest paths between vertex i ∈ VU and vertex j ∈ VU in graph G_s that pass through vertex V_u.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: the empirical probability distribution in step (6)
is calculated as follows:

where 1_A(x) denotes the indicator function, whose value is 1 if the condition x holds and 0 otherwise.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: in step (10),
is calculated as follows:

where c_f is the file category to which file f belongs and satisfies
represents the total number of files in category c_f, and
represents the probability, obtained by fitting the Zipf distribution, that user u requests a file in category c_i.
The heterogeneous network cache decision-making method based on user preference prediction according to claim 2, characterized in that: the average system cost f(x) in step (13) is calculated as follows:

where