WO2021212444A1

WO2021212444A1 - VOD service cache replacement method based on random forest algorithm in edge network environment

Info

Publication number: WO2021212444A1
Application number: PCT/CN2020/086550
Authority: WO
Inventors: 张晖; 孙叶钧; 赵海涛; 孙雁飞; 倪艺洋; 朱洪波
Original assignee: 南京邮电大学
Priority date: 2020-04-20
Filing date: 2020-04-24
Publication date: 2021-10-28
Also published as: CN111629216B; JP7098204B2; JP2022530175A; CN111629216A

Abstract

Disclosed is a VOD service cache replacement method based on a random forest algorithm in an edge network environment. The method comprises the following steps: collecting video data; processing missing values of the video data using a random forest filling method and establishing a prediction model; predicting the average access duration by means of the prediction model; establishing a cache replacement model according to the prediction result; and solving the cache replacement model using an implicit enumeration method to obtain the final replacement scheme. Considering that an edge server must process a large amount of video information and that machine learning offers excellent analysis capability for big data, the invention first uses a random forest algorithm to predict the weekly average access duration of each video. On this basis, a new video cache replacement model is provided and solved using an implicit enumeration method, so that the edge server reduces the load of the core network to the greatest extent. The scheme is simple, easy to implement, and has good application prospects.

Description

VOD service cache replacement method based on random forest algorithm in edge network environment

Technical field

The invention belongs to the technical field of edge networks, and in particular relates to a VOD service cache replacement method based on a random forest algorithm in an edge network environment.

Background art

With the development of science and technology, ports and equipment of various standards, together with a wide range of services and applications, are connected to the Internet. This has caused explosive growth in service requests and, in turn, a surge in network data traffic, driven mainly by the growth of video traffic. The core network plays an important part in distributing and providing services. One of its main functions is to route requests that enter the network through devices and interfaces of different standards to the appropriate service networks according to business requirements, so that each request receives the service it needs. Its other main function is to act as a service provider and process the business requests submitted through the various interfaces. The core network itself contains a number of different service networks, and when a business request arrives, the core network must serve it. As business volume explodes, the volume of service that the core network must provide rises sharply, placing it under tremendous load.

The edge network is the part of the network closest to the user. On the one hand, it shares the pressure of processing service requests with the core network; on the other hand, it moves service provision out to the edge. If the edge network is able to handle the services a request requires, they are processed on the edge network side. However, because the computing power of the edge network is limited, offloading as much traffic as possible from the core network depends on improving service efficiency, and edge caching is the key to doing so. Edge caching means caching the resources that services use frequently on the edge server, so that when the related requests arrive again the resources can be obtained directly from the cache. Requests that the edge server cannot satisfy are served from the core network.

In addition, with the advent of the era of big data, the efficient acquisition of knowledge through machine learning has gradually become one of the main driving forces for technological development in various fields, and the field of edge networks is no exception. With the explosive growth of data, a variety of new data types that need to be analyzed are also emerging, such as semantic understanding, image analysis and network data analysis, so machine learning plays an extremely important role in the big data environment.

Most existing cache replacement solutions still use video popularity as one of the main criteria and add auxiliary criteria such as video similarity, so as to reduce repeated caching of similar and low-popularity videos. Video popularity reflects the number of visits to a video per unit of time. For video services, however, a high total number of visits to the videos cached in the edge server does not mean that the server takes on a large share of the core network's load. Video visit duration, which represents the time for which a video is actually used, better reflects the load borne by the edge server. Combined with auxiliary criteria such as video volume, it yields a better cache replacement result.

Summary of the invention

Purpose of the invention: to overcome the shortcomings of the prior art by providing a VOD service cache replacement method based on a random forest algorithm in an edge network environment.

Technical solution: to achieve the above objective, the present invention provides a VOD service cache replacement method based on a random forest algorithm in an edge network environment, which includes the following steps:

S1: Collect video data;

S2: Use random forest filling method to process missing values of video data and establish a prediction model;

S3: Predict the average visit duration through the predictive model;

S4: Establish a cache replacement model according to the prediction result;

S5: Use implicit enumeration to solve the cache replacement model and get the final replacement scheme.

Further, the establishment of the prediction model in step S2 specifically includes:

The average visit duration is used as the dependent variable and the remaining features as independent variables for regression training. The data set is divided, the importance ranking of the features is output, and features are selected according to the ranking to obtain the final modeling features, from which the prediction model is built.

Further, the establishment process of the cache replacement model in step S4 is specifically as follows:

Assume that the cache space size of the edge server is $S$ and that the videos in the test set that cannot be cached by the edge server are stored in the cloud. The set of predicted access durations of all videos in the test set is $T=\{t_1,t_2,\ldots,t_K\}$ and the set of video volumes is $V=\{v_1,v_2,\ldots,v_K\}$, where $K$ is the total number of videos in the test set. Before cache replacement, $R$ videos are cached in the edge server and $Q$ videos are stored in the cloud, with $K=R+Q$. The cache replacement model is established as shown in the following formula:

$$\max\ \sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j}$$

$$\text{s.t.}\quad \sum_{j}(1-b_j)\,v_j\le\sum_{i}(1-a_i)\,v_i,\qquad \sum_{i}a_i v_i+\sum_{j}(1-b_j)\,v_j\le S,\qquad a_i,b_j\in\{0,1\} \tag{1}$$

where $i$ ranges over the videos cached in the edge server, $j$ ranges over the videos stored in the cloud, and the values of $a_i$ and $b_j$ that attain the maximum form the best cache replacement solution for the videos. $a_i$ represents the $i$-th video in the edge server: $a_i=0$ means that video $i$ needs to be replaced, and $a_i=1$ means that video $i$ does not need to be replaced. $b_j$ represents the $j$-th video in the cloud: $b_j=0$ means that video $j$ no longer stays in the cloud and is replaced into the edge server, and $b_j=1$ means that video $j$ is still stored in the cloud and is not replaced into the edge server.

The term $a_i t_i/v_i$ represents the cost-effectiveness of a video in the edge server, with access duration as the criterion. There are two cases: when $a_i=0$ the term is 0 and has no practical meaning; when $a_i=1$ it is the ratio of the access duration of video $i$ to the volume of video $i$.

The ratio $t_i/v_i$ is defined as the cost-effectiveness of cache replacement for video $i$. Similarly, the term $(1-b_j)\,t_j/v_j$ represents the cost-effectiveness of cache replacement for cloud video $j$; when $b_j=1$ this term is 0 and has no practical meaning.

Further, the solution process of the cache replacement model in step S5 is:

Let the total access duration cost performance be:

$$TC=\sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j} \tag{2}$$

Assume that the capacity of the edge server is $S$ and that the new total access duration cost performance calculated each time is $TC$. In order to reduce the number of enumerations, let the initial condition be

$$\{a_1,a_2,\ldots,a_K,\ b_1,b_2,\ldots,b_Q\}=\{1,1,\ldots,1,\ 1,1,\ldots,1\}$$

where the $\{a_1,a_2,\ldots,a_K\}$ part is the set of videos cached before cache replacement and the $\{b_1,b_2,\ldots,b_Q\}$ part is the initial set of videos stored in the cloud. Substituting the initial condition into formula (2) gives the initial total access duration cost performance $TC_0$, and a new constraint is added:

$$TC>TC_0 \tag{3}$$

Constraint (3) and the two constraints in the cache replacement model are evaluated iteratively to obtain the optimal replacement scheme.

Further, the iterative calculation is specifically:

Constraint (3) is regarded as constraint ①, and the two constraints in the cache replacement model are regarded as constraint ② and constraint ③ respectively. The specific calculation process is as follows:

1) Replace one cached video in the set $\{a_1,a_2,\ldots,a_K\}$, working from back to front, that is, change the video's $a_i=1$ to $a_i=0$;

2) Traverse the set $\{b_1,b_2,\ldots,b_Q\}$ from back to front and calculate the new total access duration cost performance $TC$;

3) Compare $TC$ with $TC_0$. If $TC\ge TC_0$, set $TC_0$ to the new value, that is, let $TC_0=TC$, and proceed to step 4; otherwise return to step 1 for the next iteration, with $TC_0$ unchanged;

4) Check constraint ②. If it is satisfied, proceed to step 5; otherwise return to step 1 for the next iteration, with $TC_0$ unchanged;

5) Check constraint ③. If it is satisfied, this iteration meets all the constraints and $TC_0$ takes the new value. Pruning is performed here, that is, traversal of the set $\{b_1,b_2,\ldots,b_Q\}$ stops and the next iteration starts from step 1.

The present invention takes into account that the edge server needs to process a large amount of video information and that machine learning offers excellent analysis capability for big data. It uses the random forest algorithm to predict the weekly average access duration of each video and, on this basis, proposes a new video cache replacement scheme. On the one hand, the scheme models with the random forest algorithm and achieves high prediction accuracy; on the other hand, it is simple, easy to implement, and has good application prospects.

Beneficial effects: compared with the prior art, the present invention takes into account that the edge server needs to process a large amount of video information and that machine learning offers excellent analysis capability for big data. The access duration is predicted, a new video cache replacement model is proposed on this basis, and the model is solved by implicit enumeration. For a given edge server capacity, the weekly average access duration of the videos cached in the edge server is maximized. Since access duration represents the load that the edge server shares for the core network, the replacement model of the present invention minimizes the load of the core network for a given edge server capacity. The scheme is simple, easy to implement, and has good application prospects.

Description of the drawings

Figure 1 is a schematic flow diagram of the method of the present invention;

Figure 2 is a schematic diagram of cache replacement;

Figure 3 is a comparison diagram of the predicted weekly average visit duration of videos and the actual weekly average visit duration;

Figure 4 is a comparison diagram of the cost performance of the predicted weekly average visit duration and of the actual weekly average visit duration;

Figure 5 is a graph showing how the prediction accuracy of the weekly average visit duration and the prediction accuracy of the weekly average visit duration cost performance change over time;

Figure 6 is a graph showing changes in cache replacement rate and weekly access duration increase rate over time.

Detailed description

The present invention will be further clarified below in conjunction with the drawings and specific embodiments.

As shown in Figure 1, the present invention provides a VOD service cache replacement method based on a random forest algorithm in an edge network environment. It comprises three main parts: 1. using random forest to model and predict video access duration; 2. building a cache replacement model based on the prediction results; 3. solving the cache replacement model with the implicit enumeration method. The specific process is as follows:

1. Using random forest algorithm to model and predict the weekly average visit time of VOD videos

(1) Sample video data collection and preprocessing

Randomly collect 100,000 pieces of video information from the movie library of a video playback platform to obtain a sample data set, and preprocess the video data in it. Taking the week as the unit, the data of each video are averaged over the week. The video information includes online time, movie popularity ranking, popularity, number of likes, number of comments, rating, video access duration and so on. Values are kept to one decimal place. For data that cannot be fractional, such as the movie popularity ranking and the number of days online, the average is rounded to the nearest whole number. For videos that have been online for less than a week, the data for the remaining days are filled with 0. Visit duration refers to the duration of continuous access: if the interval between two visits in the access log is less than 60 seconds, for example because the user made an accidental operation or skipped an advertisement, playback is not considered to have stopped and the gap is not counted as an interruption.
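The weekly averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the function `weekly_average` and the field names (`likes`, `popularity_rank`, `days_online`) are hypothetical, and each daily record is assumed to be a dictionary of numeric values.

```python
# Hypothetical names for fields that cannot be fractional; the text names
# the popularity ranking and the number of days online as examples.
INTEGER_FIELDS = {"popularity_rank", "days_online"}


def weekly_average(daily_records, days_per_week=7):
    """Average the daily records of one video into a single weekly record.

    daily_records: list of dicts (field -> number), one per day online.
    Days on which the video was not yet online are counted as 0.
    """
    if not daily_records:
        raise ValueError("need at least one daily record")
    missing_days = days_per_week - len(daily_records)
    weekly = {}
    for field in daily_records[0]:
        values = [rec[field] for rec in daily_records] + [0] * missing_days
        avg = sum(values) / days_per_week
        # Integer-only fields are rounded to a whole number,
        # all other fields are kept to one decimal place.
        weekly[field] = round(avg) if field in INTEGER_FIELDS else round(avg, 1)
    return weekly


two_days = [
    {"likes": 10, "popularity_rank": 3},
    {"likes": 20, "popularity_rank": 5},
]
week = weekly_average(two_days)
```

Note that Python's built-in `round` rounds exact halves to the nearest even value, which may differ from the rounding convention intended in the text.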

(2) Modeling and prediction using random forest algorithm

Next, the random forest filling method is used to process the missing values of the data. If a certain feature has missing values, that feature is treated as the label and the remaining features form a new feature matrix. If other features also have missing values, all features are traversed, starting with the feature that has the fewest missing values, because the fewer values a feature is missing, the less accurate information is needed to fill it. While one feature is being filled, the missing values of the other features are replaced with 0. After each pass of the loop, the number of features with missing values decreases by one.
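The filling loop described above can be sketched as follows. To keep the example free of third-party dependencies, it uses a stand-in regressor, `MeanRegressor`, which simply predicts the mean of the known values and is not a random forest. In practice a random forest regressor with the same `fit`/`predict` interface (for example scikit-learn's `RandomForestRegressor`) would be passed in instead. The names `fill_missing` and `make_regressor` are hypothetical, and missing values are assumed to be stored as `None`.

```python
class MeanRegressor:
    """Stand-in with a fit/predict interface; NOT a random forest."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fill_missing(rows, features, make_regressor):
    """Fill None entries feature by feature, fewest missing values first."""
    filled = [dict(r) for r in rows]  # work on a copy
    missing = {f: sum(r[f] is None for r in rows) for f in features}
    order = sorted((f for f in features if missing[f] > 0), key=missing.get)
    for target in order:
        others = [f for f in features if f != target]

        def to_x(row):
            # Other features that are still missing are temporarily set to 0.
            return [0 if row[f] is None else row[f] for f in others]

        known = [r for r in filled if r[target] is not None]
        unknown = [r for r in filled if r[target] is None]
        model = make_regressor()
        model.fit([to_x(r) for r in known], [r[target] for r in known])
        for row, value in zip(unknown, model.predict([to_x(r) for r in unknown])):
            row[target] = value  # one fewer feature with missing values
    return filled


rows = [
    {"likes": 10, "score": 8.0},
    {"likes": None, "score": 6.0},
    {"likes": 30, "score": None},
    {"likes": 20, "score": None},
]
filled = fill_missing(rows, ["likes", "score"], MeanRegressor)
```

Here `likes` is filled first because it has fewer missing values than `score`, and the already filled `likes` column is then available when `score` is filled.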

Take 60% of the data set as the training set and 40% as the test set. Online time, movie popularity ranking, popularity, number of likes, number of comments and rating are the independent variables, and the weekly average visit duration is the prediction target; a prediction model is built on them to obtain predicted values. The feature importances are output, less important features are removed to reduce model complexity, and the parameters are tuned until the prediction accuracy reaches a satisfactory value. The final model is then used to predict the weekly average visit duration of each video for the following week.

2. Establish a cache replacement model

Assume that the cache space size of an edge server is $S$ and that the videos in the test set that cannot be cached by the edge server are stored in the cloud. The set of predicted access durations of all videos in the test set is $T=\{t_1,t_2,\ldots,t_K\}$ and the set of video volumes is $V=\{v_1,v_2,\ldots,v_K\}$, where $K$ is the total number of videos in the test set. Before cache replacement, $R$ videos are cached in the edge server and $Q$ videos are stored in the cloud, with $K=R+Q$. A schematic diagram of cache replacement is shown in Figure 2; the order of cache replacement in the figure does not mean that actual replacement proceeds in that order. The cache replacement model is established as follows:

$$\max\ \sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j}$$

$$\text{s.t.}\quad \sum_{j}(1-b_j)\,v_j\le\sum_{i}(1-a_i)\,v_i,\qquad \sum_{i}a_i v_i+\sum_{j}(1-b_j)\,v_j\le S,\qquad a_i,b_j\in\{0,1\} \tag{1}$$

where $i$ ranges over the videos cached in the edge server, $j$ ranges over the videos stored in the cloud, and the values of $a_i$ and $b_j$ that attain the maximum form the best cache replacement solution for the videos. $a_i$ represents the $i$-th video in the edge server: $a_i=0$ means that video $i$ needs to be replaced, and $a_i=1$ means that video $i$ does not need to be replaced. $b_j$ represents the $j$-th video in the cloud: $b_j=0$ means that video $j$ no longer stays in the cloud and is replaced into the edge server, and $b_j=1$ means that video $j$ is still stored in the cloud and is not replaced into the edge server.

The term $a_i t_i/v_i$ represents the cost-effectiveness of a video in the edge server, with access duration as the criterion. There are two cases: when $a_i=0$ the term is 0 and has no practical meaning; when $a_i=1$ it is the ratio of the access duration of video $i$ to the volume of video $i$, a value that weighs the access duration against the volume of the video.

Suppose that the predicted access duration of video $i$ is very high but its volume is also very large, so that it occupies a large amount of the edge server's cache. If there are many such videos, the number of videos that can be cached in the edge server is greatly reduced and the cache replacement effect cannot be guaranteed. The ratio $t_i/v_i$ is therefore defined as the cost-effectiveness of cache replacement for video $i$, and the purpose of the optimization is to maximize the cost-effectiveness of video cache replacement. Similarly, the term $(1-b_j)\,t_j/v_j$ represents the cost-effectiveness of cache replacement for cloud video $j$: when $b_j=1$ the term is 0 and has no practical meaning; when $b_j=0$ its physical meaning is the same as above.

The first constraint states that the total volume of the videos replaced from the cloud into the edge server cannot exceed the total volume of the videos replaced out of the edge server's cache; otherwise the space freed in the edge server is not enough to hold the incoming videos. The second constraint states that the total volume of the videos that are not replaced in the edge server plus the videos replaced into the edge server from the cloud cannot exceed the cache space of the edge server.
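As a small worked illustration of the model, the sketch below evaluates the objective and the two constraints for a given assignment of $a_i$ and $b_j$. It follows the term-by-term description in the text (duration-to-volume ratios summed over videos kept in the edge server and videos brought in from the cloud); the function names and the representation of each video as a `(predicted_duration, volume)` pair are assumptions made for the example.

```python
def cost_performance(a, b, edge, cloud):
    """Objective value: sum of t/v over kept edge videos (a_i = 1)
    and over cloud videos brought into the edge server (b_j = 0)."""
    kept = sum(ai * t / v for ai, (t, v) in zip(a, edge))
    brought_in = sum((1 - bj) * t / v for bj, (t, v) in zip(b, cloud))
    return kept + brought_in


def is_feasible(a, b, edge, cloud, capacity):
    """Check the two volume constraints described in the text."""
    incoming = sum((1 - bj) * v for bj, (_, v) in zip(b, cloud))
    evicted = sum((1 - ai) * v for ai, (_, v) in zip(a, edge))
    kept = sum(ai * v for ai, (_, v) in zip(a, edge))
    # First constraint: incoming videos must fit in the space freed by eviction.
    # Second constraint: kept plus incoming videos must fit in the cache.
    return incoming <= evicted and kept + incoming <= capacity


# Each video is a (predicted_duration, volume) pair; values are made up.
edge = [(10, 2), (4, 2)]
cloud = [(12, 2), (2, 2)]
capacity = 4

no_change = cost_performance([1, 1], [1, 1], edge, cloud)
one_swap = cost_performance([1, 0], [0, 1], edge, cloud)
```

In this toy instance, swapping the second edge video for the first cloud video raises the objective while still satisfying both constraints, whereas bringing in a cloud video without evicting anything violates the first constraint.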

3. Use the implicit enumeration method to solve the cache replacement model

The above model is essentially a 0-1 integer programming problem and is solved with the implicit enumeration method: only part of the combinations of the variables taking the value 0 or 1 are checked, and the objective function values are compared to find the optimal solution.
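The general idea of implicit enumeration for a 0-1 program can be sketched as follows: start from a known feasible solution, and for every other 0/1 combination evaluate the objective first, checking the constraints only when the objective beats the best value found so far. This is a generic textbook-style sketch for small instances, not the ordered one-for-one procedure of the invention. The function name, the `(predicted_duration, volume)` video representation and the assumption that the initial cache contents fit in the cache are all illustrative.

```python
from itertools import product


def implicit_enumeration(edge, cloud, capacity):
    """Solve the 0-1 model on a small instance using a filter condition."""
    n_edge = len(edge)

    def objective(a, b):
        return (sum(ai * t / v for ai, (t, v) in zip(a, edge))
                + sum((1 - bj) * t / v for bj, (t, v) in zip(b, cloud)))

    def feasible(a, b):
        incoming = sum((1 - bj) * v for bj, (_, v) in zip(b, cloud))
        evicted = sum((1 - ai) * v for ai, (_, v) in zip(a, edge))
        kept = sum(ai * v for ai, (_, v) in zip(a, edge))
        return incoming <= evicted and kept + incoming <= capacity

    # Initial feasible solution: nothing is replaced (all variables equal 1).
    # Assumes the videos currently cached fit within the capacity.
    best_a, best_b = (1,) * n_edge, (1,) * len(cloud)
    best_value = objective(best_a, best_b)

    for bits in product((1, 0), repeat=n_edge + len(cloud)):
        a, b = bits[:n_edge], bits[n_edge:]
        value = objective(a, b)
        if value <= best_value:
            continue  # filter condition fails, so the constraints are not checked
        if feasible(a, b):
            best_a, best_b, best_value = a, b, value  # tighten the filter
    return best_a, best_b, best_value


edge = [(10, 2), (4, 2)]
cloud = [(12, 2), (2, 2)]
best_a, best_b, best_value = implicit_enumeration(edge, cloud, capacity=4)
```

The saving comes from skipping the constraint checks for every combination that cannot improve on the current best; the number of combinations visited is still exponential, which is why the text adds ordering and pruning.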

First, a feasible solution is found and a filter condition is generated. The filter condition requires that the objective function value be better than that of the feasible solutions already calculated. Let the total access duration cost performance be:

$$TC=\sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j} \tag{2}$$

Let the initial condition be

$$\{a_1,a_2,\ldots,a_K,\ b_1,b_2,\ldots,b_Q\}=\{1,1,\ldots,1,\ 1,1,\ldots,1\}$$

where the $\{a_1,a_2,\ldots,a_K\}$ part of the set is the set of videos cached before cache replacement and the $\{b_1,b_2,\ldots,b_Q\}$ part is the initial set of cloud videos. Substituting the initial condition into formula (2) gives the initial total access duration cost performance $TC_0$, and a new constraint is added:

$$TC>TC_0 \tag{3}$$

where $TC$ is the total access duration cost performance obtained after each iteration. In order to prune effectively during iteration and maximize replacement efficiency, the variables are arranged by coefficient: the variables in the set $\{a_1,a_2,\ldots,a_K\}$ are arranged in descending order of cost-effectiveness coefficient, and the variables in the set $\{b_1,b_2,\ldots,b_Q\}$ are arranged in descending order of cost-effectiveness coefficient. Both parts of the set are traversed from right to left. The purpose of this ordering is to replace the videos with lower cost-effectiveness first and, when replacing, to start from the cloud videos with higher cost-effectiveness, which achieves the pruning effect.

The newly added constraint (3) is regarded as constraint ①, and the constraints in the cache replacement model (1) are regarded as constraint ② and constraint ③ in sequence. The calculation process is as follows:

(1) Replace one cached video in the set $\{a_1,a_2,\ldots,a_K\}$, working from back to front, that is, change the video's $a_i=1$ to $a_i=0$;

(2) Traverse the set $\{b_1,b_2,\ldots,b_Q\}$ from back to front and calculate the new total access duration cost performance $TC$;

(3) Compare $TC$ with $TC_0$. If $TC\ge TC_0$, set $TC_0$ to the new value, that is, let $TC_0=TC$, and continue with step (4); otherwise return to step (1) for the next iteration, with $TC_0$ unchanged;

(4) Check constraint ②. If it is satisfied, proceed to step (5); otherwise return to step (1) for the next iteration, with $TC_0$ unchanged;

(5) Check constraint ③. If it is satisfied, this iteration meets all the constraints and $TC_0$ takes the new value. Pruning is performed here, that is, traversal of the set $\{b_1,b_2,\ldots,b_Q\}$ stops and the next iteration starts from step (1).

In the above iterative process, the video in the set $\{b_1,b_2,\ldots,b_Q\}$ that changes from 1 to 0 is the one that replaces the video in the set $\{a_1,a_2,\ldots,a_K\}$ that changes from 1 to 0 at the same time. In actual video replacement, a video may, because of its large size, be replaced by two, three or more videos at once. The case of replacing one video with multiple videos is not considered here; that is, when traversing the set $\{b_1,b_2,\ldots,b_Q\}$, simultaneous changes of two or more bits in $\{b_1,b_2,\ldots,b_Q\}$ are not considered. This greatly reduces the number of iterations and the amount of calculation, and the optimal replacement scheme is finally obtained.
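One possible reading of the one-for-one iteration above is sketched below. Several points are assumptions rather than statements from the text: edge videos are tried for eviction from lowest cost-effectiveness upward, cloud videos are tried from highest cost-effectiveness downward (following the stated purpose of the ordering), a swap is accepted only if the strict filter $TC>TC_0$ and both volume constraints hold, and a rejected swap is undone. The function name and the `(predicted_duration, volume)` representation are hypothetical.

```python
def one_for_one_replacement(edge, cloud, capacity):
    """Greedy one-for-one swaps with pruning; a sketch, not the exact method."""
    a = [1] * len(edge)    # 1 = stays in the edge server
    b = [1] * len(cloud)   # 1 = stays in the cloud

    def ratio(video):
        duration, volume = video
        return duration / volume

    def total_cost_performance():
        return (sum(ai * ratio(e) for ai, e in zip(a, edge))
                + sum((1 - bj) * ratio(c) for bj, c in zip(b, cloud)))

    def constraints_hold():
        incoming = sum((1 - bj) * c[1] for bj, c in zip(b, cloud))
        evicted = sum((1 - ai) * e[1] for ai, e in zip(a, edge))
        kept = sum(ai * e[1] for ai, e in zip(a, edge))
        return incoming <= evicted and kept + incoming <= capacity

    tc0 = total_cost_performance()  # initial condition: nothing replaced
    evict_order = sorted(range(len(edge)), key=lambda i: ratio(edge[i]))
    fetch_order = sorted(range(len(cloud)), key=lambda j: ratio(cloud[j]),
                         reverse=True)
    for i in evict_order:
        for j in fetch_order:
            if b[j] == 0:
                continue  # this cloud video was already brought in
            a[i], b[j] = 0, 0  # tentative swap: evict i, bring in j
            tc = total_cost_performance()
            if tc > tc0 and constraints_hold():
                tc0 = tc
                break  # prune: stop traversing the cloud set for this i
            a[i], b[j] = 1, 1  # undo the rejected swap
    return a, b, tc0


edge = [(10, 2), (4, 2)]
cloud = [(12, 2), (2, 2)]
a, b, tc0 = one_for_one_replacement(edge, cloud, capacity=4)
```

Because only single swaps are tried, the number of candidate combinations grows with the product of the two set sizes rather than exponentially, at the cost of possibly missing solutions that replace one video with several.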

This embodiment uses simulation results on existing data to illustrate the cache replacement effect of the present invention. First, the prediction effect of the random forest algorithm is evaluated. Let the test video set be $c=\{c_1,c_2,\ldots,c_K\}$, the set of predicted weekly average visit durations be $t=\{t_1,t_2,\ldots,t_K\}$, and the set of actual weekly average visit durations be $t'=\{t'_1,t'_2,\ldots,t'_K\}$. The prediction accuracy of the weekly average visit duration is:

$$P_{at}=1-\frac{\sum_{i=1}^{K}\left|t_i-t'_i\right|}{\sum_{i=1}^{K}t'_i} \tag{4}$$

The second term of the above formula represents the ratio of the prediction error of the visit duration to the actual total visit duration; the smaller this value, the better the prediction. The comparison between the predicted weekly average visit duration and the actual weekly average visit duration is shown in Figure 3. The calculated value is $P_{at}=95.1\%$.
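The accuracy measure described above (one minus the ratio of the total prediction error to the actual total visit duration) can be computed as in the sketch below. The use of absolute differences for the error and the function name are assumptions; the sample values are made up.

```python
def prediction_accuracy(predicted, actual):
    """1 - (sum of absolute prediction errors) / (sum of actual durations)."""
    total_error = sum(abs(p - a) for p, a in zip(predicted, actual))
    return 1 - total_error / sum(actual)


predicted = [10.0, 20.0]
actual = [12.0, 18.0]
accuracy = prediction_accuracy(predicted, actual)
```

With these values the total error is 4 against an actual total of 30, so the accuracy is a little under 87%.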

Let the set of predicted weekly average visit duration cost performance values be $tp=\{tp_1,tp_2,\ldots,tp_K\}$ and the set of actual weekly average visit duration cost performance values be $tp'=\{tp'_1,tp'_2,\ldots,tp'_K\}$. The prediction accuracy of the weekly average visit duration cost performance is then defined as:

$$P_{tp}=1-\frac{\sum_{i=1}^{K}\left|tp_i-tp'_i\right|}{\sum_{i=1}^{K}tp'_i} \tag{5}$$

The comparison between the cost performance of the predicted weekly average visit duration and that of the actual weekly average visit duration is shown in Figure 4. The calculated value is $P_{tp}=94.7\%$.

The above results indicate that the random forest prediction in the present invention is highly accurate. Next, the replacement effect of the cache replacement model is verified by simulation. Assume that the video set cached before cache replacement is $c$, that $u$ is the number of videos cached in the edge server, and that the video set after cache replacement is $c'$. The video cache replacement rate is defined as:

The calculated value is $P_{re}=11.6\%$.

Assume that the weekly average access durations of the videos cached in the edge server before cache replacement are $t_c=\{t_1,t_2,\ldots,t_u\}$ and that those of the videos cached in the edge server after cache replacement are $t_{c'}=\{t'_1,t'_2,\ldots,t'_u\}$. The growth rate of access duration is defined as follows:

$$P_t=\frac{\sum_{i=1}^{u}t'_i-\sum_{i=1}^{u}t_i}{\sum_{i=1}^{u}t_i} \tag{7}$$

Equation (7) is the ratio of the difference between the sum of the weekly average access durations after cache replacement and the sum before cache replacement to the sum before cache replacement. If $P_t\le 0$, the access duration of the videos after cache replacement is less than or equal to that before cache replacement, that is, the load shared by the edge server for the core network has not increased or has decreased, and the cache replacement effect is very poor. If $P_t>0$, the access duration of the videos after cache replacement is longer than before, that is, the edge server shares more load for the core network after cache replacement; the larger $P_t$ is, the more load the edge server shares for the core network. The calculated value is $P_t=8.7\%$, indicating that the cache replacement model of the present invention effectively increases the load shared by the edge server for the core network.
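The growth rate described above, the change in total weekly average access duration relative to the total before replacement, can be computed as in this short sketch. The function name and the sample values are illustrative only.

```python
def access_duration_growth(before, after):
    """(sum after replacement - sum before) / sum before."""
    total_before = sum(before)
    return (sum(after) - total_before) / total_before


before = [100.0, 100.0]
after = [100.0, 120.0]
growth = access_duration_growth(before, after)
```

A positive result means the cached videos are expected to be watched for longer after replacement, so the edge server takes on more of the core network's load; a result of zero or below means the replacement did not help.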

The simulation results of the prediction model and the cache replacement model over time, week by week, are shown in Figure 5 and Figure 6. The prediction accuracy of the weekly average access duration and the prediction accuracy of the weekly average access duration cost performance decrease over time, while the cache replacement rate and the growth rate of access duration increase over time. The cache replacement rate rises relatively quickly, but the curves are stable overall and show no large fluctuations. Therefore, in practical application, the present invention can lower the update frequency of the algorithm and save computing resources.

Claims

1. A VOD service cache replacement method based on a random forest algorithm in an edge network environment, characterized in that it comprises the following steps:

S1: Collect video data;

S2: Use random forest filling method to process missing values of video data and establish a prediction model;

S3: Predict the average visit duration through the predictive model;

S4: Establish a cache replacement model according to the prediction result;

S5: Use implicit enumeration to solve the cache replacement model and get the final replacement scheme.
2. The VOD service cache replacement method based on a random forest algorithm in an edge network environment according to claim 1, wherein the establishment of the prediction model in step S2 specifically comprises:

The average visit duration is used as the dependent variable and the remaining features as independent variables for regression training. The data set is divided, the importance ranking of the features is output, and features are selected according to the ranking to obtain the final modeling features, from which the prediction model is built.
3. The VOD service cache replacement method based on a random forest algorithm in an edge network environment according to claim 1, wherein the establishment process of the cache replacement model in step S4 is specifically:

Assume that the cache space size of the edge server is $S$ and that the videos in the test set that cannot be cached by the edge server are stored in the cloud. The set of predicted access durations of all videos in the test set is $T=\{t_1,t_2,\ldots,t_K\}$ and the set of video volumes is $V=\{v_1,v_2,\ldots,v_K\}$, where $K$ is the total number of videos in the test set. Before cache replacement, $R$ videos are cached in the edge server and $Q$ videos are stored in the cloud, with $K=R+Q$. The cache replacement model is established as shown in the following formula:

$$\max\ \sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j}$$

$$\text{s.t.}\quad \sum_{j}(1-b_j)\,v_j\le\sum_{i}(1-a_i)\,v_i,\qquad \sum_{i}a_i v_i+\sum_{j}(1-b_j)\,v_j\le S,\qquad a_i,b_j\in\{0,1\} \tag{1}$$

where $i$ ranges over the videos cached in the edge server, $j$ ranges over the videos stored in the cloud, and the values of $a_i$ and $b_j$ that attain the maximum form the best cache replacement solution for the videos. $a_i$ represents the $i$-th video in the edge server: $a_i=0$ means that video $i$ needs to be replaced, and $a_i=1$ means that video $i$ does not need to be replaced. $b_j$ represents the $j$-th video in the cloud: $b_j=0$ means that video $j$ no longer stays in the cloud and is replaced into the edge server, and $b_j=1$ means that video $j$ is still stored in the cloud and is not replaced into the edge server.

The term $a_i t_i/v_i$ represents the cost-effectiveness of a video in the edge server, with access duration as the criterion. There are two cases: when $a_i=0$ the term is 0 and has no practical meaning; when $a_i=1$ it is the ratio of the access duration of video $i$ to the volume of video $i$.

The ratio $t_i/v_i$ is defined as the cost-effectiveness of cache replacement for video $i$. Similarly, the term $(1-b_j)\,t_j/v_j$ represents the cost-effectiveness of cache replacement for cloud video $j$; when $b_j=1$ this term is 0 and has no practical meaning.
4. The VOD service cache replacement method based on a random forest algorithm in an edge network environment according to claim 3, characterized in that the solution process of the cache replacement model in step S5 is:

Let the total access duration cost performance be:

$$TC=\sum_{i} a_i\frac{t_i}{v_i}+\sum_{j}(1-b_j)\frac{t_j}{v_j} \tag{2}$$

Assume that the capacity of the edge server is $S$ and that the new total access duration cost performance calculated each time is $TC$. In order to reduce the number of enumerations, let the initial condition be

$$\{a_1,a_2,\ldots,a_K,\ b_1,b_2,\ldots,b_Q\}=\{1,1,\ldots,1,\ 1,1,\ldots,1\}$$

where the $\{a_1,a_2,\ldots,a_K\}$ part is the set of videos cached before cache replacement and the $\{b_1,b_2,\ldots,b_Q\}$ part is the initial set of videos stored in the cloud. Substituting the initial condition into formula (2) gives the initial total access duration cost performance $TC_0$, and a new constraint is added:

$$TC>TC_0 \tag{3}$$

Constraint (3) and the two constraints in the cache replacement model are evaluated iteratively to obtain the optimal replacement scheme.
5. The VOD service cache replacement method based on a random forest algorithm in an edge network environment according to claim 4, wherein the iterative calculation is specifically:

Constraint (3) is regarded as constraint ①, and the two constraints in the cache replacement model are regarded as constraint ② and constraint ③ respectively. The specific calculation process is as follows:

1) Replace one cached video in the set $\{a_1,a_2,\ldots,a_K\}$, working from back to front, that is, change the video's $a_i=1$ to $a_i=0$;

2) Traverse the set $\{b_1,b_2,\ldots,b_Q\}$ from back to front and calculate the new total access duration cost performance $TC$;

3) Compare $TC$ with $TC_0$. If $TC\ge TC_0$, set $TC_0$ to the new value, that is, let $TC_0=TC$, and proceed to step 4; otherwise return to step 1 for the next iteration, with $TC_0$ unchanged;

4) Check constraint ②. If it is satisfied, proceed to step 5; otherwise return to step 1 for the next iteration, with $TC_0$ unchanged;

5) Check constraint ③. If it is satisfied, this iteration meets all the constraints and $TC_0$ takes the new value. Pruning is performed here, that is, traversal of the set $\{b_1,b_2,\ldots,b_Q\}$ stops and the next iteration starts from step 1.