CN109978627B

CN109978627B - Modeling method for big data of broadband access network user surfing behavior

Info

Publication number: CN109978627B
Application number: CN201910250704.7A
Authority: CN
Inventors: 张崇富; 倪明; 易子川; 水玲玲; 迟锋; 刘黎明; 张智
Original assignee: University of Electronic Science and Technology of China; University of Electronic Science and Technology of China Zhongshan Institute
Current assignee: University of Electronic Science and Technology of China; University of Electronic Science and Technology of China Zhongshan Institute
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2023-08-08
Anticipated expiration: 2039-03-29
Also published as: CN109978627A

Abstract

The invention discloses a modeling method for big data of a broadband access network user surfing behavior, which provides a semi-supervised learning method based on a clustering algorithm and a regression algorithm for network traffic.

Description

Modeling method for big data of broadband access network user surfing behavior

Technical Field

The invention relates to the field of big data mining analysis and big data modeling, in particular to a modeling method for big data of surfing behavior of a broadband access network user.

Background

With the development of the Internet, the number of network users keeps growing. In telecommunications there are hundreds of millions of broadband access network users, and the data they generate are rich and diverse: user basic data (such as user ID, user home location, date of birth, and occupation), Internet surfing behavior data (such as traffic volume, online time, browsing content, and search keywords), and location data (such as regional climate, total regional economic output, and the number of broadband access network users in the region). Users' rising expectations of network quality and network services push network operators and service providers to keep improving service quality and adding new services. In addition, analyzing and mining broadband access network user data in a timely and effective manner reveals how different user groups are distributed in space and provides guidance for optimizing server construction.

To improve user satisfaction and offer users the services and information that interest them, the behavior of network users must be analyzed and their surfing characteristics and interests explored, so that user needs are understood in depth; this also provides an important information channel for network marketing. To improve network quality, the operating and service condition of the network must be understood in depth: network traffic must be monitored continuously, the network structure and bandwidth adjusted, network problems resolved, and large volumes of network traffic data processed effectively. To optimize server construction, information such as the revenue of different services, regional differences, changes in service demand, and the distribution and proportion of different user groups must be known; to meet the bandwidth and service needs of the different user groups in a region, the proportions in which those groups are distributed across server locations are obtained, thereby reducing the infrastructure cost of servers.

To make the most of customer resources, a number of methods for analyzing and predicting the behavior of telecommunication broadband access network users already exist. For predicting continuous user traffic data, various supervised learning methods have been proposed. Some studies treat network traffic as a linear process and predict it with linear models such as the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model, and the fractional autoregressive integrated moving average (FARIMA) model. However, as networks grow more complex, their traffic characteristics go beyond the Poisson or Markov distributions conventionally assumed, so prediction with linear models is theoretically flawed and its accuracy is hard to guarantee. Nonlinear prediction models mainly include artificial neural networks, support vector machines, and gray models. Although they improve prediction accuracy to some extent over linear models, the accuracy is still not ideal. Neural networks tend to become trapped in local optima, and their network structure is hard to determine; support vector machines need few samples, but their key parameters are hard to determine; and gray models are suitable only when the data do not change sharply. A new modeling method is therefore needed to optimize server construction and improve network quality and user satisfaction.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a modeling method for big data of the surfing behavior of a broadband access network user.

The technical scheme adopted for solving the technical problems is as follows:

A modeling method for big data of broadband access network user surfing behavior comprises the following steps:

S1, acquiring Internet surfing behavior data of broadband access network users, performing quality evaluation on the data, and screening out high-quality data.

S2, preprocessing the screened high-quality data, carrying out user region division and marking on the screened high-quality data by using an unsupervised algorithm, and searching for the association relation between the Internet surfing behavior of the user and each data field by using an association algorithm in unsupervised learning in combination with the basic data and the position data of the user.

S3, predicting the time-flow of the marked user group by using a regression model in supervised learning to obtain the flow trend of each user group, and obtaining the total flow trend by statistical calculation.

S4, obtaining distribution conditions and quantity of each marked user group through statistical calculation, and obtaining the requirement characteristics of different user groups on the server.
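Step S3 can be illustrated with a minimal sketch. It assumes an ordinary least-squares line as the regression model (the patent does not name a specific regressor), and the group histories and traffic units below are hypothetical. It fits one line per marked user group on six consecutive days and extrapolates to the seventh, then sums the group predictions to obtain a total.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_group_traffic(history: Dict[str, List[float]],
                          next_step: float) -> Dict[str, float]:
    """Fit one line per marked user group and extrapolate to next_step."""
    predictions = {}
    for group, traffic in history.items():
        steps = [float(i) for i in range(1, len(traffic) + 1)]
        slope, intercept = fit_line(steps, traffic)
        predictions[group] = slope * next_step + intercept
    return predictions


# Hypothetical daily traffic (arbitrary units) for days 1 to 6 of a week.
history = {
    "U1": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "U2": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    "U3": [30.0, 28.0, 26.0, 24.0, 22.0, 20.0],
}
per_group = predict_group_traffic(history, next_step=7.0)

# The total trend is obtained by summing the per-group predictions.
total = sum(per_group.values())
```

Any other regression model could be substituted for `fit_line`; the point of the sketch is only the per-group prediction followed by aggregation.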

Further, in the step S1, according to the basic information (such as the flow size and the data type) of the broadband access network user internet surfing behavior data, a chart is drawn, and the data is analyzed through the chart, so that unnecessary data is removed, and high-quality data is obtained.

Further, in step S2, the high-quality data are divided into two data packets, DataBill and FactorBill. DataBill comprises two training sets, built by taking the Internet surfing behavior (such as online time, browsing content, and search keywords) in the user Internet surfing behavior data as the feature vector, together with the Internet surfing traffic (uplink and downlink traffic), and training on the data separately in time and in space with an unsupervised algorithm. FactorBill consists of user basic data and location data.
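The division into DataBill and FactorBill might be sketched as follows. The field names are hypothetical, since the patent names the categories of data but not a concrete schema, and the unsupervised training over time and space is not shown here.

```python
from typing import Any, Dict, Tuple

# Hypothetical field names; only the categories come from the description.
BEHAVIOR_FIELDS = ("online_time", "browsing_content", "search_keywords")
TRAFFIC_FIELDS = ("uplink_traffic", "downlink_traffic")
FACTOR_FIELDS = ("user_id", "home_location", "birth_date", "occupation",
                 "income", "weather", "region")


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one raw record into a DataBill part and a FactorBill part."""
    data_bill = {k: record[k] for k in BEHAVIOR_FIELDS + TRAFFIC_FIELDS if k in record}
    factor_bill = {k: record[k] for k in FACTOR_FIELDS if k in record}
    return data_bill, factor_bill


# Hypothetical record for illustration only.
record = {
    "user_id": "u-001",
    "occupation": "teacher",
    "region": "region A",
    "online_time": 95,
    "search_keywords": ["weather"],
    "uplink_traffic": 12.5,
    "downlink_traffic": 340.0,
}
data_bill, factor_bill = split_record(record)
```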

The beneficial effects of the invention are as follows. The method uses a clustering algorithm to dynamically subdivide and mark regional users in the user Internet surfing data, performs regional traffic prediction for the subdivided user groups, and derives a regional server construction scheme through statistical calculation. These results are provided to the marketing, operations, and infrastructure departments respectively, which promotes integration among the departments of a telecommunications operator. Building optimized servers effectively improves network quality and gives users a better Internet surfing experience.

Drawings

The invention will be further described with reference to the drawings and examples.

FIG. 1 is a diagram of modeling steps of the present invention;

FIG. 2 is a block diagram of the overall modeling of the present invention;

FIG. 3 is a block diagram of part A of FIG. 2;

FIG. 4 is a block diagram of part B of FIG. 2;

FIG. 5 is a block diagram of part C of FIG. 2;

FIG. 6 is a ratio diagram of different data types for data quality assessment of the present invention;

FIG. 7 is a ratio plot of each type of data in the analyzable data of the data quality assessment of the present invention;

FIG. 8 is a bar graph of the time coverage length of each type of data for data quality assessment of the present invention;

FIG. 9 is a time distribution plot of total data volume for data quality assessment of the present invention;

FIG. 10 is a block diagram of DataBill and FactorBill for data preprocessing of the present invention.

Detailed Description

Referring to fig. 1 to 10, a modeling method for big data of broadband access network user internet surfing behavior includes the following steps:

In step S2, the high-quality data are divided into two data packets, DataBill and FactorBill. DataBill comprises two training sets, built by taking the Internet surfing behavior (such as online time, browsing content, and search keywords) in the user Internet surfing behavior data as the feature vector, together with the Internet surfing traffic (uplink and downlink traffic), and training on the data separately in time and in space with an unsupervised algorithm. FactorBill consists of user basic data and location data; see FIG. 10 and Tables 1, 2, and 3. In this embodiment, the training set obtained by training DataBill in space is denoted B1 (a 1×7 cell array), and the training set obtained by training in time is denoted B2 (a 1×22 cell array). The user basic data (the UserFactor attributes) include user ID, user home location, date of birth, occupation, and income. The user location data (the NaturalFactor attributes) include day of the week, weather, temperature, air quality, total regional economic output, and number of regional users. Correlation analysis is performed between B1 and B2 and the user basic data and location data. For nominal attributes such as day of the week, weather, and region, values are assigned by a difference method: in this embodiment, identical values are assigned 1 and different values are assigned 0.
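The value assignment for nominal attributes can be sketched as follows, following the convention stated in this embodiment (identical values give 1, different values give 0). The attribute names and sample records are hypothetical.

```python
from typing import Any, Dict, List


def nominal_match(a: Any, b: Any) -> int:
    """Return 1 when two nominal values are identical, otherwise 0."""
    return 1 if a == b else 0


def match_vector(rec_a: Dict[str, Any], rec_b: Dict[str, Any],
                 attributes: List[str]) -> List[int]:
    """Compare two records attribute by attribute on nominal fields."""
    return [nominal_match(rec_a[attr], rec_b[attr]) for attr in attributes]


# Hypothetical NaturalFactor records for two observations.
day_1 = {"week": "Monday", "weather": "rain", "region": "region A"}
day_2 = {"week": "Monday", "weather": "sunny", "region": "region A"}

matches = match_vector(day_1, day_2, ["week", "weather", "region"])
```

Numerical attributes such as temperature or income would need a separate distance or correlation measure, which the description does not specify.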

Table 1 data preprocessed space dimension training set B1 and time dimension training set B2

TABLE 2 UserFactor attribute for data preprocessing

Attribute name	User ID	User home location	Date of birth	Occupation	Income
Attribute type	Nominal	Nominal	Numerical	Nominal	Numerical

TABLE 3 NaturalFactor attributes for data preprocessing

Attribute name	Day of week	Weather	Temperature	Air quality	Region	Total regional economic output	Number of regional users
Attribute type	Nominal	Nominal	Numerical	Numerical	Nominal	Numerical	Numerical

Examples

In this embodiment, statistical information from big data on broadband access network user behavior in the Sichuan region is selected. The data set comprises 446.8 MB and 1606995 records, divided into 6 different types, and spans January 23, 2015 to April 10, 2017. Basic information on the different data types is given in Table 4, and FIG. 6 shows the proportion of each data type in the total.

Table 4 basic information of six types of data for data quality assessment

Next, the data are screened and pruned. As Table 5 shows, fields such as the number of users, online time, browsing content, search keywords, inbound (upload) bytes, outbound (download) bytes, inbound (upload) rate, outbound (download) rate, total rate, and access count are treated as analyzable data, while the date, user home location, date of birth, occupation, sampling interval, remote server IP, and bandwidth type are treated as classification data. The proportion of analyzable data differs between data types, as Table 6 shows. Because the analyzable data are what the analysis actually cares about, the amount of analyzable data indicates whether a data type is more important or of higher quality. The pie chart in FIG. 7 shows the share of each data type in all analyzable data, which makes it possible to compare which types are more valuable to analyze. Together with FIG. 6, it shows that the full-period traffic statistics, 100M user analysis, online user analysis, bandwidth-divided online user analysis, and peak traffic statistics types all contain a considerable amount of analyzable data, whereas the accumulated-users type contains very little. Measured by data volume, the accumulated-users data can be regarded as having no analytical value and are therefore removed.
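The screening idea can be sketched by computing, for each data type, the share of its fields that count as analyzable. This is only an illustration: the field identifiers and the two schemas are hypothetical, and measuring the proportion by field count (rather than by data volume) is an assumption, since the contents of Tables 5 and 6 are not reproduced in the text.

```python
from typing import Dict, List

# Field categories follow the description; the identifiers are hypothetical.
ANALYZABLE_FIELDS = {
    "user_count", "online_time", "browsing_content", "search_keywords",
    "inbound_bytes", "outbound_bytes", "inbound_rate", "outbound_rate",
    "total_rate", "access_count",
}


def analyzable_ratio(fields: List[str]) -> float:
    """Share of a data type's fields that belong to the analyzable category."""
    if not fields:
        return 0.0
    hits = sum(1 for field in fields if field in ANALYZABLE_FIELDS)
    return hits / len(fields)


# Hypothetical schemas for two data types.
schemas: Dict[str, List[str]] = {
    "peak_traffic": ["date", "sampling_interval", "inbound_rate",
                     "outbound_rate", "total_rate"],
    "accumulated_users": ["date", "bandwidth_type", "home_location",
                          "user_count"],
}
ratios = {name: analyzable_ratio(fields) for name, fields in schemas.items()}
```

A data type with a low ratio and a small record count would be a candidate for removal, as is done with the accumulated-users data.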

Table 5 structural list of different types of data for data quality assessment

TABLE 6 proportion of analysis class data of different types of data for data quality assessment

In this embodiment, the data are widely distributed in time (from January 23, 2015 to April 10, 2017). Because the data are discontinuous in time and not every time point has complete data (there are missing values), data from certain periods are selectively discarded to improve data quality. Table 7, FIG. 8, and FIG. 9 show the time coverage length of each data type. Inspection shows a gap of 247 days with no data, starting on July 30, 2015 and lasting into April 2016, and the data volume in 2016 and later is small; that is, the data after July 2015 are not rich enough. All data types in the data set show a clear drop in volume before July 30, 2015, the volume of total traffic statistics after April 2016 is clearly lower, and several values are clearly abnormal. Since the data after July 30, 2015 have poor continuity and poor quality, the overall data quality can be improved by discarding them.
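Finding the longest interruption in a series of observation dates and discarding everything after it can be sketched as follows. The dates are hypothetical and do not come from the data set described here; the cutoff rule (keep data up to the start of the longest gap) is an assumption that mirrors the decision taken in this embodiment.

```python
from datetime import date
from typing import List, Optional, Tuple


def longest_gap(days: List[date]) -> Tuple[int, Optional[date], Optional[date]]:
    """Return (gap length in days, last day before the gap, first day after it)."""
    ordered = sorted(set(days))
    best: Tuple[int, Optional[date], Optional[date]] = (0, None, None)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = (cur - prev).days
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


def keep_until(days: List[date], cutoff: date) -> List[date]:
    """Keep only the observations on or before the cutoff date."""
    return [d for d in days if d <= cutoff]


# Hypothetical observation dates with one long interruption.
observed = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3),
            date(2020, 1, 20), date(2020, 1, 21)]
gap_days, gap_start, gap_end = longest_gap(observed)
kept = keep_until(observed, gap_start)
```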

TABLE 7 time coverage length of various types of data for data quality assessment

Data type	Number of records	Type weight (min)	Time coverage length (min)
100M user analysis	73160	5	365800
Online user analysis	353007	5	1765035
Banded online user analysis	762022	5	3810110
Peak flow statistics	30710	5	153550
Full time period flow analysis	994230	5	4971150
Accumulated number of users	9552	35	334320
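The figures in Table 7 are consistent with the time coverage length being the number of records multiplied by the type weight. Reading the type weight as a per-record interval in minutes is an assumption; the check below only confirms the arithmetic on the tabulated values.

```python
# Rows of Table 7: (number of records, type weight in minutes, listed coverage in minutes).
TABLE_7 = {
    "100M user analysis": (73160, 5, 365800),
    "Online user analysis": (353007, 5, 1765035),
    "Banded online user analysis": (762022, 5, 3810110),
    "Peak flow statistics": (30710, 5, 153550),
    "Full time period flow analysis": (994230, 5, 4971150),
    "Accumulated number of users": (9552, 35, 334320),
}


def coverage_minutes(records: int, weight_min: int) -> int:
    """Time coverage length as records multiplied by the per-record weight."""
    return records * weight_min


# Compare the computed coverage with the value listed in the table.
consistent = {
    name: coverage_minutes(records, weight) == listed
    for name, (records, weight, listed) in TABLE_7.items()
}
```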

Through analysis and screening of the data, the study sample is finally determined to be the 100M users in 22 regions of Sichuan Province during the evening peak period (20:00 to 22:30) from January 23, 2015 to July 30, 2015.

In this embodiment, the screened data are divided into two groups, X|Y (X or Y) and X+Y (X and Y), where X represents the online time, browsing content, and search keywords in the user Internet surfing behavior data, and Y represents the Internet surfing traffic in that data. A clustering algorithm is applied to X+Y to obtain marks, and the marks are added to both data sets to obtain the marked X|Y data and the marked X+Y data. Association analysis is performed between the marked X+Y data and the user's basic data and location data to mine in depth the association between users' Internet surfing behavior and other data such as their business behavior and daily life. User groups are divided on this basis, which makes it convenient for enterprises to apply differentiated sales strategies to different user groups.

The marked X|Y data are used by the regression algorithm (supervised learning) to predict the traffic of each class of users. Specifically, suppose a region (denoted P1) is subdivided and marked by regional user segmentation, giving 3 user groups U1 to U3, that is, U(P1) = {U1, U2, U3}, where U1 = {U11, U12, U13, U14, U15}, U2 = {U21, U22, U23, U24, U25}, and U3 = {U31, U32, U33, U34, U35, U36}. Taking the 7 days of a week as an example, for each user group, machine learning is performed on the training samples from Monday to Saturday, and the regression algorithm then predicts the traffic trend for Sunday. Because users in the same group have similar surfing behavior features, their traffic trends are also similar, which greatly reduces the complexity of prediction. For each sub-region P11 to P15, the predicted traffic of the corresponding users is summed to obtain the total traffic trend of that sub-region. As shown in FIG. 4, comparing and analyzing the traffic trends of P11 to P15 provides a basis for allocating resources ahead of network development. Statistical analysis of the predicted traffic of P11 to P15 then yields the traffic prediction or trend for the whole P1 region; predicting the parts first and aggregating them into the whole in this way can effectively improve prediction accuracy.

Finally, from the marked X+Y data, the proportion of each user group in each sub-region is calculated and the user group with the largest proportion is identified. Servers are built to meet the needs of the user group with the largest proportion, which gives the distribution and number of regional servers. In FIG. 5, the proportion of the users in the P1 area is calculated as follows:

This embodiment introduces the concept of a "dictionary". In FIG. 5, the dictionary is generated from P11, P12, and P14, and P13 and P15 can be derived by "looking up the dictionary". The dictionary should be complete (it contains every user group in P1) and independent (the user groups it contains are independent of one another). According to the calculation, user group U1 accounts for the largest proportion in sub-region P11, so the server requirements of P11 are determined by what U1 needs in bandwidth, service, and so on, which in turn gives the required distribution and number of servers in P11. Similarly, P12 and P14 give the requirements of user groups U3 and U2, respectively, for the distribution and number of servers at those locations, which completes a dictionary containing all types of user groups in P1. The overlapping areas of the server distribution in P1 are then optimized to obtain the distribution and number of servers for the P1 region, providing a server construction scheme for the infrastructure department.
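The "dictionary" construction can be sketched as follows. The user counts per sub-region are hypothetical, and the selection rule (scan the sub-regions in order and keep those whose dominant user group is not yet in the dictionary) is an assumption about how completeness and independence might be enforced, not a procedure stated in the text.

```python
from typing import Dict, List


def group_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Proportion of each user group within one sub-region."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def dominant_group(counts: Dict[str, int]) -> str:
    """User group with the largest proportion in one sub-region."""
    return max(counts, key=lambda group: counts[group])


def build_dictionary(subregions: Dict[str, Dict[str, int]],
                     all_groups: List[str]) -> Dict[str, str]:
    """Map sub-regions to distinct dominant groups until every group is covered."""
    dictionary: Dict[str, str] = {}
    for name, counts in subregions.items():
        group = dominant_group(counts)
        if group not in dictionary.values():
            dictionary[name] = group
        if set(dictionary.values()) == set(all_groups):
            break
    return dictionary


# Hypothetical user counts per marked group in each sub-region of P1.
subregions = {
    "P11": {"U1": 5, "U2": 2, "U3": 1},
    "P12": {"U1": 1, "U2": 2, "U3": 6},
    "P13": {"U1": 4, "U2": 1, "U3": 2},
    "P14": {"U1": 1, "U2": 5, "U3": 1},
    "P15": {"U1": 2, "U2": 2, "U3": 5},
}
dictionary = build_dictionary(subregions, ["U1", "U2", "U3"])
shares_p11 = group_shares(subregions["P11"])
```

With these counts the dictionary is built from P11, P12, and P14, and the dominant groups of P13 and P15 are already covered, so their server requirements could be looked up rather than derived again.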

The above embodiments do not limit the protection scope of the invention, and those skilled in the art can make equivalent modifications and variations without departing from the whole inventive concept, and they still fall within the scope of the invention.

Claims

1. A modeling method for big data of broadband access network user surfing behavior is characterized by comprising the following steps:

S1, acquiring Internet surfing behavior data of broadband access network users, performing quality evaluation on the data, and screening out high-quality data;

S2, preprocessing the screened high-quality data, carrying out user region division and marking on the screened high-quality data by using an unsupervised algorithm, and searching for the association relation between the Internet surfing behavior of the user and each data field by using an association algorithm in unsupervised learning in combination with the basic data and the position data of the user;

dividing the screened data into two groups, namely X|Y, that is, X or Y, and X+Y, that is, X and Y, wherein X represents the online time, browsing content and search keywords in the user Internet surfing behavior data, and Y represents the Internet surfing traffic in the user Internet surfing behavior data, applying a clustering algorithm to X+Y to obtain marks, adding the marks to the two data sets respectively to obtain marked X|Y data and marked X+Y data, performing association analysis between the marked X+Y data and the basic data and the position data of the user, and mining in depth the association relation between the user Internet surfing behavior and different data in user business behavior and daily life, so that user groups are divided;

S3, predicting the time-traffic of the marked user groups by using a regression model in supervised learning to obtain the traffic trend of each user group, and obtaining the total traffic trend through statistical calculation;

S4, obtaining the distribution and the quantity of each marked user group through statistical calculation, so as to obtain the requirement characteristics of different user groups on the server;

in the step S1, according to basic information of the broadband access network user Internet surfing behavior data, including traffic size and data type, a chart is drawn, and unnecessary data are removed by analyzing the data through the chart, so as to obtain high-quality data;

in the step S2, the high-quality data are divided into two data packets, namely DataBill and FactorBill, wherein DataBill is two training sets formed by taking the Internet surfing behavior in the user Internet surfing behavior data as a feature vector, taking the Internet surfing traffic of the user Internet surfing behavior data, including uplink traffic and downlink traffic, and respectively performing training research on the data in time and space by using an unsupervised algorithm; FactorBill consists of user basic data and location data.