CN107086925B - Deep learning-based internet traffic big data analysis method

Info

Publication number: CN107086925B
Application number: CN201710132366.8A
Authority: CN
Inventors: 潘强
Original assignee: Zhuhai City Polytechnic
Current assignee: Zhuhai City Polytechnic
Priority date: 2017-03-07
Filing date: 2017-03-07
Publication date: 2020-04-07
Anticipated expiration: 2037-03-07
Also published as: CN107086925A

Abstract

The invention discloses a deep learning-based internet traffic big data analysis method, which comprises the following steps: acquiring original internet traffic monitoring data; filling the acquired internet traffic monitoring data using an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm with a threshold filling algorithm; classifying the filled data using a deep learning method based on an infinite deep neural network to obtain a classification result for internet websites; performing data mining according to the classification result; and recommending internet websites to the user according to the data mining result. The invention replaces the existing feedforward neural network with an infinite deep neural network that has feedback connections, so it can process dynamic data and has better real-time performance; and because the acquired internet traffic monitoring data are filled using the fused N-nearest-neighbor and threshold filling algorithm, the precision is high. The method can be widely applied in the field of data mining.

Description

Deep learning-based internet traffic big data analysis method

Technical Field

The invention relates to the field of data mining, in particular to an internet traffic big data analysis method based on deep learning.

Background

With the rapid development of information and communication technologies such as the internet, mobile intelligent terminals and the internet of things, and the continuous improvement of computer storage and computing capacity, the explosive growth and continuous acquisition of all kinds of data have become possible, and the era of big data has arrived. Compared with traditional data, the characteristics of big data are commonly summarized as the 5Vs: large volume (volume), high speed (velocity), multiple modalities (variety), difficulty of discernment (veracity), and high value with low density (value). How to analyze big data and fully mine its potential value has become a scientific problem that requires in-depth study.

In the internet field, network traffic monitoring is the most effective means of acquiring network traffic indexes and network user behavior parameters. As the number of internet users grows, the amount of internet data that needs to be studied and analyzed also grows, and how to mine traffic patterns and user behavior patterns from massive user traffic data (that is, how to analyze internet traffic big data) has become a technical problem that the industry urgently needs to solve.

A learning algorithm based on a deep neural network (a deep learning method for short) is well known in academia and industry as a successful big data analysis method. Compared with the traditional method, the deep learning method is driven by data, can automatically extract features (knowledge) from the data, and has remarkable advantages for analyzing unstructured, variable and cross-domain large data.

At present, the deep learning methods used in internet traffic big data analysis are based on feedforward neural networks. A feedforward neural network has no feedback connections between neurons in the same layer and no time-parameter attribute, so a deep learning method based on it is good only at processing static data; it cannot process dynamic (that is, time-related) data, has poor real-time performance, and cannot meet the increasingly high requirements placed on internet traffic big data analysis. In addition, in current internet traffic big data analysis, data monitored by network traffic monitoring equipment are lost because of various faults, and the resulting incompleteness of the monitored data seriously degrades the accuracy of the subsequent analysis.

Disclosure of Invention

To solve the above technical problems, the present invention aims to provide a deep learning-based internet traffic big data analysis method with good real-time performance and high precision.

The technical scheme adopted by the invention is as follows:

an internet traffic big data analysis method based on deep learning comprises the following steps:

acquiring original internet traffic monitoring data;

filling the acquired internet traffic monitoring data by adopting an incomplete data filling algorithm fusing an N-nearest neighbor filling algorithm and a threshold filling algorithm, wherein N is the total number of the set nearest neighbor data;

classifying the data after filling processing by adopting a deep learning method based on an infinite deep neural network to obtain a classification result of the internet website, wherein feedback connection exists between neurons in the same layer of the infinite deep neural network;

data mining is carried out according to the classification result of the internet website;

and recommending the internet website for the user according to the data mining result.

Further, the step of performing filling processing on the acquired internet traffic monitoring data by using an incomplete data filling algorithm fusing an N-neighbor filling algorithm and a threshold filling algorithm includes:

S1, performing noise cleaning on the acquired internet traffic monitoring data to obtain noise-cleaned data;

S2, dividing the noise-cleaned data into a complete data set C and an incomplete data set I according to whether each data item is complete, the data in C proceeding directly to step S5 and the data in I proceeding to step S3;

S3, for each data item i in I, searching C for its N nearest neighbors and judging whether the N data items most similar to i can be found in C; if so, filling i into complete data using the mean of the N nearest neighbors and then executing step S5; otherwise, executing step S4;

S4, calculating the sum D of the distances between the data item i in I and all the data in the complete data set C, and judging whether D is smaller than a set threshold Th; if so, filling i into complete data using the mean of all the data in C and then executing step S5; otherwise, deleting i from I;

S5, sequentially performing data integration, data conversion and data specification on the filled data, and storing the data obtained after data specification into the HDFS.
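Steps S2 to S4 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: records are lists of numbers with `None` marking a missing field, distance is Euclidean over the fields present in both records, and a hypothetical `radius` parameter decides whether a neighbor counts as "found", since the text does not say how that is judged.

```python
import math

def distance(a, b):
    """Euclidean distance over the fields present (not None) in both records."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

def column_means(rows):
    """Field-wise mean of a list of complete records."""
    return [sum(col) / len(col) for col in zip(*rows)]

def fill_incomplete(records, n, radius, th):
    """Return the complete set C followed by the records of I that could be filled.

    n      -- number of nearest neighbours required (N in the text)
    radius -- ASSUMPTION: similarity cutoff deciding whether a neighbour is "found"
    th     -- threshold Th on the summed distance D used in step S4
    """
    complete = [r for r in records if None not in r]    # set C (step S2)
    incomplete = [r for r in records if None in r]      # set I (step S2)
    if not complete:
        return []                                       # nothing to fill from
    filled = []
    for rec in incomplete:
        dists = [(distance(rec, c), c) for c in complete]
        near = sorted((p for p in dists if p[0] <= radius), key=lambda p: p[0])[:n]
        if len(near) == n:                              # step S3: mean of N neighbours
            source = column_means([c for _, c in near])
        elif sum(d for d, _ in dists) < th:             # step S4: mean of all of C
            source = column_means(complete)
        else:
            continue                                    # step S4: delete from I
        filled.append([s if v is None else v for v, s in zip(rec, source)])
    return complete + filled
```

With `n=2` and `radius=1.0`, a record such as `[1.1, None]` lying next to two complete records is filled from their mean, while a distant record is filled from the mean of C only if its summed distance stays below `th`.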

Further, the step of performing classification processing by using a deep learning method based on an infinite deep neural network according to the filled data to obtain a classification result of the internet website includes:

reading Internet flow records from the HDFS;

performing MapReduce parallel processing and data capture and analysis processing on the read Internet traffic records, and storing the analyzed webpage content into an HBase database;

directly identifying and classifying, by a library identification module, the URL of each record in the internet traffic records using a library-based identification method, wherein the library identification module updates and maintains a URL identification result table and a URL unidentified result table through library files;

using the web page content that remains unidentified after classification by the library identification module as a training set, and modeling and classifying with a deep learning method based on an infinite deep neural network, so as to complete automatic identification and classification of different types of internet websites;

and extracting the URL with correct classification based on the result of deep learning identification, and updating and expanding the library file in the library identification module.
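The interplay between the library identification module and the deep learning classifier can be sketched as below. The sketch assumes, purely for illustration, that the library file is a mapping from host name to category; the function names `identify` and `update_library` are hypothetical and do not come from the text.

```python
from urllib.parse import urlsplit

def identify(urls, library):
    """Split URLs into an identified table (url -> category) and an unidentified list.

    ASSUMPTION: the library is keyed by host name.
    """
    identified, unidentified = {}, []
    for url in urls:
        host = urlsplit(url).hostname
        if host in library:
            identified[url] = library[host]
        else:
            unidentified.append(url)
    return identified, unidentified

def update_library(library, confirmed):
    """Add correctly classified URLs (url -> category) back into the library."""
    for url, category in confirmed.items():
        library[urlsplit(url).hostname] = category
    return library
```

URLs left in the unidentified list would go to the deep learning classifier, and its confirmed results are fed back through `update_library`, so later lookups succeed without the classifier.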

Further, the step of performing MapReduce parallel processing and data capture and analysis processing on the read internet traffic records and storing the analyzed webpage content in the HBase database includes:

preprocessing the read internet flow records by a MapReduce program to obtain URL addresses capable of crawling web pages, wherein the preprocessing comprises URL combination, URL filtering and URL duplication removal;

crawling and parsing the URL addresses with a plurality of parallel web page crawling threads to obtain the content of three fields, namely website title, keywords and description, and storing the content of the three fields in an HBase database.
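The three preprocessing operations (URL combination, URL filtering and URL deduplication) can be illustrated with a single-process Python stand-in for the MapReduce program. The record layout `(host, path)`, the `http://` scheme and the filtering rule are assumptions made for this sketch only.

```python
from urllib.parse import urlsplit

def preprocess(records):
    """Combine Host and URL fields, drop malformed entries, remove duplicates.

    records -- iterable of (host, path) string pairs (ASSUMED layout)
    Returns crawlable URLs in first-seen order.
    """
    seen, result = set(), []
    for host, path in records:
        if not host or not path.startswith("/"):
            continue                                   # filtering: malformed record
        url = "http://" + host.strip().lower() + path  # combination (scheme assumed)
        if urlsplit(url).hostname is None:
            continue                                   # filtering: unparsable host
        if url not in seen:                            # deduplication
            seen.add(url)
            result.append(url)
    return result
```

In an actual MapReduce job the deduplication would typically fall out of grouping by key in the reduce phase; the set used here plays that role.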

Further, the step of modeling and classifying by using the unrecognized web page content after the library recognition and classification as a training set and adopting a deep learning method based on an infinite deep neural network comprises the following steps:

crawling the three fields (website title, keywords and description) of the web page content corresponding to each URL that the library identification module cannot identify;

and taking the three crawled fields as a training set, and performing training modeling and classification by adopting a BPTT deep learning algorithm or an RTRL deep learning algorithm to finish automatic identification and classification of different types of internet websites.

Further, the step of performing data mining according to the classification result of the internet website includes:

acquiring the number of visits to and the number of visitors of the internet website, the user traffic in each period of the day, and the type data of application-store websites;

analyzing user behavior characteristics from the acquired data to obtain the user behavior characteristics of the internet website, which comprise the current total number of users of the internet website, the average number of visits per user, the average traffic brought by each visit, and the period containing the current time;

predicting the total number of users, the average number of visits per user and the average flow brought by each visit of the internet website in the same period of the next day by adopting a triple moving average method according to the user behavior characteristics of the internet website;

and calculating the maximum access flow of the internet website in the same period of the next day according to the total number of users, the average access times per user and the average flow brought by each access of the internet website in the same period of the next day.

Further, the process of predicting by the triple moving average method comprises the following steps:

initializing and reading the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the period containing the current time of the current day;

calculating the first moving average of X_t within period t, the calculation formula of which is as follows:

(formula not reproduced in this text)

where t = [0.5T], [0.5T]+1, …, T, "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T;

calculating the second moving average of X_t within period t, the calculation formula of which is as follows:

(formula not reproduced in this text)

where t = [0.75T], [0.75T]+1, …, T, "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T;

calculating the third moving average of X_t for the T-th period, the calculation formula of which is as follows:

(formula not reproduced in this text)

where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, T-1 denotes the same period on the day before the current day, and T+1 denotes the same period on the day after the current day;

calculating the predicted value X_{T+1} of X_t for the same period on the day after the current day, the calculation formula of X_{T+1} being as follows:

(formula not reproduced in this text)

where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, and T+1 denotes the same period on the day after the current day;

ending and outputting the predicted value X_{T+1} of X_t for the same period on the day after the current day.

Further, the step of calculating the maximum access flow of the internet website in the same time period next day according to the total number of users of the internet website in the same time period next day, the average number of times of access per user, and the average flow brought by each access, specifically includes:

according to the predicted total number of users U_{T+1}, the average number of visits per user, and the average traffic brought by each visit in the same period of the next day, calculating the maximum access traffic FLOW_{T+1} in the same period of the next day, the calculation formula of FLOW_{T+1} being as follows:

(formula not reproduced in this text)

Further, the step of recommending internet websites for the user according to the result of the data mining specifically comprises:

judging, website by website, whether the maximum access traffic FLOW_{T+1} of the internet website in the same period of the next day is larger than the traffic threshold set by the user; if so, recommending the internet website to the user; otherwise, moving on to the next internet website, until all internet websites have been judged.

Alternatively, the step of recommending internet websites for the user according to the result of the data mining specifically comprises:

recommending internet websites for the user by a collaborative filtering method according to the result of the data mining, wherein the collaborative filtering method comprises a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm.

The invention has the following beneficial effects. The method comprises the steps of acquiring original internet traffic monitoring data, filling the acquired data using an incomplete-data filling algorithm that fuses the N-nearest-neighbor filling algorithm with the threshold filling algorithm, classifying the filled data using a deep learning method based on an infinite deep neural network, performing data mining according to the classification result of internet websites, and recommending internet websites to the user according to the data mining result. Because deep learning is performed with an infinite deep neural network that has feedback connections instead of the existing feedforward neural network, the method can process dynamic data, has good real-time performance, and meets the increasingly high requirements placed on internet traffic big data analysis. Because the acquired internet traffic monitoring data are filled using the fused N-nearest-neighbor and threshold filling algorithm, the amount of incomplete data is reduced and the accuracy of internet traffic big data analysis is improved. Further, based on the user behavior characteristics of an internet website, the relationship among the total number of users, the number of visits per user and the traffic brought by each visit is used as the basis for prediction, and the triple moving average method is applied to the same period of each day to predict the total number of users, the average number of visits per user and the average traffic per visit of the internet website in the same period of the next day, which effectively improves the rationality and accuracy of the prediction.

Drawings

Fig. 1 is an overall flowchart of an internet traffic big data analysis method based on deep learning according to the present invention.

Detailed Description

Referring to fig. 1, an internet traffic big data analysis method based on deep learning includes the following steps:

acquiring original internet traffic monitoring data;

Further as a preferred embodiment, the step of performing classification processing according to the filled data by using a deep learning method based on an infinite deep neural network to obtain a classification result of an internet website includes:

reading Internet flow records from the HDFS;

Further as a preferred embodiment, the step of performing MapReduce parallel processing and data capture and parsing processing on the read internet traffic records, and storing the parsed web page content in the HBase database includes:

Further as a preferred embodiment, the step of modeling and classifying by using the unrecognized web page content after the library recognition and classification as a training set and using a deep learning method based on an infinite deep neural network includes:

Further, as a preferred embodiment, the step of performing data mining according to the classification result of the internet website includes:

Further, as a preferred embodiment, the process of predicting by the triple moving average method includes:

calculating the first moving average of X_t within period t (formula not reproduced in this text);

calculating the second moving average of X_t within period t (formula not reproduced in this text);

calculating the third moving average of X_t for the T-th period (formula not reproduced in this text);

Further as a preferred embodiment, the step of calculating the maximum access traffic of the internet website in the same time period next day according to the total number of users of the internet website in the same time period next day, the average number of times of access per user, and the average traffic brought by each access includes:

Further, as a preferred embodiment, the step of recommending internet websites for the user according to the result of the data mining specifically includes:

The invention will be further explained and explained with reference to the drawings and the embodiments in the description.

Example one

The invention provides an internet flow big data analysis method based on deep learning, aiming at the problems of poor real-time performance and low precision in the prior art.

As shown in fig. 1, the internet traffic big data analysis method of the present invention specifically includes the following steps:

(I) Acquiring original internet traffic monitoring data.

The internet traffic monitoring data can be acquired by existing network traffic monitoring means or equipment.

(II) Filling the acquired internet traffic monitoring data using an incomplete-data filling algorithm that fuses the N-nearest-neighbor filling algorithm with the threshold filling algorithm.

This process can be further subdivided into the following steps:

Noise cleaning removes deviations, redundancy and random errors from the original internet traffic monitoring data; methods of noise cleaning include smoothing, deduplication and the like. Data integration mainly serves unified storage and management of the data; data conversion mainly normalizes and standardizes the data; and data specification mainly constrains the data in terms of dimensionality, values and labels, so as to improve the efficiency of subsequent data mining.

(III) Classifying the filled data using a deep learning method based on an infinite deep neural network to obtain a classification result for internet websites.

This process can be further subdivided into the following steps:

(1) Reading internet traffic records from the HDFS.

(2) Performing MapReduce parallel processing and data capture and analysis processing on the read internet traffic records, and storing the parsed web page content into an HBase database.

MapReduce parallel processing means that the read internet traffic records are preprocessed by a MapReduce program to obtain URL addresses whose web pages can be crawled. The preprocessing comprises URL combination, URL filtering and URL deduplication. Each URL is composed of a Host field and a URL field, so URL combination means combining the Host field with the URL field. URL filtering and URL deduplication delete wrong and repeated URLs and thus improve data processing efficiency.

Data capture and analysis processing means that a plurality of parallel web page crawling threads crawl and parse the URL addresses to obtain the content of three fields (website title, keywords and description), and the content of these three fields is stored in the HBase database. The website title, keywords and description are the core content of a web page, and to save storage space the invention crawls and parses only these three fields. Data capture and analysis processing can be implemented with the jsoup parser.
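Extraction of the three fields from one fetched page can be sketched as follows. The text names the jsoup parser (a Java library); the sketch below instead uses Python's standard `html.parser` module as a stand-in, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class ThreeFieldParser(HTMLParser):
    """Collect <title>, meta keywords and meta description from one HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data.strip()

def extract_fields(html):
    """Return a dict with the title, keywords and description of an HTML string."""
    parser = ThreeFieldParser()
    parser.feed(html)
    return parser.fields
```

Only these three short strings per page would be written to storage, which matches the stated aim of saving storage space.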

(3) The library identification module directly identifies and classifies the URL of each record in the internet traffic records using a library-based identification method.

The library identification module updates and maintains the URL identification result table and the URL unidentified result table through the library file. The module requires a library file to exist in advance: the library file is initially built by manual addition, and afterwards the correctly classified new URLs produced by deep learning identification and classification are used to update the original library file, so that it grows larger and more comprehensive.

(4) Using the web page content that remains unidentified after classification by the library identification module as a training set, modeling and classifying with a deep learning method based on an infinite deep neural network, so as to complete automatic identification and classification of different types of internet websites.

Using the unidentified web page content as a training set means crawling the URLs that the library identification module cannot identify and taking the three fields (website title, keywords and description) of the corresponding web page content as the training set.

Modeling and classifying with the deep learning method based on an infinite deep neural network means training and testing on the training set with that method to obtain a correct classification and recognition model and its parameters.

The deep learning method based on the infinite deep neural network can be implemented with the BPTT deep learning algorithm or the RTRL deep learning algorithm. The BPTT (Back-Propagation Through Time) algorithm is a backward-propagation algorithm capable of training an infinite deep neural network, proposed by Professor R. J. Williams of Northeastern University in the United States. The RTRL (Real-Time Recurrent Learning) algorithm is an algorithm that propagates "activity" information forward, proposed by Robinson and Fallside among others.
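To make the idea of back-propagation through time concrete, the following toy sketch applies it to a single recurrent unit h_t = tanh(w·x_t + u·h_{t-1}), where the term u·h_{t-1} plays the role of the feedback connection. The scalar model, the squared-error loss on the final state and all names are assumptions chosen for brevity; this is not the network of the invention.

```python
import math

def forward(w, u, xs):
    """Run h_t = tanh(w*x_t + u*h_{t-1}) with h_0 = 0 and return all states."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * x + u * hs[-1]))
    return hs

def loss(w, u, xs, y):
    """Squared error between the final state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grads(w, u, xs, y):
    """Gradients of the loss with respect to w and u, accumulated backwards in time."""
    hs = forward(w, u, xs)
    gw = gu = 0.0
    dh = hs[-1] - y                       # dLoss/dh_T
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh at step t
        gw += da * xs[t - 1]              # contribution of step t to dLoss/dw
        gu += da * hs[t - 1]              # contribution of step t to dLoss/du
        dh = da * u                       # pass the error back to h_{t-1}
    return gw, gu
```

The gradients can be checked against finite differences of `loss`, which is a convenient way to validate any hand-written BPTT routine.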

(5) Based on the result of deep learning identification and classification, extracting the correctly classified URLs to update and expand the library file in the library identification module.

(IV) Performing data mining according to the classification result of the internet websites.

One specific implementation of data mining comprises the following steps:

(1) Acquiring the number of visits to and the number of visitors of the internet website, the user traffic in each period of the day, and the type data of application-store websites.

The user traffic in each period of the day can be counted using 1 hour as the counting interval. The type of an application-store website may be the Apple app store, an Android app store, and so on.

(2) Analyzing user behavior characteristics from the acquired data to obtain the user behavior characteristics of the internet website, which comprise the current total number of users of the internet website, the average number of visits per user, the average traffic brought by each visit, and the period containing the current time.
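The three numeric characteristics can be derived per one-hour period from raw visit records, as in the sketch below. The record layout `(user_id, hour, bytes)` and the dictionary keys are assumptions made for illustration.

```python
from collections import defaultdict

def hourly_features(visits):
    """Aggregate visit records into per-hour behaviour features.

    visits -- iterable of (user_id, hour, nbytes) tuples (ASSUMED layout)
    """
    users = defaultdict(set)
    counts = defaultdict(int)
    traffic = defaultdict(int)
    for user, hour, nbytes in visits:
        users[hour].add(user)
        counts[hour] += 1
        traffic[hour] += nbytes
    return {
        hour: {
            "total_users": len(users[hour]),
            "visits_per_user": counts[hour] / len(users[hour]),
            "traffic_per_visit": traffic[hour] / counts[hour],
        }
        for hour in counts
    }
```

Each hour's triple of values is the kind of series X_t that the subsequent prediction step would consume.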

The user behavior feature analysis can be realized by adopting the existing feature analysis method, such as a clustering analysis algorithm and the like.

(3) Predicting the total number of users, the average number of visits per user and the average traffic brought by each visit of the internet website in the same period of the next day using the triple moving average method, according to the user behavior characteristics of the internet website.

The prediction process of the triple moving average method comprises the following steps:

1) Initializing and reading the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the period containing the current time of the current day.

Here X_t may be the current total number of users, the average number of visits per user, or the average traffic brought by each visit.

2) Calculating the first moving average of X_t within period t, the calculation formula of which is as follows:

(formula not reproduced in this text)

where t = [0.5T], [0.5T]+1, …, T, "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T.

3) Calculating the second moving average of X_t within period t, the calculation formula of which is as follows:

(formula not reproduced in this text)

where t = [0.75T], [0.75T]+1, …, T, "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T.

4) Calculating the third moving average of X_t for the T-th period, the calculation formula of which is as follows:

(formula not reproduced in this text)

where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, T-1 denotes the same period on the day before the current day, and T+1 denotes the same period on the day after the current day.

5) Calculating the predicted value X_{T+1} of X_t for the same period on the day after the current day, the calculation formula of X_{T+1} being as follows:

(formula not reproduced in this text)

where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, and T+1 denotes the same period on the day after the current day.

6) Ending and outputting the predicted value X_{T+1} of X_t for the same period on the day after the current day.
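The general idea of averaging a series, then averaging the averages, and then averaging once more can be sketched as follows. Because the formulas are not available in this text, the sketch uses an ordinary simple moving average with one caller-chosen `window` for all three passes; that window choice is an assumption and is not the windowing defined through [0.5T] and [0.75T] above, and no forecast formula is attempted. The helper `bracket` only illustrates the "[ ]" notation.

```python
import math

def bracket(x):
    """The "[x]" of the text: the smallest integer not less than x."""
    return math.ceil(x)

def moving_average(series, window):
    """Simple moving average; entry k averages series[k : k + window]."""
    return [sum(series[k:k + window]) / window
            for k in range(len(series) - window + 1)]

def nested_moving_averages(xs, window):
    """Return the latest first, second and third moving averages of xs.

    ASSUMPTION: the same window is reused for all three passes.
    """
    if window < 1 or len(xs) < 3 * window - 2:
        raise ValueError("need window >= 1 and at least 3*window - 2 observations")
    m1 = moving_average(xs, window)   # first moving average of the raw values
    m2 = moving_average(m1, window)   # second: moving average of m1
    m3 = moving_average(m2, window)   # third: moving average of m2
    return m1[-1], m2[-1], m3[-1]
```

For a series of 7 values and `window=3`, the three passes shrink the series to 5, 3 and 1 values respectively, which is why at least `3*window - 2` observations are required.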

(4) Calculating the maximum access traffic FLOW_{T+1} of the internet website in the same period of the next day from the predicted total number of users U_{T+1}, the average number of visits per user, and the average traffic brought by each visit in the same period of the next day.
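One natural way to combine the three predicted quantities is to multiply them, as in the sketch below. The product form is an assumption made for illustration, since the calculation formula itself is not available in this text; the function name is hypothetical.

```python
def max_access_flow(total_users, visits_per_user, traffic_per_visit):
    """Estimate next-day traffic for one period.

    ASSUMPTION: users x visits per user x traffic per visit.
    """
    return total_users * visits_per_user * traffic_per_visit
```

For example, 1000 predicted users making 2.5 visits each at 4.0 units of traffic per visit would give an estimate of 10000.0 units.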

(V) Recommending internet websites for the user according to the data mining result.

There are two methods for recommending internet websites for users: one is a recommendation method based on predicted maximum access traffic, and the other is a recommendation method based on collaborative filtering.

The recommendation method based on the predicted maximum access traffic proceeds as follows: for each internet website in turn, judging whether its maximum access traffic FLOW_{T+1} in the same period of the next day is larger than the traffic threshold set by the user; if so, recommending the internet website to the user; otherwise, moving on to the next internet website, until all internet websites have been judged.
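The website-by-website threshold test amounts to a simple filter, sketched below with a hypothetical mapping from website to predicted maximum access traffic.

```python
def recommend_by_flow(predicted_flow, threshold):
    """Return the websites whose predicted traffic exceeds the user's threshold.

    predicted_flow -- dict mapping website -> predicted FLOW value (ASSUMED layout)
    Websites are returned in the order in which they are judged.
    """
    return [site for site, flow in predicted_flow.items() if flow > threshold]
```

A website whose predicted traffic exactly equals the threshold is not recommended, following the "larger than" wording.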

The recommendation method based on collaborative filtering makes recommendations by computing similar users or similar items. Accordingly, collaborative filtering can be divided into user-based collaborative filtering algorithms and item-based collaborative filtering algorithms, and the user can choose between them flexibly according to actual needs.
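A minimal user-based collaborative filtering sketch is given below. It assumes visit counts per website as the preference signal and cosine similarity between users; both choices, and all names, are illustrative assumptions rather than details given in the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors (website -> visit count)."""
    dot = sum(a[s] * b[s] for s in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend_user_based(target, visits, k=2, top=3):
    """Rank websites the target has not visited by similarity-weighted visits.

    visits -- dict mapping user -> {website: visit count}
    k      -- number of most similar users to consult
    top    -- maximum number of websites to return
    """
    sims = sorted(
        ((cosine(visits[target], v), u) for u, v in visits.items() if u != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in sims:
        for site, count in visits[user].items():
            if site not in visits[target]:
                scores[site] = scores.get(site, 0.0) + sim * count
    return sorted(scores, key=lambda s: (-scores[s], s))[:top]
```

An item-based variant would apply the same similarity to the transposed data (website -> {user: visit count}) and recommend websites similar to those the target already visits.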

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A deep learning-based internet traffic big data analysis method, characterized by comprising the following steps:

acquiring original internet traffic monitoring data;

recommending internet websites for the user according to the data mining result;

the step of filling the acquired internet traffic monitoring data by adopting an incomplete data filling algorithm fusing an N-neighbor filling algorithm and a threshold filling algorithm comprises the following steps:

s5, sequentially performing data integration, data conversion and data specification on the filled data, and storing the data subjected to data specification processing into the HDFS;

the step of obtaining the classification result of the internet website by classifying the data after the filling processing by adopting a deep learning method based on an infinite deep neural network comprises the following steps:

reading Internet flow records from the HDFS;

based on the result of deep learning identification, extracting the URL of correct classification, and updating and expanding the library file in the library identification module;

the step of performing data mining according to the classification result of the internet website includes:

2. The deep learning-based internet traffic big data analysis method according to claim 1, characterized in that: the step of performing MapReduce parallel processing and data capture and analysis processing on the read Internet traffic records and storing the analyzed webpage content into an HBase database comprises the following steps:

3. The deep learning-based internet traffic big data analysis method according to claim 2, characterized in that: the step of using the web page content that remains unidentified after library identification and classification as a training set and modeling and classifying with a deep learning method based on an infinite deep neural network comprises the following steps:

4. The deep learning-based internet traffic big data analysis method according to claim 3, characterized in that: the process of predicting by the triple moving average method comprises the following steps:

calculating the first moving average of X_t within period t (formula not reproduced in this text);

calculating the second moving average of X_t within period t (formula not reproduced in this text);

calculating the third moving average of X_t for the T-th period (formula not reproduced in this text);

5. The deep learning-based internet traffic big data analysis method according to claim 4, characterized in that: the step of calculating the maximum access traffic of the internet website in the same period of the next day according to the total number of users, the average number of visits per user and the average traffic brought by each visit of the internet website in the same period of the next day specifically comprises the following steps:

according to the predicted total number of users U_{T+1}, the average number of visits per user, and the average traffic brought by each visit in the same period of the next day, calculating the maximum access traffic FLOW_{T+1} in the same period of the next day.

6. The deep learning-based internet traffic big data analysis method according to claim 5, wherein: the step of recommending the internet website for the user according to the data mining result specifically comprises the following steps:

7. The deep learning-based internet traffic big data analysis method according to any one of claims 1 to 3, characterized in that: the step of recommending the internet website for the user according to the data mining result specifically comprises the following steps: