CN107086925B - Deep learning-based internet traffic big data analysis method - Google Patents

Info

Publication number: CN107086925B
Authority: CN (China)
Prior art keywords: data, internet, deep learning, website, user
Legal status: Active (an assumption, not a legal conclusion)
Application number: CN201710132366.8A
Other languages: Chinese (zh)
Other versions: CN107086925A
Inventor: 潘强
Current Assignee: Zhuhai City Polytechnic
Original Assignee: Zhuhai City Polytechnic

Classifications

    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06N3/08 Learning methods
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H04L41/147 Network analysis or design for predicting network behaviour
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Environmental & Geological Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep learning-based internet traffic big data analysis method comprising the following steps: acquiring original internet traffic monitoring data; filling the acquired internet traffic monitoring data using an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm with a threshold filling algorithm; classifying the filled data using a deep learning method based on an infinite deep neural network to obtain a classification result for internet websites; performing data mining according to the classification result; and recommending internet websites to the user according to the data mining result. The invention replaces the existing feedforward neural network with an infinite deep neural network that has feedback connections, so it can process dynamic data and offers better real-time performance; filling the acquired internet traffic monitoring data with the fused N-nearest-neighbor and threshold filling algorithm yields high accuracy. The method can be widely applied in the field of data mining.

Description

Deep learning-based internet traffic big data analysis method
Technical Field
The invention relates to the field of data mining, in particular to an internet traffic big data analysis method based on deep learning.
Background
With the rapid development of information and communication technologies such as the internet, mobile intelligent terminals and the internet of things, and with the continuous improvement of computer storage and computing capacity, the explosive growth and continuous acquisition of all kinds of data have become possible, and the era of big data has arrived. Compared with traditional data, the characteristics of big data are commonly summarized as the 5 Vs: large volume (volume), high speed (velocity), many modalities (variety), uncertain reliability (veracity), and high value at low density (value). How to analyze big data and fully mine its potential value has become a scientific problem that requires in-depth study.
In the internet field, network traffic monitoring is the most effective means of acquiring network traffic indicators and network user behavior parameters. As the number of internet users grows, the amount of data to be studied and analyzed also grows, and how to extract traffic patterns and user behavior patterns from massive user traffic data (that is, how to analyze internet traffic big data) has become a technical problem that the industry urgently needs to solve.
Learning algorithms based on deep neural networks (deep learning methods for short) are recognized in academia and industry as a successful approach to big data analysis. Compared with traditional methods, deep learning is data-driven, can automatically extract features (knowledge) from data, and has significant advantages for analyzing unstructured, changeable and cross-domain big data.
At present, the deep learning methods used in internet traffic big data analysis are based on feedforward neural networks. A feedforward neural network has no feedback connections between neurons in the same layer and no time parameter, so deep learning based on it handles static data well but cannot process dynamic data (that is, time-dependent data), has poor real-time performance, and cannot meet the increasingly demanding requirements of internet traffic big data analysis. In addition, in current internet traffic big data analysis, data collected by network traffic monitoring equipment is lost because of various faults, and the resulting incomplete monitoring data seriously degrades the accuracy of the subsequent analysis.
Disclosure of Invention
To solve the above technical problems, the present invention aims to provide a deep learning-based internet traffic big data analysis method with good real-time performance and high accuracy.
The technical scheme adopted by the invention is as follows:
An internet traffic big data analysis method based on deep learning comprises the following steps:
acquiring original internet traffic monitoring data;
filling the acquired internet traffic monitoring data using an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm with a threshold filling algorithm, where N is the preset number of nearest neighbors;
classifying the filled data using a deep learning method based on an infinite deep neural network to obtain a classification result for internet websites, where feedback connections exist between neurons in the same layer of the infinite deep neural network;
performing data mining according to the classification result for internet websites;
and recommending internet websites to the user according to the data mining result.
Further, the step of filling the acquired internet traffic monitoring data using the incomplete-data filling algorithm that fuses the N-nearest-neighbor filling algorithm with the threshold filling algorithm includes:
S1, performing noise cleaning on the acquired internet traffic monitoring data to obtain noise-cleaned data;
S2, dividing the noise-cleaned data into a complete data set C and an incomplete data set I according to whether each record is complete; data in C proceeds directly to step S5, and data in I proceeds to step S3;
S3, for each data item i in I, searching C for its N nearest neighbors and judging whether the N records most similar to i can be found in C; if so, filling i with the mean of those N neighbors to make it complete and then executing step S5; otherwise, executing step S4;
S4, calculating the sum D of the distances between data item i and all data in the complete data set C, and judging whether D is smaller than a set threshold Th; if so, filling i with the mean of all data in C to make it complete and then executing step S5; otherwise, deleting i from I;
S5, sequentially performing data integration, data transformation and data reduction on the filled data, and storing the reduced data in HDFS.
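Steps S2 to S4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are lists of numbers with `None` marking a missing field, distance is Euclidean over the fields present in the incomplete record, and a neighbor only counts as "found" when it lies within a hypothetical `radius`. The text does not define the similarity test, so `radius`, `impute` and the helper names are all assumptions.

```python
import math

def distance(a, b):
    """Euclidean distance over the fields that are present (not None) in `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))

def column_mean(rows, j):
    return sum(r[j] for r in rows) / len(rows)

def fill_from(record, donors):
    """Replace each missing field with the mean of that field over `donors`."""
    return [column_mean(donors, j) if v is None else v
            for j, v in enumerate(record)]

def impute(records, n, radius, th):
    """Hybrid filling: N-nearest-neighbor first (S3), threshold fallback (S4)."""
    complete = [r for r in records if None not in r]      # set C (S2)
    incomplete = [r for r in records if None in r]        # set I (S2)
    filled = list(complete)
    for rec in incomplete:
        ranked = sorted(complete, key=lambda c: distance(rec, c))
        # Assumption: a neighbor qualifies only if it is within `radius`.
        near = [c for c in ranked[:n] if distance(rec, c) <= radius]
        if len(near) == n:                                # S3: N neighbors found
            filled.append(fill_from(rec, near))
        elif complete and sum(distance(rec, c) for c in complete) < th:
            filled.append(fill_from(rec, complete))       # S4: D < Th
        # otherwise the record is deleted from I (S4, else branch)
    return filled
```

With `n=2`, a record whose two closest complete records are both within `radius` takes their mean; a record far from everything in C is dropped rather than filled with a misleading global mean.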
Further, the step of classifying the filled data using the deep learning method based on the infinite deep neural network to obtain the classification result for internet websites includes:
reading internet traffic records from HDFS;
performing MapReduce parallel processing and data crawling and parsing on the read internet traffic records, and storing the parsed web page content in an HBase database;
having a library identification module directly identify and classify the URLs of all records in the internet traffic records using a library-based identification method, where the library identification module updates and maintains a table of identified URLs and a table of unidentified URLs through library files;
using the web page content that the library identification module could not identify as a training set, and modeling and classifying it with the deep learning method based on the infinite deep neural network, so as to complete automatic identification and classification of different types of internet websites;
and, based on the deep learning identification results, extracting the correctly classified URLs and using them to update and expand the library files in the library identification module.
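The interplay between the library identification module and the deep learning classifier can be sketched as below. The class `LibraryIdentifier`, its method names and the choice of keying the library by host name are hypothetical; the text only states that the module keeps a table of identified URLs and a table of unidentified URLs and that correctly classified URLs are fed back into the library files.

```python
from urllib.parse import urlsplit

class LibraryIdentifier:
    """Hypothetical stand-in for the library identification module."""

    def __init__(self, library):
        self.library = dict(library)   # host -> category (the "library file")
        self.identified = {}           # URL -> category
        self.unidentified = []         # URLs left for the deep learning step

    def classify(self, urls):
        """Look each URL up in the library; unknown ones are set aside."""
        for url in urls:
            host = urlsplit(url).hostname or ""
            if host in self.library:
                self.identified[url] = self.library[host]
            else:
                self.unidentified.append(url)

    def absorb(self, verified):
        """Feed back URL -> category pairs confirmed after deep learning."""
        for url, category in verified.items():
            host = urlsplit(url).hostname or ""
            self.library[host] = category          # expand the library
            self.identified[url] = category
            if url in self.unidentified:
                self.unidentified.remove(url)
```

Each round of feedback enlarges the library, so fewer URLs need the more expensive deep learning step on the next pass.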
Further, the step of performing MapReduce parallel processing and data crawling and parsing on the read internet traffic records and storing the parsed web page content in the HBase database includes:
preprocessing the read internet traffic records with a MapReduce program to obtain URL addresses of crawlable web pages, where the preprocessing comprises URL combination, URL filtering and URL deduplication;
crawling and parsing the URL addresses with multiple parallel web crawling threads to obtain the content of three fields (website title, keywords and description), and storing the content of the three fields in the HBase database.
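The preprocessing and field extraction can be illustrated with the single-process Python sketch below. It does not use MapReduce, crawling threads or HBase. It assumes that "URL combination" means joining the host and path fields of a traffic record and that filtering drops static resources; both are guesses, since the text does not define them. The names `preprocess`, `FieldParser` and `extract_fields` are hypothetical.

```python
from html.parser import HTMLParser

def preprocess(records):
    """records: iterable of (host, path) pairs taken from traffic logs."""
    seen, urls = set(), []
    for host, path in records:
        url = "http://" + host.lower() + path              # combination (assumed)
        if path.lower().endswith((".jpg", ".png", ".css", ".js")):
            continue                                       # filtering (assumed rule)
        if url not in seen:                                # deduplication
            seen.add(url)
            urls.append(url)
    return urls

class FieldParser(HTMLParser):
    """Collects <title> and the keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("keywords", "description"):
            self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data

def extract_fields(html):
    """Return the three fields of one already-downloaded page."""
    parser = FieldParser()
    parser.feed(html)
    return parser.fields
```

In a deployment along the lines described above, `preprocess` would correspond to the map and reduce stages and `extract_fields` would run inside each crawling thread before the result is written to storage.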
Further, the step of using the web page content that the library identification module could not identify as a training set, and modeling and classifying it with the deep learning method based on the infinite deep neural network, includes:
crawling the three fields (website title, keywords and description) from the web page content corresponding to each URL that the library identification module cannot identify;
and using the three crawled fields as a training set, and performing training, modeling and classification with the BPTT (backpropagation through time) or RTRL (real-time recurrent learning) deep learning algorithm to complete automatic identification and classification of different types of internet websites.
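To show what BPTT computes, the sketch below applies it to a deliberately tiny model: one recurrent unit with a single input weight `w` and a single feedback weight `u`, trained against a squared error on the final state. This toy model, the loss and all names are assumptions chosen for illustration; the network architecture used by the method is not specified in this text.

```python
import math

def forward(w, u, xs):
    """Run h_t = tanh(w * x_t + u * h_{t-1}) with h_0 = 0; return all states."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * x + u * hs[-1]))
    return hs

def loss(w, u, xs, y):
    """Squared error between the final state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt(w, u, xs, y):
    """Gradients of the loss w.r.t. w and u, accumulated backwards in time."""
    hs = forward(w, u, xs)
    grad_w = grad_u = 0.0
    dh = hs[-1] - y                      # dL/dh_T
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)     # back through tanh at step t
        grad_w += da * xs[t - 1]         # x_t feeds step t
        grad_u += da * hs[t - 1]         # h_{t-1} feeds step t via feedback
        dh = da * u                      # pass the error on to h_{t-1}
    return grad_w, grad_u
```

The feedback weight `u` is what makes the error at the last step depend on every earlier input, which is the property a purely feedforward network lacks.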
Further, the step of performing data mining according to the classification result for internet websites includes:
acquiring, for each internet website, the number of visits, the number of visitors, the users' traffic in each time period of the day, and the type data of application store websites;
analyzing the acquired data to obtain the user behavior characteristics of the internet website, which comprise the website's current total number of users, the average number of visits per user, the average traffic generated per visit, and the time period in which the current time falls;
predicting, with a triple moving average method and according to the user behavior characteristics, the website's total number of users, average number of visits per user and average traffic per visit in the same time period of the next day;
and calculating the website's maximum access traffic in the same time period of the next day from the predicted total number of users, average number of visits per user and average traffic per visit.
Further, the prediction process of the triple moving average method comprises the following steps:
initializing and reading the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the time period of the current time on the current day;
calculating the first moving average of X_t in period t, for t = [0.5T], [0.5T]+1, …, T, where "[ ]" is a rounding symbol and "[0.5T]" denotes the smallest integer not less than 0.5T (the formula is an image that is not reproduced in this text);
calculating the second moving average of X_t in period t, for t = [0.75T], [0.75T]+1, …, T, where "[ ]" is the same rounding symbol (the formula is an image that is not reproduced in this text);
calculating the third moving average of X_t for the T-th period, where T-1 denotes the same period on the day before the current day and T+1 denotes the same period on the day after the current day (the formula is an image that is not reproduced in this text);
calculating the predicted value X_{T+1} of X_t for the same period on the day after the current day (the formula is an image that is not reproduced in this text);
ending, and outputting the predicted value X_{T+1} of X_t for the same period on the day after the current day.
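The formulas for this step are not available in this text, so the sketch below only illustrates the general idea of smoothing a series three times in succession. It uses a plain simple moving average with one fixed, hypothetical `window`; it does not reproduce the [0.5T] and [0.75T] index ranges described above, and it stops short of the forecast X_{T+1}, whose formula is unknown here.

```python
def moving_average(series, window):
    """Simple moving average; output i covers series[i : i + window]."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def triple_moving_average(series, window):
    """Apply the moving average three times, each pass on the previous result."""
    first = moving_average(series, window)
    second = moving_average(first, window)
    third = moving_average(second, window)
    return first, second, third
```

Each pass shortens the series by `window - 1` values, which is one reason a method of this kind needs enough history before it can produce a forecast.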
Further, the step of calculating the website's maximum access traffic in the same time period of the next day from the total number of users, the average number of visits per user and the average traffic per visit in that period specifically includes:
calculating the maximum access traffic FLOW_{T+1} in the same time period of the next day from the predicted total number of users U_{T+1}, the predicted average number of visits per user and the predicted average traffic per visit for that period (the symbols for the two averages and the formula for FLOW_{T+1} are images that are not reproduced in this text).
Further, the step of recommending internet websites to the user according to the data mining result specifically comprises:
judging, website by website, whether the maximum access traffic FLOW_{T+1} in the same time period of the next day is larger than the traffic threshold set by the user; if so, recommending the internet website to the user; otherwise, moving on to the next internet website, until all internet websites have been judged.
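The threshold-based recommendation can be sketched as follows. Because the formula for the maximum access traffic is not reproduced in this text, the sketch assumes it is the product of the three predicted quantities (users × visits per user × traffic per visit); that product, and the names `max_access_traffic` and `recommend`, are assumptions rather than the stated formula.

```python
def max_access_traffic(users, visits_per_user, traffic_per_visit):
    # Assumed formula: the product of the three predicted quantities.
    return users * visits_per_user * traffic_per_visit

def recommend(predictions, threshold):
    """predictions: site -> (users, visits per user, traffic per visit).

    Returns the sites whose predicted maximum traffic exceeds `threshold`,
    checked one website at a time.
    """
    return [site for site, p in predictions.items()
            if max_access_traffic(*p) > threshold]
```

For example, a site predicted to have 100 users, 2 visits per user and 1.5 units of traffic per visit yields 300 units under this assumption and is recommended for any threshold below that value.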
Further, the step of recommending internet websites to the user according to the data mining result specifically comprises:
recommending internet websites to the user with a collaborative filtering method according to the data mining result, where the collaborative filtering method comprises a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm.
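A user-based collaborative filtering step might look like the sketch below. It is a generic textbook variant, not necessarily the one intended here: users are represented by hypothetical site-to-visit-count dictionaries, similarity is cosine similarity, and unseen sites are scored by the similarity-weighted visit counts of other users.

```python
import math

def cosine(a, b):
    """Cosine similarity between two site -> visit-count dicts."""
    dot = sum(a[s] * b[s] for s in a if s in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend_sites(visits, user, k=2):
    """Score sites the user has not visited by similar users' visit counts."""
    scores = {}
    for other, sites in visits.items():
        if other == user:
            continue
        sim = cosine(visits[user], sites)
        for site, count in sites.items():
            if site not in visits[user]:
                scores[site] = scores.get(site, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

An item-based variant would compute the same kind of similarity between sites instead of between users and recommend sites similar to those the user already visits.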
The invention has the following beneficial effects. The method acquires original internet traffic monitoring data, fills it using an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm with a threshold filling algorithm, classifies the filled data using a deep learning method based on an infinite deep neural network, performs data mining according to the classification result for internet websites, and recommends internet websites to the user according to the data mining result. Because deep learning is performed with an infinite deep neural network that has feedback connections, in place of the existing feedforward neural network, the method can process dynamic data, has good real-time performance, and meets the increasingly demanding requirements of internet traffic big data analysis. Filling the acquired internet traffic monitoring data with the fused N-nearest-neighbor and threshold filling algorithm reduces the amount of incomplete data and improves the accuracy of internet traffic big data analysis. Further, based on the user behavior characteristics of each internet website, the relation among the total number of users, the number of visits per user and the traffic generated per visit serves as the prediction basis: a triple moving average over the same time period of each day predicts the website's total number of users, average number of visits per user and average traffic per visit in the same time period of the next day, which effectively improves the reasonableness and accuracy of the prediction.
Drawings
Fig. 1 is an overall flowchart of an internet traffic big data analysis method based on deep learning according to the present invention.
Detailed Description
Referring to fig. 1, an internet traffic big data analysis method based on deep learning includes the following steps:
acquiring original internet traffic monitoring data;
filling the acquired internet traffic monitoring data using an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm with a threshold filling algorithm, where N is the preset number of nearest neighbors;
classifying the filled data using a deep learning method based on an infinite deep neural network to obtain a classification result for internet websites, where feedback connections exist between neurons in the same layer of the infinite deep neural network;
performing data mining according to the classification result for internet websites;
and recommending internet websites to the user according to the data mining result.
Further, the step of filling the acquired internet traffic monitoring data using the incomplete-data filling algorithm that fuses the N-nearest-neighbor filling algorithm with the threshold filling algorithm includes:
S1, performing noise cleaning on the acquired internet traffic monitoring data to obtain noise-cleaned data;
S2, dividing the noise-cleaned data into a complete data set C and an incomplete data set I according to whether each record is complete; data in C proceeds directly to step S5, and data in I proceeds to step S3;
S3, for each data item i in I, searching C for its N nearest neighbors and judging whether the N records most similar to i can be found in C; if so, filling i with the mean of those N neighbors to make it complete and then executing step S5; otherwise, executing step S4;
S4, calculating the sum D of the distances between data item i and all data in the complete data set C, and judging whether D is smaller than a set threshold Th; if so, filling i with the mean of all data in C to make it complete and then executing step S5; otherwise, deleting i from I;
S5, sequentially performing data integration, data transformation and data reduction on the filled data, and storing the reduced data in HDFS.
Further, as a preferred embodiment, the step of classifying the filled data using the deep learning method based on the infinite deep neural network to obtain the classification result for internet websites includes:
reading internet traffic records from HDFS;
performing MapReduce parallel processing and data crawling and parsing on the read internet traffic records, and storing the parsed web page content in an HBase database;
having a library identification module directly identify and classify the URLs of all records in the internet traffic records using a library-based identification method, where the library identification module updates and maintains a table of identified URLs and a table of unidentified URLs through library files;
using the web page content that the library identification module could not identify as a training set, and modeling and classifying it with the deep learning method based on the infinite deep neural network, so as to complete automatic identification and classification of different types of internet websites;
and, based on the deep learning identification results, extracting the correctly classified URLs and using them to update and expand the library files in the library identification module.
Further, as a preferred embodiment, the step of performing MapReduce parallel processing and data crawling and parsing on the read internet traffic records and storing the parsed web page content in the HBase database includes:
preprocessing the read internet traffic records with a MapReduce program to obtain URL addresses of crawlable web pages, where the preprocessing comprises URL combination, URL filtering and URL deduplication;
crawling and parsing the URL addresses with multiple parallel web crawling threads to obtain the content of three fields (website title, keywords and description), and storing the content of the three fields in the HBase database.
Further, as a preferred embodiment, the step of using the web page content that the library identification module could not identify as a training set, and modeling and classifying it with the deep learning method based on the infinite deep neural network, includes:
crawling the three fields (website title, keywords and description) from the web page content corresponding to each URL that the library identification module cannot identify;
and using the three crawled fields as a training set, and performing training, modeling and classification with the BPTT (backpropagation through time) or RTRL (real-time recurrent learning) deep learning algorithm to complete automatic identification and classification of different types of internet websites.
Further, as a preferred embodiment, the step of performing data mining according to the classification result for internet websites includes:
acquiring, for each internet website, the number of visits, the number of visitors, the users' traffic in each time period of the day, and the type data of application store websites;
analyzing the acquired data to obtain the user behavior characteristics of the internet website, which comprise the website's current total number of users, the average number of visits per user, the average traffic generated per visit, and the time period in which the current time falls;
predicting, with a triple moving average method and according to the user behavior characteristics, the website's total number of users, average number of visits per user and average traffic per visit in the same time period of the next day;
and calculating the website's maximum access traffic in the same time period of the next day from the predicted total number of users, average number of visits per user and average traffic per visit.
Further, as a preferred embodiment, the prediction process of the triple moving average method includes:
initializing and reading the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the time period of the current time on the current day;
calculating the first moving average of X_t in period t, for t = [0.5T], [0.5T]+1, …, T, where "[ ]" is a rounding symbol and "[0.5T]" denotes the smallest integer not less than 0.5T (the formula is an image that is not reproduced in this text);
calculating the second moving average of X_t in period t, for t = [0.75T], [0.75T]+1, …, T, where "[ ]" is the same rounding symbol (the formula is an image that is not reproduced in this text);
calculating the third moving average of X_t for the T-th period, where T-1 denotes the same period on the day before the current day and T+1 denotes the same period on the day after the current day (the formula is an image that is not reproduced in this text);
calculating the predicted value X_{T+1} of X_t for the same period on the day after the current day (the formula is an image that is not reproduced in this text);
ending, and outputting the predicted value X_{T+1} of X_t for the same period on the day after the current day.
Further, as a preferred embodiment, the step of calculating the website's maximum access traffic in the same time period of the next day from the total number of users, the average number of visits per user and the average traffic per visit in that period includes:
calculating the maximum access traffic FLOW_{T+1} in the same time period of the next day from the predicted total number of users U_{T+1}, the predicted average number of visits per user and the predicted average traffic per visit for that period (the symbols for the two averages and the formula for FLOW_{T+1} are images that are not reproduced in this text).
Further, as a preferred embodiment, the step of recommending internet websites to the user according to the data mining result specifically includes:
judging, website by website, whether the maximum access traffic FLOW_{T+1} in the same time period of the next day is larger than the traffic threshold set by the user; if so, recommending the internet website to the user; otherwise, moving on to the next internet website, until all internet websites have been judged.
Further, as a preferred embodiment, the step of recommending internet websites to the user according to the data mining result specifically includes:
recommending internet websites to the user with a collaborative filtering method according to the data mining result, where the collaborative filtering method comprises a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm.
The invention is further explained and illustrated below with reference to the drawings and the embodiments in the description.
Example one
To address the poor real-time performance and low accuracy of the prior art, the invention provides a deep learning-based internet traffic big data analysis method.
As shown in fig. 1, the internet traffic big data analysis method of the present invention specifically includes the following steps:
(I) Acquiring the original internet traffic monitoring data.
The internet traffic monitoring data can be acquired with existing network traffic monitoring means or equipment.
(II) Filling the acquired internet traffic monitoring data with an incomplete-data filling algorithm that fuses an N-nearest-neighbor filling algorithm and a threshold filling algorithm.
This process can be further subdivided into the following steps:
S1, performing noise cleaning on the acquired internet traffic monitoring data to obtain noise-cleaned data;
S2, dividing the noise-cleaned data into a complete data set C and an incomplete data set I according to whether each datum is complete; data in C proceed directly to step S5, and data in I proceed to step S3;
S3, for a datum i in I, searching C for its N nearest neighbors and determining whether the N data most similar to i can be found in C; if so, filling i into complete data using the mean of the N nearest neighbors and then executing step S5; otherwise executing step S4;
S4, calculating the sum D of the distances between the datum i in I and all data in the complete data set C, and determining whether D is smaller than a set threshold Th; if so, filling i into complete data using the mean of all data in C and then executing step S5; otherwise deleting i from I;
S5, sequentially performing data integration, data conversion and data reduction on the filled data, and storing the reduced data in HDFS.
Here, noise cleaning removes deviations, redundancy and random errors from the original internet traffic monitoring data; its methods include smoothing, deduplication and the like. Data integration mainly serves the unified storage and management of the data; data conversion mainly normalizes and standardizes the data; data reduction mainly constrains the data in terms of dimensionality, values, labels and so on, to improve the efficiency of subsequent data mining.
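Steps S2 to S4 can be illustrated with a minimal Python sketch. The text does not state how to decide whether N sufficiently similar neighbors "can be found", so the sketch assumes a hypothetical cut-off `radius`; the record layout, the Euclidean distance over the known fields and all names are likewise assumptions made for illustration only.

```python
import math

def distance(a, b, keys):
    """Euclidean distance over the fields listed in keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def fill_incomplete(complete, incomplete, n, radius, th):
    """Fill records with missing (None) fields, following steps S2-S4.

    complete   -- list of dicts with no None values (set C)
    incomplete -- list of dicts containing None values (set I)
    n          -- number of nearest neighbours (N)
    radius     -- hypothetical cut-off: a neighbour counts as "similar"
                  only if its distance is at most this value
    th         -- threshold Th on the summed distance D
    """
    filled = []
    for rec in incomplete:
        known = [k for k, v in rec.items() if v is not None]
        missing = [k for k, v in rec.items() if v is None]
        ranked = sorted(complete, key=lambda c: distance(rec, c, known))
        near = [c for c in ranked[:n] if distance(rec, c, known) <= radius]
        if n > 0 and len(near) == n:
            donors = near              # S3: mean of the N nearest neighbours
        elif complete and sum(distance(rec, c, known) for c in complete) < th:
            donors = complete          # S4: mean of all complete data
        else:
            continue                   # S4: otherwise the record is deleted
        new = dict(rec)
        for k in missing:
            new[k] = sum(d[k] for d in donors) / len(donors)
        filled.append(new)
    return filled

C = [{"bytes": 100.0, "hits": 10.0}, {"bytes": 110.0, "hits": 12.0},
     {"bytes": 500.0, "hits": 50.0}]
I = [{"bytes": 105.0, "hits": None},   # two close neighbours exist (S3)
     {"bytes": 300.0, "hits": None},   # no close neighbours, D < Th (S4)
     {"bytes": 9000.0, "hits": None}]  # D >= Th, so the record is dropped
result = fill_incomplete(C, I, n=2, radius=20.0, th=1000.0)
```

In this example the first record is filled from its two nearest neighbors, the second from the mean of all of C, and the third is discarded.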
(III) Classifying the filled data with a deep learning method based on an infinite deep neural network to obtain the classification results of internet websites.
This process can be further subdivided into the following steps:
(1) Internet traffic records are read from HDFS.
(2) MapReduce parallel processing and data crawling and parsing are performed on the read internet traffic records, and the parsed webpage content is stored in an HBase database.
Performing MapReduce parallel processing on the read internet traffic records means preprocessing them with a MapReduce program to obtain URL addresses whose web pages can be crawled. The preprocessing comprises URL combination, URL filtering and URL deduplication. Each URL is composed of a Host field and a URL field, so URL combination joins the Host field with the URL field. URL filtering and URL deduplication delete erroneous and repeated URLs, which improves data processing efficiency.
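The three preprocessing operations can be sketched in plain Python as follows. This is not a MapReduce job; it only illustrates the per-record logic. The `http://` scheme, the lower-casing of the host and the "host must contain a dot" filter rule are assumptions chosen for the example, since the text does not specify them.

```python
from urllib.parse import urlsplit

def combine(host, path):
    """URL combination: join the Host field and the URL (path) field."""
    return "http://" + host.strip().lower() + "/" + path.strip().lstrip("/")

def is_valid(url):
    """URL filtering (assumed rule): keep URLs whose host contains a dot."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc

def preprocess(records):
    """Combine each (host, path) record, then filter and deduplicate."""
    seen = set()
    urls = []
    for host, path in records:
        url = combine(host, path)
        if is_valid(url) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

records = [("Example.com", "/index.html"),
           ("example.com", "index.html"),    # duplicate after combination
           ("localhost", "/admin"),          # filtered: no dot in host
           ("news.example.org", "/a?id=1")]
urls = preprocess(records)
```

In a real MapReduce program the combination and filtering would typically sit in the map phase and the deduplication in the reduce phase, keyed by the combined URL.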
Data crawling and parsing means that multiple parallel web-crawling threads crawl and parse the URL addresses to obtain the contents of three fields (website title, keywords and description), which are then stored in the HBase database. These three fields are the core content of a webpage, and to save storage space the invention crawls and parses only these three fields. Data crawling and parsing can be implemented with the jsoup parser.
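As an illustration of the parsing half of this step, the following sketch extracts the three fields from an HTML string using only the Python standard library. It stands in for, and is not, the parser named in the text; fetching the page, multithreading and the HBase write are omitted, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class ThreeFieldParser(HTMLParser):
    """Collect the <title> text and the keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data.strip()

def extract_fields(html):
    """Return a dict with the title, keywords and description of a page."""
    parser = ThreeFieldParser()
    parser.feed(html)
    return parser.fields

page = """<html><head><title>Example Store</title>
<meta name="keywords" content="apps, games">
<meta name="description" content="An app store.">
</head><body>ignored</body></html>"""
fields = extract_fields(page)
```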
(3) The library identification module directly identifies and classifies the URL of each record in the internet traffic records using a library-based identification method.
The library identification module updates and maintains a table of identified URLs and a table of unidentified URLs through library files. The module requires library files in advance: they are initially built by manual addition, and afterwards the correctly classified new URLs produced by the deep learning identification and classification are used to update the original library files, so that the library files grow larger and more comprehensive.
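A library lookup of this kind might look like the sketch below. The library format (a mapping from host name to category label) and the function names are assumptions; the text does not describe the layout of the library files or of the two result tables.

```python
from urllib.parse import urlsplit

def classify_with_library(urls, library):
    """Split URLs into an identified table and an unidentified table.

    library maps a host name to a category label (hypothetical format).
    """
    identified, unidentified = {}, []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if host in library:
            identified[url] = library[host]
        else:
            unidentified.append(url)
    return identified, unidentified

def update_library(library, new_labels):
    """Add hosts newly classified elsewhere, e.g. by the deep learning step."""
    for url, label in new_labels.items():
        library.setdefault(urlsplit(url).netloc.lower(), label)

library = {"apps.example.com": "app store"}
urls = ["http://apps.example.com/top", "http://video.example.net/watch"]
identified, unidentified = classify_with_library(urls, library)

# After the unidentified URL has been classified, the library is expanded
# and the same URL is recognised directly on the next pass.
update_library(library, {"http://video.example.net/watch": "video"})
identified2, unidentified2 = classify_with_library(urls, library)
```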
(4) The webpage content that remains unidentified after classification by the library identification module is used as a training set, and a deep learning method based on an infinite deep neural network is used for modeling and classification, completing the automatic identification and classification of different types of internet websites.
Using the webpage content that remains unidentified after classification by the library identification module as a training set means crawling the URLs that the library identification module cannot identify and using the three fields (website title, keywords and description) of the corresponding webpage content as the training set.
Modeling and classifying with the deep learning method based on an infinite deep neural network means training and testing on the training set with that method to obtain a correct classification and recognition model and its parameters.
The deep learning method based on an infinite deep neural network can be implemented with the BPTT deep learning algorithm or the RTRL deep learning algorithm. The BPTT (Back-Propagation Through Time) algorithm is a back-propagation algorithm capable of training an infinite deep neural network, proposed by Professor R. J. Williams of Northeastern University in the United States. The RTRL (Real-Time Recurrent Learning) algorithm is an algorithm that propagates "activity" information forward, proposed by Robinson and Fallside et al.
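To make the idea of BPTT concrete, the following toy sketch applies it to a single scalar recurrent unit, h_t = tanh(w·h_{t-1} + u·x_t), with a squared-error loss on the final state. The network shape, the loss and all names are assumptions chosen for brevity; the text does not specify the architecture actually used. The gradients are checked against finite differences.

```python
import math

def forward(w, u, xs):
    """Run h_t = tanh(w*h_{t-1} + u*x_t) from h_0 = 0; return all states."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, y):
    """Squared error between the final state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt(w, u, xs, y):
    """Gradients of the loss w.r.t. w and u by back-propagation through time."""
    hs = forward(w, u, xs)
    dh = hs[-1] - y                    # dL/dh_T
    dw = du = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)   # back through tanh at step t
        dw += da * hs[t - 1]           # w is shared across all time steps
        du += da * xs[t - 1]
        dh = da * w                    # pass the error back to h_{t-1}
    return dw, du

xs, y, w, u = [0.5, -0.2, 0.8], 0.3, 0.7, -0.4
dw, du = bptt(w, u, xs, y)

# Check against central finite differences.
eps = 1e-6
dw_num = (loss(w + eps, u, xs, y) - loss(w - eps, u, xs, y)) / (2 * eps)
du_num = (loss(w, u + eps, xs, y) - loss(w, u - eps, xs, y)) / (2 * eps)
```

The backward loop visits the time steps in reverse and accumulates the contribution of each step to the shared weights, which is what distinguishes BPTT from back-propagation in a feed-forward network.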
(5) Based on the results of the deep learning identification and classification, the correctly classified URLs are extracted to update and expand the library files in the library identification module.
(IV) Performing data mining according to the classification results of internet websites.
One specific implementation of data mining comprises the following steps:
(1) Acquiring the number of visits and the number of visitors of the internet website, the users' traffic in each time period throughout the day, and the type data of app-store websites.
The users' traffic in each time period throughout the day can be counted with 1 hour as the statistics interval. The types of app-store websites may be the Apple app store, Android app stores, and so on.
(2) Analyzing user behavior characteristics according to the acquired data to obtain the user behavior characteristics of the internet website, which include the current total number of users of the internet website, the average number of visits per user, the average traffic brought by each visit, and the time period of the current time.
The user behavior feature analysis can be realized by adopting the existing feature analysis method, such as a clustering analysis algorithm and the like.
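The three numeric characteristics listed in step (2) can be obtained by simple aggregation over hourly buckets, as in the sketch below. This is plain counting rather than the clustering analysis mentioned above, and the record layout `(site, user_id, hour, traffic_bytes)` is a hypothetical one chosen for the example.

```python
from collections import defaultdict

def hourly_features(visits):
    """Aggregate visit records into per-(site, hour) behaviour features.

    Each visit is (site, user_id, hour, traffic_bytes); this record layout
    is an assumption made for the example.
    """
    users = defaultdict(set)
    count = defaultdict(int)
    traffic = defaultdict(float)
    for site, user, hour, nbytes in visits:
        key = (site, hour)
        users[key].add(user)
        count[key] += 1
        traffic[key] += nbytes
    return {
        key: {
            "total_users": len(users[key]),
            "avg_visits_per_user": count[key] / len(users[key]),
            "avg_traffic_per_visit": traffic[key] / count[key],
        }
        for key in count
    }

visits = [("shop.example.com", "u1", 9, 1000.0),
          ("shop.example.com", "u1", 9, 3000.0),
          ("shop.example.com", "u2", 9, 2000.0),
          ("shop.example.com", "u2", 10, 500.0)]
features = hourly_features(visits)
```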
(3) Predicting, by a triple moving average method and according to the user behavior characteristics of the internet website, the total number of users, the average number of visits per user and the average traffic brought by each visit of the internet website in the same time period on the next day.
The prediction by the triple moving average method proceeds as follows:
1) Initialize and read the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the time period of the current time on the current day.
Here, X_t may be the current total number of users, the average number of visits per user, or the average traffic brought by each visit.
2) Calculate the first moving average of X_t in period t,
Figure BDA0001240412840000111
where
Figure BDA0001240412840000112
is calculated as follows:
Figure BDA0001240412840000113
where t = [0.5T], [0.5T]+1, …, T; "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T.
3) Calculate the second moving average of X_t in period t,
Figure BDA0001240412840000114
where
Figure BDA0001240412840000115
is calculated as follows:
Figure BDA0001240412840000116
where t = [0.75T], [0.75T]+1, …, T; "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T.
4) Calculate the third moving average of X_t in the T-th period,
Figure BDA0001240412840000117
where
Figure BDA0001240412840000118
is calculated as follows:
Figure BDA0001240412840000121
where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, T-1 denotes the same time period on the day before the current day, and T+1 denotes the same time period on the day after the current day.
5) Calculate the predicted value X_{T+1} of X_t for the same time period on the day after the current day; X_{T+1} is calculated as follows:
Figure BDA0001240412840000122
where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, and T+1 denotes the same time period on the day after the current day.
6) End, and output the predicted value X_{T+1} of X_t for the same time period on the day after the current day.
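The exact formulas of steps 2) to 5) are given only as figures and are not reproduced in this text, so the following sketch shows just the general idea of a triple moving average: a moving average is taken of the series, then of the first averages, then of the second averages. The fixed trailing window `width` is an assumption and differs from the [0.5T] and [0.75T] index ranges named above; the forecasting formula of step 5) is deliberately omitted.

```python
import math

def trailing_mean(values, width):
    """Trailing moving average; output starts once a full window exists."""
    return [sum(values[i - width + 1:i + 1]) / width
            for i in range(width - 1, len(values))]

def triple_moving_average(xs, width):
    """Return the latest first, second and third moving averages of xs.

    Needs at least 3 * width - 2 observations.
    """
    first = trailing_mean(xs, width)
    second = trailing_mean(first, width)
    third = trailing_mean(second, width)
    return first[-1], second[-1], third[-1]

xs = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
T = len(xs)
start = math.ceil(0.5 * T)   # "[0.5T]": smallest integer not less than 0.5T
m1, m2, m3 = triple_moving_average(xs, width=3)
```

On a steadily rising series each further level of averaging lags behind the previous one, and forecasting formulas of this family use those gaps to extrapolate the trend.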
(4) According to the total number of users U_{T+1} of the internet website in the same time period on the next day, the average number of visits per user
Figure BDA0001240412840000123
and the average traffic brought by each visit
Figure BDA0001240412840000124
the maximum access traffic FLOW_{T+1} of the internet website in the same time period on the next day is calculated.
(V) Recommending internet websites to the user according to the data mining results.
There are two methods for recommending internet websites for users: one is a recommendation method based on predicted maximum access traffic, and the other is a recommendation method based on collaborative filtering.
The recommendation method based on the predicted maximum access traffic proceeds as follows: for each internet website in turn, it is determined whether the maximum access traffic FLOW_{T+1} in the same time period on the next day is greater than the traffic threshold set by the user; if so, the internet website is recommended to the user; otherwise the next internet website is examined, until all internet websites have been examined.
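The threshold rule can be sketched as below. The formula for the maximum access traffic is given only as a figure, so the product of the three predicted quantities used in `predicted_flow` is an assumption, as are the function names and the sample numbers.

```python
def predicted_flow(total_users, visits_per_user, traffic_per_visit):
    """Assumed form of the flow: users x visits per user x traffic per visit."""
    return total_users * visits_per_user * traffic_per_visit

def recommend_by_flow(predictions, threshold):
    """Return the sites whose predicted flow exceeds the user's threshold."""
    recommended = []
    for site, (users, visits, traffic) in predictions.items():
        if predicted_flow(users, visits, traffic) > threshold:
            recommended.append(site)
    return recommended

predictions = {
    "news.example.com": (1000, 2.0, 1.5),   # assumed flow: 3000.0
    "blog.example.org": (100, 1.0, 2.0),    # assumed flow: 200.0
}
sites = recommend_by_flow(predictions, threshold=1000.0)
```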
The recommendation method based on collaborative filtering makes recommendations by computing similar users or similar items. Accordingly, it can be divided into a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm, and the user can choose flexibly between them according to actual needs.
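A minimal user-based variant might be sketched as follows, treating websites as the items. The cosine similarity over visit counts, the similarity-weighted scoring and the sample data are assumptions; the text does not fix a similarity measure or scoring rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse visit-count vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend_user_based(target, visits, top_k=1):
    """Score unseen sites by the similarity-weighted visits of other users."""
    scores = {}
    for other, sites in visits.items():
        if other == target:
            continue
        sim = cosine(visits[target], sites)
        for site, count in sites.items():
            if site not in visits[target]:
                scores[site] = scores.get(site, 0.0) + sim * count
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_k]

visits = {
    "alice": {"news.example.com": 5, "shop.example.com": 3},
    "bob":   {"news.example.com": 4, "shop.example.com": 2,
              "video.example.net": 6},
    "carol": {"games.example.org": 7},
}
picks = recommend_user_based("alice", visits)
```

An item-based variant would instead compute similarities between websites from the users who visit them and recommend websites similar to those the target user already visits.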
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A deep learning-based internet traffic big data analysis method, characterized by comprising the following steps:
acquiring original internet traffic monitoring data;
filling the acquired internet traffic monitoring data by adopting an incomplete data filling algorithm fusing an N-nearest neighbor filling algorithm and a threshold filling algorithm, wherein N is the total number of the set nearest neighbor data;
classifying the data after filling processing by adopting a deep learning method based on an infinite deep neural network to obtain a classification result of the internet website, wherein feedback connection exists between neurons in the same layer of the infinite deep neural network;
data mining is carried out according to the classification result of the internet website;
recommending internet websites for the user according to the data mining result;
the step of filling the acquired internet traffic monitoring data by adopting an incomplete data filling algorithm fusing an N-neighbor filling algorithm and a threshold filling algorithm comprises the following steps:
S1, performing noise cleaning on the acquired internet traffic monitoring data to obtain noise-cleaned data;
S2, dividing the noise-cleaned data into a complete data set C and an incomplete data set I according to whether each datum is complete; data in C proceed directly to step S5, and data in I proceed to step S3;
S3, for a datum i in I, searching C for its N nearest neighbors and determining whether the N data most similar to i can be found in C; if so, filling i into complete data using the mean of the N nearest neighbors and then executing step S5; otherwise executing step S4;
S4, calculating the sum D of the distances between the datum i in I and all data in the complete data set C, and determining whether D is smaller than a set threshold Th; if so, filling i into complete data using the mean of all data in C and then executing step S5; otherwise deleting i from I;
S5, sequentially performing data integration, data conversion and data reduction on the filled data, and storing the reduced data in HDFS;
the step of obtaining the classification result of the internet website by classifying the data after the filling processing by adopting a deep learning method based on an infinite deep neural network comprises the following steps:
reading Internet flow records from the HDFS;
performing MapReduce parallel processing and data capture and analysis processing on the read Internet traffic records, and storing the analyzed webpage content into an HBase database;
directly identifying and classifying, by a library identification module, the URL of each record in the internet traffic records using a library-based identification method, wherein the library identification module updates and maintains a table of identified URLs and a table of unidentified URLs through library files;
taking the webpage content that remains unidentified after classification by the library identification module as a training set, and modeling and classifying with a deep learning method based on an infinite deep neural network, so as to complete the automatic identification and classification of different types of internet websites;
based on the result of deep learning identification, extracting the URL of correct classification, and updating and expanding the library file in the library identification module;
the step of performing data mining according to the classification result of the internet website includes:
acquiring the number of times of visiting an internet website, the number of visitors, the flow of each time period of a user all day and the type data of an application store website;
analyzing the user behavior characteristics according to the acquired data to obtain the user behavior characteristics of the internet website, wherein the user behavior characteristics of the internet website comprise the current total number of users of the internet website, the average number of times of visiting each user, the average flow brought by visiting each time and the period of time of the current time;
predicting the total number of users, the average number of visits per user and the average flow brought by each visit of the internet website in the same period of the next day by adopting a triple moving average method according to the user behavior characteristics of the internet website;
and calculating the maximum access flow of the internet website in the same period of the next day according to the total number of users, the average access times per user and the average flow brought by each access of the internet website in the same period of the next day.
2. The deep learning-based internet traffic big data analysis method according to claim 1, characterized in that: the step of performing MapReduce parallel processing and data capture and analysis processing on the read Internet traffic records and storing the analyzed webpage content into an HBase database comprises the following steps:
preprocessing the read internet traffic records with a MapReduce program to obtain URL addresses whose web pages can be crawled, wherein the preprocessing comprises URL combination, URL filtering and URL deduplication;
crawling and parsing the URL addresses with multiple parallel web-crawling threads to obtain the contents of three fields, namely website title, keywords and description, and storing the contents of the three fields in an HBase database.
3. The deep learning-based internet traffic big data analysis method according to claim 2, characterized in that: the step of taking the webpage content that remains unidentified after classification by the library identification module as a training set and modeling and classifying with a deep learning method based on an infinite deep neural network comprises:
crawling the three fields, namely website title, keywords and description, of the webpage content corresponding to each URL that the library identification module cannot identify;
and taking the three crawled fields as a training set, and performing training modeling and classification by adopting a BPTT deep learning algorithm or an RTRL deep learning algorithm to finish automatic identification and classification of different types of internet websites.
4. The deep learning-based internet traffic big data analysis method according to claim 3, characterized in that: the process of predicting by the triple moving average method comprises the following steps:
initializing and reading the current value X_t in period t, where t = 1, 2, 3, …, T, and T is the time period of the current time on the current day;
calculating the first moving average of X_t in period t,
Figure FDA0002363891060000031
where
Figure FDA0002363891060000032
is calculated as follows:
Figure FDA0002363891060000033
where t = [0.5T], [0.5T]+1, …, T; "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T;
calculating the second moving average of X_t in period t,
Figure FDA0002363891060000034
where
Figure FDA0002363891060000035
is calculated as follows:
Figure FDA0002363891060000036
where t = [0.75T], [0.75T]+1, …, T; "[ ]" is the rounding symbol, and "[0.5T]" denotes the smallest integer not less than 0.5T;
calculating the third moving average of X_t in the T-th period,
Figure FDA0002363891060000037
where
Figure FDA0002363891060000038
is calculated as follows:
Figure FDA0002363891060000039
where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, T-1 denotes the same time period on the day before the current day, and T+1 denotes the same time period on the day after the current day;
calculating the predicted value X_{T+1} of X_t for the same time period on the day after the current day, X_{T+1} being calculated as follows:
Figure FDA00023638910600000310
where "[ ]" is the rounding symbol, "[0.5T]" denotes the smallest integer not less than 0.5T, and T+1 denotes the same time period on the day after the current day;
ending, and outputting the predicted value X_{T+1} of X_t for the same time period on the day after the current day.
5. The deep learning-based internet traffic big data analysis method according to claim 4, characterized in that: the step of calculating the maximum access traffic of the internet website in the same time period on the next day according to the total number of users, the average number of visits per user and the average traffic brought by each visit of the internet website in the same time period on the next day specifically comprises:
according to the predicted total number of users U_{T+1} in the same time period on the next day, the average number of visits per user
Figure FDA00023638910600000311
and the average traffic brought by each visit
Figure FDA0002363891060000041
calculating the maximum access traffic FLOW_{T+1} in the same time period on the next day, where FLOW_{T+1} is calculated as follows:
Figure FDA0002363891060000042
6. The deep learning-based internet traffic big data analysis method according to claim 5, characterized in that: the step of recommending internet websites to the user according to the data mining result specifically comprises:
determining, for each internet website in turn, whether the maximum access traffic FLOW_{T+1} in the same time period on the next day is greater than the traffic threshold set by the user; if so, recommending the internet website to the user; otherwise moving on to the next internet website, until all internet websites have been examined.
7. The deep learning-based internet traffic big data analysis method according to any one of claims 1 to 3, characterized in that: the step of recommending the internet website for the user according to the data mining result specifically comprises the following steps:
recommending internet websites to the user by a collaborative filtering method according to the data mining result, wherein the collaborative filtering method includes a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm.
CN201710132366.8A 2017-03-07 2017-03-07 Deep learning-based internet traffic big data analysis method Active CN107086925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710132366.8A CN107086925B (en) 2017-03-07 2017-03-07 Deep learning-based internet traffic big data analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710132366.8A CN107086925B (en) 2017-03-07 2017-03-07 Deep learning-based internet traffic big data analysis method

Publications (2)

Publication Number Publication Date
CN107086925A CN107086925A (en) 2017-08-22
CN107086925B true CN107086925B (en) 2020-04-07

Family

ID=59614491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710132366.8A Active CN107086925B (en) 2017-03-07 2017-03-07 Deep learning-based internet traffic big data analysis method

Country Status (1)

Country Link
CN (1) CN107086925B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241383B (en) * 2018-07-20 2019-06-21 北京开普云信息科技有限公司 A kind of type of webpage intelligent identification Method and system based on deep learning
CN112287199A (en) * 2020-10-29 2021-01-29 黑龙江稻榛通网络技术服务有限公司 Big data center processing system based on cloud server
CN112765439A (en) * 2021-02-25 2021-05-07 重庆三峡学院 Data processing method and device based on big data platform
CN115297023A (en) * 2022-09-21 2022-11-04 天津沄讯网络科技有限公司 Internet flow big data analysis method of improved deep learning algorithm model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617230A (en) * 2013-11-26 2014-03-05 中国科学院深圳先进技术研究院 Method and system for advertisement recommendation based microblog

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015134665A1 (en) * 2014-03-04 2015-09-11 SignalSense, Inc. Classifying data with deep learning neural records incrementally refined through expert input

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
大数据分析的无限深度神经网络方法;张蕾 等;《计算机研究与发展》;20161231(第1期);第68-79页 *

Also Published As

Publication number Publication date
CN107086925A (en) 2017-08-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant