CN112597141A

CN112597141A - Network flow detection method based on public opinion analysis

Info

Publication number: CN112597141A
Application number: CN202011554194.1A
Authority: CN
Inventors: 张志伟; 李钢锋; 梁卫国; 郭栋; 王文辉; 刘达; 孙衡; 徐晓强; 王淦; 吕显斌; 曹华; 齐云雷; 闫昊; 刘震; 李鑫; 王少伟; 焦健
Original assignee: State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Shandong Electric Power Co Ltd
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2021-04-02
Anticipated expiration: 2040-12-24
Also published as: CN112597141B

Abstract

The invention relates to a network traffic detection method based on public opinion analysis. The method reads in raw traffic data, obtains time-series feature data, and divides all traffic data into data flows according to quintuple information; extracts the statistical features and payload features of the data flows and makes a preliminary prediction of public opinion traffic in combination with the time-series feature data; cleans the feature set by removing the noise samples identified by an isolation forest, and divides the denoised feature set into a training set and a test set by random sampling without replacement; and confirms the preliminarily predicted public opinion traffic using the training set and the test set, and determines the type of the public opinion traffic based on its influence. The method improves the accuracy of public opinion traffic prediction, reduces prediction error, realizes adaptive deep learning for the network public opinion analysis system, mines key nodes and their evolution patterns from a dynamic perspective, and provides an influence-based detection method tailored to the various public opinion situations on the network.

Description

Network flow detection method based on public opinion analysis

Technical Field

The invention relates to the technical fields of information source influence assessment and deep learning, and in particular to a network traffic detection method based on public opinion analysis.

Background

The quality of information sources is a precondition for the accuracy and quality of public opinion big data. For public opinion data to provide accurate support for public opinion analysis and prediction, evaluating and selecting high-quality information sources from the mass of available sources has become very important.

Effectively evaluating public opinion information sources is a challenging task. At home and abroad there are two main approaches to evaluating the influence of websites: qualitative and quantitative. Most current studies of influence assessment use quantitative methods; that is, from the perspective of webometrics, the influence of a website is evaluated and analyzed through quantifiable indexes such as the number of incoming links, the number of outgoing links, the web impact factor, and the number of visits to the website. However, little work has addressed the influence of internet public opinion information sources, and few methods apply deep learning techniques to evaluating that influence.

The sender of public opinion information is the information source, and the receiver is the netizen. The information source transmits public opinion information to netizens by publishing, forwarding, or quoting information. Netizens in turn express their degree of interest in public opinion information by publishing articles, clicking, replying, and so on, and these behaviors also indicate how strongly the netizens are influenced by the information source. Therefore, when evaluating the influence of an online public opinion information source, the source's own activity is considered first; it can be represented by factors such as the frequency with which it publishes articles. In addition, online public opinion spreads through the internet and is the sum of the cognition, attitudes, emotions, and behavioral tendencies that people form toward an event when stimulated by it. Netizens, as the receivers influenced by public opinion information sources, therefore hold a very important position in the spread of online public opinion, and their attention to information sources must also be considered when evaluating influence. This attention can be reflected by the frequency with which the source publishes information and by the frequency with which netizens click on and reply to that information. Finally, the degree of fit between the information source and the public opinion topic is considered.

The information explosion on social media marks the arrival of the big data era. As network users form various subcultural circles, a great many non-standard expressions are continuously generated, and their use constitutes a huge online Chinese corpus. These non-standard expressions are a major component of online Chinese and are crucial to the effectiveness of Chinese natural language processing. Most existing network recognition systems are traditional lexicon-based systems that recognize non-standard expressions poorly. If the traditional lexicon-based method continues to be used, non-standard expressions cannot be recognized accurately, which causes important information to be lost or misjudged and creates many problems and challenges for natural language processing, public opinion analysis, and related tasks. In an era of explosive growth in network information, many social networks communicate using non-standard expressions, which makes the social network environment and public opinion very complex. A network public opinion analysis system based on deep learning can analyze actual network public opinion effectively, so establishing such a system is very important.

Disclosure of Invention

To solve the problem of monitoring and identifying public opinion data in current network traffic data, the application claims a network traffic detection method based on public opinion analysis.

The application claims a network traffic detection method based on public opinion analysis, characterized by comprising the following steps:

reading in raw traffic data, acquiring time-series feature data, and dividing all traffic data into data flows according to quintuple information;

extracting the statistical features and payload features of the data flows, and preliminarily predicting public opinion traffic in combination with the time-series feature data;

cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement;

and confirming the preliminarily predicted public opinion traffic according to the training set and the test set, and determining the type of the public opinion traffic based on its influence.
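The cleaning and splitting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each sample is assumed to already carry an anomaly score (such as an isolation forest would produce), and the names `split_without_replacement`, `noise_threshold`, and `train_fraction`, as well as the threshold value, are hypothetical.

```python
import random

def split_without_replacement(samples, scores, noise_threshold=0.6,
                              train_fraction=0.8, seed=0):
    """Drop samples flagged as noise, then split the rest without replacement.

    `scores` are assumed anomaly scores (higher means more anomalous);
    the threshold is illustrative only.
    """
    clean = [s for s, score in zip(samples, scores) if score < noise_threshold]
    rng = random.Random(seed)
    n_train = int(len(clean) * train_fraction)
    # random.sample draws without replacement, so no index is picked twice.
    train_idx = set(rng.sample(range(len(clean)), n_train))
    train = [s for i, s in enumerate(clean) if i in train_idx]
    test = [s for i, s in enumerate(clean) if i not in train_idx]
    return train, test

samples = list(range(10))
scores = [0.1, 0.2, 0.9, 0.3, 0.4, 0.95, 0.2, 0.1, 0.3, 0.5]
train, test = split_without_replacement(samples, scores)
```

Because the training indices are drawn without replacement and the test set is the complement, no sample can appear in both sets.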

Reading in the raw traffic data, acquiring the time-series feature data, and dividing all traffic data into data flows according to the quintuple information comprises:

preprocessing the acquired time-series feature data;

extracting the quintuple <source IP address, destination IP address, source port, destination port, transport layer protocol> from the packet header, comparing the extracted port numbers with the information in the table to identify the data flow to which each packet belongs, recording the occurrence time of each data flow, judging whether the occurrence time of each data flow exceeds a set time threshold, and, if it does, performing a feature-weighted calculation on that data flow;

and preliminarily predicting public opinion traffic according to the preprocessed time-series feature data.
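The quintuple-based division can be illustrated with a short sketch. It assumes a hypothetical `Packet` record and interprets the "occurrence time" of a flow as the span between its first and last packet; neither detail is specified in the text.

```python
from collections import defaultdict, namedtuple

# Hypothetical packet record; the field names are assumptions for illustration.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto length")

def split_into_flows(packets):
    """Group packets by <src IP, dst IP, src port, dst port, protocol>."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key].append(p)
    # Keep packets in arrival order within each flow.
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.ts)
    return dict(flows)

def long_lived_flows(flows, time_threshold):
    """Return the keys of flows whose time span exceeds the threshold."""
    return [key for key, pkts in flows.items()
            if pkts[-1].ts - pkts[0].ts > time_threshold]

packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 60),
    Packet(0.5, "10.0.0.3", "10.0.0.2", 5001, 53, "UDP", 80),
    Packet(4.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 1500),
]
flows = split_into_flows(packets)
weighted = long_lived_flows(flows, time_threshold=2.0)
```

Flows returned by `long_lived_flows` would be the ones passed on to the feature-weighted calculation.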

Extracting the statistical features and payload features of the data flows and preliminarily predicting public opinion traffic in combination with the time-series feature data comprises:

extracting specific fields, characters, and character strings from the application-layer payload of the packets;

and inspecting only a small range of the payload, namely the first N packets in each direction of the data flow.

Cleaning the feature set by removing the noise samples identified by the isolation forest and dividing the denoised feature set into a training set and a test set by random sampling without replacement comprises:

training and testing a Spark-based Bagging learning model with the training set and the test set, and obtaining a test result;

the Spark-based Bagging learning algorithm starts a Spark computing cluster and submits the Bagging ensemble learning algorithm to be executed in parallel to the Spark driver;

the Bagging ensemble learning algorithm submitted to the Spark driver is run, triggering an Action function in the parallel program;

the SparkContext module computes and generates the corresponding RDD for the data set of the Bagging application and generates the corresponding jobs at the same time;

after the jobs are submitted to the master node of the system, the DAG scheduler computes a DAG and a corresponding task set for each job and submits them to the task scheduler;

the task scheduler handles the actual physical scheduling and execution of each task in the task set: it computes, through the scheduler backend, the computing resources each task requires, assigns tasks to executors according to that result, and starts the corresponding executors.

The method reads in raw traffic data, obtains time-series feature data, divides all traffic data into data flows according to quintuple information, extracts the statistical features and payload features of the data flows, and makes a preliminary prediction of public opinion traffic in combination with the time-series feature data; cleans the feature set by removing the noise samples identified by the isolation forest, and divides the denoised feature set into a training set and a test set by random sampling without replacement; and confirms the preliminarily predicted public opinion traffic using the training set and the test set, and determines the type of the public opinion traffic based on its influence. The method improves the accuracy of public opinion traffic prediction, reduces prediction error, realizes adaptive deep learning for the network public opinion analysis system, mines key nodes and their evolution patterns from a dynamic perspective, and provides an influence-based detection method tailored to the various public opinion situations on the network.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flowchart of a public opinion analysis-based network traffic detection method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the present invention provides a public opinion analysis-based network traffic detection method, which is characterized by comprising:

In the method, positive sample data is constructed from purposefully crawled data, so the quality of the positive samples can be higher. By combining PU-learning, more reliable negative samples that lie farther from the positive samples can be obtained from the unlabeled samples. By further combining PU-learning with multi-model co-training, a relatively good result can be obtained in the situation, common in industry, where there are few positive samples and a large amount of unlabeled data. The problems and events that business personnel care about are identified from public opinion data with relatively high accuracy and are pushed as timely early warnings, which greatly improves the working efficiency of business personnel and makes it easier for them to take risk management measures based on the identification results.

Preferably, reading in the raw traffic data to obtain the time-series feature data and dividing all traffic data into data flows according to the quintuple information comprises:

preprocessing the acquired time-series feature data;

Further, extracting the statistical features and payload features of the data flows and preliminarily predicting public opinion traffic in combination with the time-series feature data comprises:

The method further comprises: collecting a plurality of real network traffic data sets and organizing them into a set of TCP and UDP flows;

The traffic data set contains the information needed to determine the protocol type of a flow and to extract feature fields by means such as port numbers and payload inspection, and also contains the feature parameters needed in subsequent steps, such as packet length and packet inter-arrival time. The real traffic data set described in S101 may be captured by probes in a network carrying many users, and may also include real traffic obtained in other ways, for example by deliberately generating traffic of a specific service type at some terminals and capturing it on the transmission path. After collection, the traffic is separated into individual flows, namely TCP and UDP flows, according to the quintuple {source address, destination address, source port, destination port, transport layer protocol type}, so that the traffic data set becomes a set of TCP and UDP flows.

The data flows are divided into data flow blocks according to port numbers, the port numbers and protocol feature fields are extracted, and the header features of each data flow are extracted; the traffic is separated into TCP and UDP flows according to the quintuple {source address, destination address, source port, destination port, transport layer protocol type}, so that the traffic data set becomes a set of TCP and UDP flows. The start of a TCP flow is determined according to, but not limited to, the SYN, SYN/ACK, and ACK packets of the TCP handshake, and the packets in a data flow must be ordered by their arrival at the observation point. UDP data flows have no distinct start packet, so they are usually delimited by a time limit and are likewise divided into separate UDP flows according to the quintuple. The service protocol type of each TCP and UDP data flow, such as WWW, MAIL, FTP, P2P, Service, or IM, is obtained by protocol analysis. Then, for the TCP and UDP data flows, the statistical features of the packets in each flow are extracted and a feature sequence is constructed following the order of the packets in the flow. That is, the values of several attributes, such as packet length, corrected packet inter-arrival time, and transmission direction, are extracted for the first several packets of each flow. The number of packets can be adjusted according to actual test conditions; experiments show that 5 to 10 is a suitable number. Because packet length and corrected inter-arrival time span a wide range and need to be discretized, the data may need to be normalized, for example by processing the raw values with a log function or an arctan function and choosing a suitable discretization scale.
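The feature sequence construction described above might look like the following sketch. The tuple layout `(length_bytes, gap_seconds, direction)`, the number of bins, and the 1500-byte scaling constant are assumptions chosen for illustration; the text only states that a log or arctan function and a suitable discrete scale are used.

```python
import math

def flow_feature_sequence(packets, n_head=5, bins=8, max_log=math.log1p(1500)):
    """Build a discretized feature sequence from the first n_head packets.

    Each packet is an assumed (length_bytes, gap_seconds, direction) tuple,
    with direction +1 (client to server) or -1 (server to client).
    """
    sequence = []
    for length, gap, direction in packets[:n_head]:
        # Log compression narrows the wide range of packet lengths.
        scaled = min(math.log1p(length) / max_log, 1.0)
        length_bin = min(int(scaled * bins), bins - 1)
        # arctan maps any non-negative gap into [0, 1).
        gap_scaled = math.atan(gap) / (math.pi / 2)
        gap_bin = min(int(gap_scaled * bins), bins - 1)
        sequence.append((length_bin, gap_bin, direction))
    return sequence
```

Calling `flow_feature_sequence(packets, n_head=5)` on the packets of one flow yields one `(length_bin, gap_bin, direction)` triple per head packet, in arrival order.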

Network traffic is adaptively decomposed by EMD (empirical mode decomposition) into IMFs (intrinsic mode functions), each with a single frequency. All IMF components are clustered with an improved K-means algorithm so that IMFs of similar complexity are grouped together, and each cluster of IMF components is predicted with a Kalman model, which has simple parameters and is fast to compute, so that low complexity and high prediction accuracy are achieved.

Data preprocessing: the raw traffic data is acquired and preprocessed so that the basic characteristics of the network traffic are preserved while the data becomes convenient for subsequent processing.

Modal decomposition: the preprocessed, normalized traffic time series is decomposed by the EEMD (ensemble empirical mode decomposition) algorithm into a group of simple subsequences (IMFs) ordered from high frequency to low frequency. White noise is added before each decomposition, and all IMFs are finally sifted out by computing the unique residue. The complex original time series is thus decomposed into several subsequences that are easy to analyze and model, which addresses the mode mixing problem of the EMD algorithm.

Component denoising: the IMF components containing more noise are selected from the IMF subsequences using a constructed autocorrelation-function energy criterion and are denoised with the soft threshold function used in wavelet denoising. Both the denoised IMFs and the IMFs that did not need denoising are used in the subsequent steps.
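As a small illustration of the denoising step, the standard soft threshold function shrinks every value toward zero by a fixed amount and zeroes anything smaller than that amount. The sketch below applies it directly to the samples of a component; how the threshold is chosen is not stated in the text, so `threshold` is a free parameter here.

```python
def soft_threshold(values, threshold):
    """Soft thresholding: sign(v) * max(|v| - threshold, 0) for each value."""
    out = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0:
            # Small values are treated as noise and set to zero.
            out.append(0.0)
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out

denoised = soft_threshold([3.0, -2.0, 0.5, -0.2], threshold=1.0)
```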

Component prediction: the stationarity of each IMF component is tested, and the stationary and non-stationary IMFs are predicted separately. The non-stationary IMF components are predicted with an Elman neural network whose weights are adjusted by gradient descent and saved; the best predicted value of the public opinion traffic is output once the error E between the predicted value and the true value meets a set threshold or the number of iterations reaches its upper limit. For the stationary IMF components, the model order is determined by the AIC criterion, the parameters are estimated by the least squares method, and the resulting ARMA model is used for prediction.

Modal reconstruction: the prediction results of all IMF subsequences are summed and processed with the inverse normalization formula to obtain the final predicted value of the overall public opinion traffic.
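The reconstruction step can be sketched in a few lines. The text does not state which normalization formula is used, so this example assumes min-max normalization, x' = (x - x_min) / (x_max - x_min); the function name and parameters are hypothetical.

```python
def reconstruct_forecast(component_forecasts, x_min, x_max):
    """Sum the per-IMF forecasts and undo min-max normalization.

    Assumes the series was normalized as (x - x_min) / (x_max - x_min).
    """
    horizon = len(component_forecasts[0])
    # Add the forecasts of all components at each time step.
    summed = [sum(c[t] for c in component_forecasts) for t in range(horizon)]
    # Map the normalized values back to the original traffic scale.
    return [s * (x_max - x_min) + x_min for s in summed]

forecast = reconstruct_forecast([[0.25, 0.5], [0.25, 0.0]], 100.0, 300.0)
```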

(1) News about existing enterprise funding problems is crawled using keyword combinations and labeled as positive samples of funding problems. At the same time, the names of sound enterprises without funding problems, such as Tencent and Alibaba, are used as keywords to crawl related news as unlabeled samples. Because this news may still include some items about funding problems, it cannot be called a negative sample set; such unknown samples are called unlabeled samples. This yields a small number of positive samples (crawled from the web and partly confirmed by manual labeling) and a large number of unlabeled samples. (2) Using positive-unlabeled learning (PU-learning), the samples with the greatest cosine distance from the positive samples are found iteratively in the large unlabeled sample set from step (1), are treated as more reliable negative samples, and together with the positive samples form the training set.
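One round of the reliable-negative selection can be sketched as below. This is a simplified illustration, not the full iterative PU-learning procedure: it assumes documents are already represented as numeric vectors and measures each unlabeled vector against the centroid of the positive samples. The function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def reliable_negatives(positives, unlabeled, k):
    """Pick the k unlabeled vectors farthest (by cosine) from the positive centroid."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    ranked = sorted(unlabeled, key=lambda u: cosine_distance(u, centroid),
                    reverse=True)
    return ranked[:k]

positives = [[1.0, 0.0], [0.9, 0.1]]
unlabeled = [[1.0, 0.1], [0.0, 1.0], [0.5, 0.5]]
negatives = reliable_negatives(positives, unlabeled, k=2)
```

An iterative variant would train a classifier on the positives and these negatives, rescore the remaining unlabeled samples, and repeat.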

Preferably, cleaning the feature set by removing the noise samples identified by the isolation forest and dividing the denoised feature set into a training set and a test set by random sampling without replacement comprises:

Wherein, using a stratified random forest algorithm, a non-negative function is applied to each attribute Ai to calculate the information value of Ai, and the value obtained is then normalized to give θi. A decision tree classifier hi(Xi) is constructed from each training data subset Xi. At each node, attributes are randomly drawn from As and Aw in proportion to form an attribute subspace of size p (p > 1). A Boolean test function τ based on the attribute values of the subspace divides the training data between the left and right child nodes. The process is iterated until one of the following stop conditions is met: all the data belong to the same class, every attribute has the same value, or the number of training samples is below a minimum. The K unpruned decision tree classifiers h1(X1), h2(X2), ..., hK(XK) are integrated into a random forest, and voting over the decision trees is used as the forest's classification decision.
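Two pieces of this procedure, drawing the attribute subspace and the final vote, can be sketched as follows. The proportional rule used here (share the p slots according to the sizes of the two attribute groups, with at least one attribute from the first group) is an assumption, since the text only says the attributes are drawn "in proportion"; `sample_subspace` and `forest_vote` are hypothetical names, and the vote is shown as a simple majority.

```python
import random
from collections import Counter

def sample_subspace(group_s, group_w, p, rng):
    """Draw p attributes from the two attribute groups in proportion to their size."""
    total = len(group_s) + len(group_w)
    n_s = max(1, min(len(group_s), round(p * len(group_s) / total)))
    n_w = min(len(group_w), p - n_s)
    # rng.sample draws without replacement inside each group.
    return rng.sample(group_s, n_s) + rng.sample(group_w, n_w)

def forest_vote(trees, x):
    """Majority vote over the K tree classifiers (each a callable x -> label)."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

rng = random.Random(0)
group_s = ["a", "b", "c", "d"]
group_w = ["e", "f", "g", "h", "i", "j", "k", "l"]
subspace = sample_subspace(group_s, group_w, p=3, rng=rng)
```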

Preparing target training data for the first classifier: in the training feature data, the quoted result of each training public opinion item whose quoted result is greater than 0 is converted to 1, and a quoted result equal to 0 is kept unchanged, giving the target training data of the first classifier.

Preparing target training data for the second classifier: in the training feature data, the quoted result of each training public opinion item whose quoted result is greater than or equal to a first preset threshold is converted to 1, and a quoted result less than the first preset threshold is converted to 0, giving the target training data of the second classifier.

Preparing target training data for the first regressor: the training feature data and corresponding quoted results of the training public opinion items whose quoted result is greater than or equal to the first preset threshold are removed, and the quoted results of the remaining items are transformed by log2(1 + x), giving the target training data of the first regressor.

Preparing target training data for the second regressor: the training feature data and corresponding quoted results of the training public opinion items whose quoted result is less than the first preset threshold are removed, and the quoted results of the remaining items are transformed by log2(1 + x), giving the target training data of the second regressor.
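The four target sets can be derived from the raw quoted results in one pass, as in this sketch. It assumes the quoted results are non-negative numbers; the function name `build_targets` and the example threshold are hypothetical.

```python
import math

def build_targets(features, quoted, threshold):
    """Derive the two classifier and two regressor targets from quoted results."""
    # First classifier: was the item quoted at all?
    clf1_y = [1 if q > 0 else 0 for q in quoted]
    # Second classifier: did the quoted result reach the threshold?
    clf2_y = [1 if q >= threshold else 0 for q in quoted]
    # First regressor: items below the threshold, log2(1 + x) targets.
    reg1 = [(f, math.log2(1 + q)) for f, q in zip(features, quoted)
            if q < threshold]
    # Second regressor: items at or above the threshold, log2(1 + x) targets.
    reg2 = [(f, math.log2(1 + q)) for f, q in zip(features, quoted)
            if q >= threshold]
    return clf1_y, clf2_y, reg1, reg2

clf1_y, clf2_y, reg1, reg2 = build_targets(["a", "b", "c", "d"], [0, 3, 7, 15], 7)
```

The log2(1 + x) transform keeps a quoted result of 0 at 0 while compressing large values, which is why it is applied to both regressors.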

A preset initial sentiment analysis model is trained with the training set, the trained model is validated with the validation set and tested with the test set, and a sentiment analysis model and a dictionary corresponding to the target domain are generated.

The training of each base learner in the Bagging algorithm and the feature computations within each base learner are both parallelized. Because each base learner is trained independently on a sub-training set generated by bootstrap sampling, there is no logical or data dependence between base learners. During construction of a decision tree base learner, splitting a tree node requires computing the information gain ratio of every feature in the current feature subset and selecting the feature with the highest ratio; tree nodes at the same level do not depend on one another. The naive Bayes algorithm only needs to compute the conditional probability of each feature separately, so the model likewise has no logical or data dependence. Therefore, when the decision tree and naive Bayes algorithms are used in construction, the computation tasks have neither logical nor data dependence.

Outer-layer training: all base learners are trained in parallel. First, k training subsets, one per base learner, are drawn from the training data set by bootstrap sampling; then each base learner is trained on its own subset with its learning algorithm, giving k trained base learners in total.

Inner-layer training: the computation tasks within the training of a base learner run in parallel. Tree nodes at the same level can be split in parallel by computing the feature variables of the current training subset simultaneously, and the naive Bayes model computes the conditional probabilities of all features simultaneously.
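The outer-layer training can be illustrated without a Spark cluster. The sketch below uses a thread pool from the Python standard library as a stand-in for Spark executors, and a toy one-feature threshold classifier as the base learner instead of a decision tree or naive Bayes model; both substitutions are assumptions made to keep the example self-contained.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_stump(subset):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in subset if y == 0]
    xs1 = [x for x, y in subset if y == 1]
    if not xs0 or not xs1:
        # Degenerate bootstrap sample: always predict the only class seen.
        label = subset[0][1]
        return lambda x: label
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (mean0 + mean1) / 2
    above, below = (1, 0) if mean1 > mean0 else (0, 1)
    return lambda x: above if x > cut else below

def bagging_fit(data, k=5, seed=0):
    rng = random.Random(seed)
    # Bootstrap sampling: each subset is drawn with replacement.
    subsets = [[rng.choice(data) for _ in data] for _ in range(k)]
    # Base learners share no state, so they can be trained concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_stump, subsets))

def bagging_predict(learners, x):
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]

data = [(0.1, 0), (0.15, 0), (0.2, 0), (0.3, 0),
        (0.8, 1), (0.85, 1), (0.9, 1), (1.0, 1)]
learners = bagging_fit(data)
```

Because no base learner reads another learner's subset or result, the same structure maps naturally onto independent tasks in a distributed engine.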

The training task of the decision tree model comprises several job stages, each corresponding to one level of tree nodes in the model. In the first stage, m information gain ratio calculation tasks (TGR1.1-TGR1.m) are generated. Each information gain ratio calculation task handles the computations for one feature variable, such as its information entropy and information gain ratio, and submits the result to the corresponding tree node splitting task (TNS1) when finished. The tree node splitting task selects the best feature variable for the split according to the results received and completes the split of the first tree node of the current decision tree model. Assuming the best splitting feature variable of the current tree node is ff1, with value range {v1, v2, v3}, the current tree node is formed from ff1 and is split into 3 child nodes. After the split, the tree node splitting task distributes the result, including the splitting feature information, the value range, and the corresponding index table, to the relevant computing nodes so that the subsequent node splits of the tree can be computed.
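The gain ratio tasks and the node splitting task can be sketched sequentially as follows, assuming categorical features stored in dictionaries. Each call to `gain_ratio` corresponds to one independent TGR task, and `best_split` plays the role of the TNS task; the function names and data layout are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, feature):
    """Information gain ratio of splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum((len(g) / total) * math.log2(len(g) / total)
                      for g in groups.values())
    gain = entropy(labels) - conditional
    return gain / split_info if split_info > 0 else 0.0

def best_split(rows, labels, features):
    # Each gain_ratio call is independent of the others (one task per feature).
    ratios = {f: gain_ratio(rows, labels, f) for f in features}
    best = max(ratios, key=ratios.get)
    # One child node per distinct value of the chosen feature.
    children = sorted({row[best] for row in rows})
    return best, children

rows = [{"ff1": "v1", "ff2": "a"}, {"ff1": "v1", "ff2": "b"},
        {"ff1": "v2", "ff2": "a"}, {"ff1": "v2", "ff2": "b"},
        {"ff1": "v3", "ff2": "a"}, {"ff1": "v3", "ff2": "b"}]
labels = [0, 0, 1, 1, 1, 1]
feature, children = best_split(rows, labels, ["ff1", "ff2"])
```

In this toy data the label is fully determined by ff1, so ff1 is chosen and the node splits into one child per value, matching the three-child example in the text.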

After the sentiment analysis model and the dictionary corresponding to the target domain are generated, the sentiment analysis model can be customized. Large-scale general-domain sentiment corpora, the cloud platforms of service providers, and deep learning framework resources can be fully used, so the resulting personalized sentiment analysis model is low in cost, high in accuracy, and able to meet users' individual requirements.

Preferably, confirming the preliminarily predicted public opinion traffic according to the training set and the test set and determining the type of the public opinion traffic based on its influence comprises:

constructing an information interaction network for the HPV topic on Baidu Tieba using a web crawler and the social network analysis tool Gephi;

determining an accurate predicted value of the public opinion traffic according to the traffic components, the external influence factors, and the output of the decision tree traffic model;

correcting the model parameters of the decision tree traffic model according to the accurate predicted value of the public opinion traffic;

analyzing the actual influence of specific key nodes, and mining the factors that constrain influence in both direct and indirect respects;

importing the cleaned user data into Gephi for social network analysis, constructing the information interaction network as required, obtaining the node centrality index data of the network for the HPV topic on Baidu Tieba, and then calculating the key node influence function;

wherein the time-series feature data comprises at least one of global flow direction matrix information and associated network performance indexes, and the associated network performance indexes comprise at least one of packet loss rate, delay, and jitter.

A first source data set is crawled from the webpages of a preset first website list according to a preset public opinion news headline; the first source data set comprises financial data of the enterprise corresponding to the headline. First source data and a corresponding first influence weight are obtained from the first source data set and a preset first influence conversion strategy.

A second source data set is crawled from the webpages of a preset second website list according to the preset public opinion news headline; the second source data set comprises academic degree information corresponding to the headline. Second source data and a corresponding second influence weight are obtained from the second source data set and a preset second influence conversion strategy.

A third source data set is crawled from the webpages of a preset third website list according to the preset public opinion news headline; the third source data set comprises public opinion information corresponding to the headline. A third influence weight corresponding to the third source data set is obtained from the plurality of text data in the third source data set and a preset activeness model.

The influence value of the public opinion news headline is calculated from the first source data and its first influence weight, the second source data and its second influence weight, and the third source data set and its third influence weight. The public opinion news headlines are classified by industry field to obtain a classification result, which comprises sub-classification results for a plurality of industry fields. Within each sub-classification result, the influence values of the headlines are sorted in descending order to obtain a sorted sub-classification result for each industry field, and each sorted sub-classification result is sent to the corresponding target terminal for display.
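The weighting, classification and sorting steps above can be sketched in Python. The text does not give the formula for the influence value, so the weighted sum used here is an assumption, as are the function names, the dictionary fields and the sample numbers.

```python
from collections import defaultdict


def influence_value(sources):
    """Combine (score, weight) pairs into one influence value.

    The weighted-sum form is an assumption; the source only says the value
    is calculated from the source data and their influence weights.
    """
    return sum(score * weight for score, weight in sources)


def rank_by_industry(headlines):
    """Group headlines by industry field and sort each group by influence, descending."""
    groups = defaultdict(list)
    for h in headlines:
        groups[h["industry"]].append((h["title"], influence_value(h["sources"])))
    return {
        industry: sorted(items, key=lambda item: item[1], reverse=True)
        for industry, items in groups.items()
    }


# Hypothetical headlines: each carries three (score, weight) pairs, one per source.
headlines = [
    {"title": "A", "industry": "energy", "sources": [(0.8, 0.5), (0.4, 0.25), (0.6, 0.25)]},
    {"title": "B", "industry": "energy", "sources": [(0.2, 0.5), (0.4, 0.25), (0.8, 0.25)]},
    {"title": "C", "industry": "finance", "sources": [(0.5, 0.5), (0.5, 0.25), (0.5, 0.25)]},
]
ranked = rank_by_industry(headlines)
```

Each entry of `ranked` corresponds to one sorted sub-classification result, which would then be sent to the target terminal for that industry field.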

The first, second and third influence weights are based on the following indexes:

(1) Number of microblogs: reflects the popularity of the network event.

(2) Number of comments: reflects how heavily the event is being discussed.

(3) Number of forwards: more forwards indicate that more users are involved in the network event and are continuing to spread it.

(4) Number of independent participating users: reflects how many users are actually taking part in the discussion of the network event.

(5) User activity: serves the same purpose as average influence, and reflects the difference in user composition between a network space group event and an ordinary network event.

(7) Proportion of authenticated users: better reflects the user composition of the event and indirectly indicates whether paid posters (the "network water army") are participating in the network event.

(8) Proportion of paying users: serves the same purpose as the proportion of authenticated users; it better reflects the user composition of the event and indicates whether paid posters are participating in the network event.

Part of the data is normalized, which speeds up convergence of the classification model and improves the timeliness of the early warning system. The processed data are then classified by a pre-constructed classification model. If the identification result is an ordinary network event that has not developed into a network space group event (for example, a team winning the championship of a certain sports event), the identification result and the values of the early warning indexes need not be input into the early warning module. This improves the efficiency of the early warning system and prevents ordinary network events that need no early warning from occupying resources.
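A minimal Python sketch of this normalize, classify and filter flow follows. The text does not name the normalization method or the classification model, so min-max scaling and the threshold rule standing in for the pre-constructed classifier are assumptions, as are all names and sample values.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def events_needing_warning(events, classify):
    """Keep only the events that the classifier labels as group events."""
    return [e for e in events if classify(e) == "group_event"]


# Hypothetical comment counts for three network events.
comment_counts = [120, 4500, 980]
normalized = min_max_normalize(comment_counts)

events = [
    {"name": f"event_{i + 1}", "score": score}
    for i, score in enumerate(normalized)
]


def threshold_classifier(event):
    """Hypothetical stand-in for the pre-constructed classification model."""
    return "group_event" if event["score"] > 0.5 else "ordinary"


flagged = events_needing_warning(events, threshold_classifier)
```

Only the events in `flagged` would be passed on to the early warning module; ordinary events are dropped before they consume any further resources.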

Finally, the network flow detection method based on public opinion analysis of the invention can visualize the detected public opinion data as follows:

Data normalization: the values of the centrality indexes are mapped to real numbers in [0, 1]. The observed value of the j-th index of the i-th key node is x_j(i), where j = 1, 2, 3.

Determine the key node influence evaluation matrix X = (X_j(i))_{3×n}.

Determine the initial weight w_j(t-1). This weight always equals the final weight obtained in the previous evaluation, which ensures that influence carries over from one evaluation to the next. In the first evaluation the weight may be set to 0 or assigned according to specific criteria.

Determine the entropy value H_j of each index using the information entropy method. The formula uses an information entropy coefficient, and when f_j(i) = 0, the term f_j(i)·ln f_j(i) is taken to be 0.
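The entropy formula itself is not shown in this text. For orientation, the standard entropy weight method, which the surrounding wording matches, defines the quantities as sketched below. The symbol k for the information entropy coefficient and the exact form of f_j(i) are assumptions, not notation taken from this document.

```latex
f_j(i) = \frac{X_j(i)}{\sum_{m=1}^{n} X_j(m)}, \qquad
k = \frac{1}{\ln n}, \qquad
H_j = -k \sum_{i=1}^{n} f_j(i)\,\ln f_j(i)
```

With this choice of k, H_j lies in [0, 1] and reaches 1 when all n key nodes have the same value for index j.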

Calculate the entropy weight w_j(t) of each index. If this is the first evaluation and the initial weights are all 0, proceed directly to the step of calculating the comprehensive weight of each index.

Calculate the entropy weight increment Δw_j of each index:

Δw_j＝w_j(t)-w_j(t-1) (9)

Assign a weight μ_j to the entropy weight increment of each index; μ_j is the coefficient applied to Δw_j in formula (11).

Calculate the comprehensive weight W_j of each index. This weight dampens the sensitivity of the entropy weight and the fluctuation of key node influence, so that the influence of a key node remains continuous from one evaluation to the next.

W_j＝w_j(t)-μ_jΔw_j (11)

Calculate the evaluation result of each index for the different online visual network nodes. Because the normalized values lie in [0, 1], the evaluation value of each index cannot exceed the comprehensive weight of that index, that is, 0 ≤ Y_j(i) ≤ W_j.

Y_j(i)＝W_j·X_j(i) (12)

Y(i)＝∑Y_j(i) (13)
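The weighting and evaluation steps, including formulas (9) and (11) to (13), can be sketched in Python as follows. Several parts are assumptions: the entropy and entropy weight formulas follow the standard entropy weight method, μ_j is supplied by the caller because its formula is not shown in this text, and the first evaluation is read as using W_j = w_j(t) directly. Function names and sample data are hypothetical.

```python
import math


def entropy_weights(X):
    """Entropy weight w_j(t) per index; X[j][i] is the normalized value of
    index j for key node i, with at least two nodes.

    Assumes the standard entropy weight method: w_j = (1 - H_j) / sum(1 - H_j).
    """
    n = len(X[0])
    k = 1.0 / math.log(n)  # information entropy coefficient (assumed form)
    diversity = []
    for row in X:
        total = sum(row)
        h = 0.0
        for x in row:
            f = x / total if total > 0 else 0.0
            if f > 0:  # f * ln(f) is taken as 0 when f == 0
                h -= k * f * math.log(f)
        diversity.append(1.0 - h)
    s = sum(diversity)
    if abs(s) < 1e-12:  # every index is uniform: fall back to equal weights
        return [1.0 / len(X)] * len(X)
    return [d / s for d in diversity]


def comprehensive_weights(w_now, w_prev=None, mu=None):
    """Formulas (9) and (11): W_j = w_j(t) - mu_j * (w_j(t) - w_j(t-1)).

    With no previous weights (first evaluation) the entropy weights are
    returned unchanged, which is this sketch's reading of the text.
    """
    if w_prev is None:
        return list(w_now)
    return [w - m * (w - p) for w, p, m in zip(w_now, w_prev, mu)]


def evaluate(X, W):
    """Formulas (12) and (13): Y(i) = sum over j of W_j * X_j(i)."""
    n = len(X[0])
    return [sum(W[j] * X[j][i] for j in range(len(X))) for i in range(n)]


# Hypothetical normalized matrix: 3 indexes (rows) by 3 key nodes (columns).
X = [
    [1.0, 0.5, 0.0],
    [0.4, 0.4, 0.4],
    [0.0, 0.0, 1.0],
]
w = entropy_weights(X)
W_first = comprehensive_weights(w)  # first evaluation
W_later = comprehensive_weights(w, [0.3, 0.3, 0.4], [0.5, 0.5, 0.5])
Y = evaluate(X, W_first)
```

In this sample the second index has the same value for every node, so it carries no information and its entropy weight is essentially zero; the third index separates the nodes most sharply and receives the largest weight.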

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A network flow detection method based on public opinion analysis is characterized by comprising the following steps:

2. The public opinion traffic detection method according to claim 1, wherein the reading in of original traffic data to obtain time sequence feature data, and the dividing of all traffic data according to five-tuple information and data stream comprises:

preprocessing the acquired time sequence characteristic data;

3. The method as claimed in claim 2, wherein the extracting of statistical features and load features of data streams and the preliminary prediction of public opinion traffic by combining time series feature data includes:

4. The public opinion flow detection method according to claim 3, wherein the data cleaning of the feature set to remove noise samples identified by the isolation forest, and the dividing of the de-noised feature set into a training set and a testing set by random sampling without replacement, comprises:

5. The method as claimed in claim 4, wherein the determining the preliminarily predicted public opinion traffic according to the training set and the testing set and determining the public opinion traffic type of the public opinion traffic based on the influence of the public opinion traffic comprises: