CN112597141A - Network flow detection method based on public opinion analysis - Google Patents

Publication number
CN112597141A
Authority
CN
China
Prior art keywords
data
flow
public opinion
network
influence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011554194.1A
Other languages
Chinese (zh)
Other versions
CN112597141B (en
Inventor
张志伟
李钢锋
梁卫国
郭栋
王文辉
刘达
孙衡
徐晓强
王淦
吕显斌
曹华
齐云雷
闫昊
刘震
李鑫
王少伟
焦健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Shandong Electric Power Co Ltd
Priority to CN202011554194.1A
Publication of CN112597141A
Application granted
Publication of CN112597141B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Fuzzy Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a network traffic detection method based on public opinion analysis. The method comprises: reading in raw traffic data, obtaining time-series feature data, and dividing all traffic data into data flows according to five-tuple information; extracting statistical features and payload features of the data flows, and preliminarily predicting public opinion traffic in combination with the time-series feature data; cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement; and confirming the preliminarily predicted public opinion traffic using the training set and the test set, and determining the type of the public opinion traffic based on its influence. The method improves the prediction accuracy of public opinion traffic, reduces the prediction error, enables adaptive deep learning in the network public opinion analysis system, mines key nodes and their evolution from a dynamic perspective, and provides an influence-based detection method targeted at the various public opinion situations on the network.

Description

Network flow detection method based on public opinion analysis
Technical Field
The invention relates to the technical field of information source influence assessment and deep learning, and in particular to a network traffic detection method based on public opinion analysis.
Background
The quality of information sources is a precondition for the accuracy and quality of public opinion big data. For public opinion data to provide accurate support for public opinion analysis and prediction, identifying high-quality information sources among a massive number of sources is of great importance.
Evaluating public opinion information sources effectively is challenging. Two main approaches are used at home and abroad to evaluate the influence of websites: qualitative and quantitative. Most current studies of influence assessment use quantitative methods; that is, from the perspective of webometrics, they evaluate a website's influence through quantifiable indicators such as the number of inbound links, the number of outbound links, the web impact factor, and the site's traffic. However, little work addresses the influence of internet public opinion information sources, and few methods apply deep learning to that evaluation.
The sender of public opinion information is the information source, and the receiver is the netizen. The information source transmits public opinion information to netizens by publishing, forwarding, or quoting information. Netizens in turn express their degree of interest in various public opinion information by posting articles, clicking, replying, and so on, and these behaviors also imply how strongly the netizens are influenced by the information source. Therefore, when evaluating the influence of an online public opinion information source, the source's own behavior is considered first; it can be represented by factors such as the frequency with which it publishes articles. In addition, online public opinion spreads through the internet and is the sum of the cognition, attitudes, emotions, and behavioral tendencies that people form about an event when stimulated by it. Netizens, as the receivers influenced by public opinion information sources, also occupy a very important position in the spread of online public opinion. Therefore, netizens' attention to the information source is also considered when evaluating influence. This attention can be reflected by how often the source publishes information and by how often netizens click on and reply to that information. Finally, the degree of fit between the information source and the public opinion topic is considered.
The information explosion on social media marks the arrival of the big data era. As network users form various subcultural circles, a large number of non-standard expressions are continuously produced, and their use constitutes a huge corpus of online Chinese. These non-standard expressions are a major component of Chinese expression and are crucial to the effectiveness of Chinese natural language processing. Most existing network recognition systems are traditional lexicon-based systems that recognize non-standard expressions poorly. If the lexicon-based approach continues to be used, non-standard expressions cannot be recognized accurately, which causes important information to be lost or misjudged and poses many problems and challenges for natural language processing, public opinion analysis, and related tasks. In an era of explosive growth of network information, many social networks communicate using non-standard expressions, which makes the social network environment and public opinion very complex. A network public opinion analysis system based on deep learning can analyze actual network public opinion effectively, so building such a system is very important.
Disclosure of Invention
To address the monitoring and identification of public opinion data in current network traffic data, the application seeks protection for a network traffic detection method based on public opinion analysis.
The application seeks protection for a network traffic detection method based on public opinion analysis, characterized by comprising the following steps:
reading in raw traffic data, obtaining time-series feature data, and dividing all traffic data into data flows according to five-tuple information;
extracting statistical features and payload features of the data flows, and preliminarily predicting public opinion traffic in combination with the time-series feature data;
cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement;
and confirming the preliminarily predicted public opinion traffic using the training set and the test set, and determining the type of the public opinion traffic based on its influence.
The step of reading in raw traffic data, obtaining time-series feature data, and dividing all traffic data into data flows according to five-tuple information comprises:
preprocessing the acquired time-series feature data;
extracting the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol> from the packet header, comparing the extracted port number against the entries in a table to identify the flow to which the packet belongs, measuring the duration of each data flow, judging whether the duration of each data flow exceeds a set time threshold, and, if it does, performing a feature-weighted calculation on that data flow;
and preliminarily predicting public opinion traffic from the preprocessed time-series feature data.
The step of extracting statistical features and payload features of the data flows and preliminarily predicting public opinion traffic in combination with the time-series feature data comprises:
extracting specific fields, characters, and character strings from the application-layer payload of the data packets;
and inspecting a small range of the payload of the first N data packets in each direction of the data flow.
The step of cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement, comprises:
training and testing a Spark-based Bagging learning algorithm model with the training set and the test set, and obtaining a test result;
for the Spark-based Bagging learning algorithm, starting a Spark computing cluster and submitting the Bagging ensemble learning algorithm that is to be executed in parallel to the Spark driver;
running the submitted Bagging ensemble learning algorithm on the Spark driver, which triggers an action in the parallel program;
the SparkContext module computing and generating the corresponding RDD for the data set of the Bagging application, and generating the corresponding job at the same time;
after the jobs are submitted to the master node, the DAG scheduler computing a DAG and a corresponding task set for each job and submitting them to the task scheduler;
the task scheduler handling the physical scheduling and execution of each task in the task set, using the scheduler backend to compute the resources each task requires, assigning tasks to executors according to the scheduler backend's result, and starting the corresponding executors.
The method reads in raw traffic data, obtains time-series feature data, divides all traffic data into data flows according to five-tuple information, extracts statistical features and payload features of the data flows, and preliminarily predicts public opinion traffic in combination with the time-series feature data; it cleans the feature set by removing the noise samples identified by an isolation forest, and divides the denoised feature set into a training set and a test set by random sampling without replacement; and it confirms the preliminarily predicted public opinion traffic using the training set and the test set, and determines the type of the public opinion traffic based on its influence. The method improves the prediction accuracy of public opinion traffic, reduces the prediction error, enables adaptive deep learning in the network public opinion analysis system, mines key nodes and their evolution from a dynamic perspective, and provides an influence-based detection method targeted at the various public opinion situations on the network.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a public opinion analysis-based network traffic detection method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention provides a network traffic detection method based on public opinion analysis, characterized by comprising:
reading in raw traffic data, obtaining time-series feature data, and dividing all traffic data into data flows according to five-tuple information;
extracting statistical features and payload features of the data flows, and preliminarily predicting public opinion traffic in combination with the time-series feature data;
cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement;
and confirming the preliminarily predicted public opinion traffic using the training set and the test set, and determining the type of the public opinion traffic based on its influence.
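The cleaning and splitting step can be illustrated with a minimal sketch. It assumes feature rows are numeric tuples, uses a small hand-written isolation forest rather than any particular library, and treats the score threshold of 0.75 and the 25% test fraction as hypothetical values; the source specifies neither.

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth, max_depth):
    """Grow one isolation tree by splitting on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    f = rng.randrange(len(rows[0]))
    lo, hi = min(r[f] for r in rows), max(r[f] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    return {
        "f": f,
        "split": split,
        "left": build_tree([r for r in rows if r[f] < split], rng, depth + 1, max_depth),
        "right": build_tree([r for r in rows if r[f] >= split], rng, depth + 1, max_depth),
    }

def path_length(x, node):
    """Number of splits needed to reach x's leaf, plus the leaf-size correction."""
    depth = 0
    while "f" in node:
        node = node["left"] if x[node["f"]] < node["split"] else node["right"]
        depth += 1
    return depth + c(node["size"])

def anomaly_scores(rows, n_trees=50, sample_size=256, seed=0):
    """Isolation-forest score in (0, 1]; values near 1 mean 'easy to isolate'."""
    rng = random.Random(seed)
    m = min(sample_size, len(rows))
    if m < 2:
        return [0.0 for _ in rows]
    max_depth = math.ceil(math.log2(m))
    trees = [build_tree(rng.sample(rows, m), rng, 0, max_depth) for _ in range(n_trees)]
    return [2 ** (-sum(path_length(x, t) for t in trees) / (n_trees * c(m))) for x in rows]

def denoise_and_split(rows, threshold=0.75, test_fraction=0.25, seed=0):
    """Drop rows scored as noise, then split the rest without replacement."""
    scores = anomaly_scores(rows, seed=seed)
    clean = [r for r, s in zip(rows, scores) if s < threshold]
    shuffled = random.Random(seed).sample(clean, len(clean))  # a permutation: no row repeats
    n_test = int(len(clean) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Because the split is a permutation of the cleaned rows, no sample can appear in both the training set and the test set, which is the property that sampling without replacement is meant to guarantee.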
In this method, positive sample data are constructed from purposefully crawled data, which raises the quality of the positive samples. Combining PU-learning yields more reliable negative samples, those farther from the positive samples, from the unlabeled samples. Combining PU-learning with multi-model co-training achieves a relatively good result under the condition, common in industry, of few positive samples and many unlabeled samples. The problems and events that business personnel care about are identified from public opinion data with relatively high accuracy and are pushed as timely early warnings, which greatly improves the efficiency of business personnel and makes it easier for them to take risk management measures based on the identification results.
Preferably, the step of reading in raw traffic data, obtaining time-series feature data, and dividing all traffic data into data flows according to five-tuple information comprises:
preprocessing the acquired time-series feature data;
extracting the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol> from the packet header, comparing the extracted port number against the entries in a table to identify the flow to which the packet belongs, measuring the duration of each data flow, judging whether the duration of each data flow exceeds a set time threshold, and, if it does, performing a feature-weighted calculation on that data flow;
and preliminarily predicting public opinion traffic from the preprocessed time-series feature data.
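The flow-splitting sub-steps above can be sketched as follows. The `Packet` record, the function names, and the three-entry `PORT_TABLE` are illustrative assumptions; the source names neither the table contents nor the threshold value, and the feature-weighted calculation itself is not specified, so the sketch only selects the flows that would receive it.

```python
from collections import defaultdict, namedtuple

# Hypothetical packet record; only header fields needed for the five-tuple are kept.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto length")

# Illustrative port table (well-known ports); the source does not list its entries.
PORT_TABLE = {80: "WWW", 25: "MAIL", 21: "FTP"}

def split_into_flows(packets):
    """Group packets by five-tuple, keeping each flow in arrival order."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key].append(p)
    return dict(flows)

def identify_service(key):
    """Look up the destination port, then the source port, in the port table."""
    _, _, src_port, dst_port, _ = key
    return PORT_TABLE.get(dst_port) or PORT_TABLE.get(src_port) or "UNKNOWN"

def flows_over_threshold(flows, time_threshold):
    """Keep the flows whose duration (last minus first timestamp) exceeds the threshold."""
    return {k: v for k, v in flows.items() if v[-1].ts - v[0].ts > time_threshold}
```

The five-tuple here is directional, so the two directions of one conversation form two flows; merging them would require normalizing the key, which the source does not discuss.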
Further, the step of extracting statistical features and payload features of the data flows and preliminarily predicting public opinion traffic in combination with the time-series feature data comprises:
extracting specific fields, characters, and character strings from the application-layer payload of the data packets;
and inspecting a small range of the payload of the first N data packets in each direction of the data flow.
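One way to picture the shallow payload inspection is the sketch below. The signature strings, the direction labels, and the `depth` limit of 64 bytes are assumptions made for illustration; the source does not say which fields or strings are matched or how large the "small range" is.

```python
def classify_by_payload(packets, signatures, n=5, depth=64):
    """Match byte signatures in the first `depth` bytes of the first `n` packets per direction.

    packets: list of (direction, payload_bytes) in arrival order.
    signatures: dict mapping a label to a byte string to look for.
    Returns the first matching label, or None if nothing matches.
    """
    seen = {}
    for direction, payload in packets:
        seen[direction] = seen.get(direction, 0) + 1
        if seen[direction] > n:
            continue  # beyond the first n packets of this direction
        window = payload[:depth]
        for label, pattern in signatures.items():
            if pattern in window:
                return label
    return None

# Example signatures (illustrative only).
EXAMPLE_SIGNATURES = {"WWW": b"HTTP/1.", "P2P": b"BitTorrent protocol"}
```

Limiting both the packet count and the byte depth keeps the per-flow inspection cost bounded regardless of how long the flow runs.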
The method further includes: collecting several real network traffic data sets and combining them into a set of TCP and UDP flows;
the traffic data set contains the information needed to determine the flow protocol type and to extract feature fields by port number, payload inspection, and similar means, together with the feature parameters required by subsequent steps, such as packet length and inter-packet interval. The real traffic data set described in S101 may be captured by probes in a network that carries many users, and it also covers real traffic obtained in other ways, for example by deliberately generating traffic of a specific service type at certain terminals and collecting it along the transmission path. After collection, the traffic is separated into different flows, namely TCP and UDP flows, by the five-tuple {source address, destination address, source port, destination port, transport layer protocol type}, so that the traffic data set becomes a set of TCP and UDP flows.
The data flows are divided into flow blocks according to port numbers, the port numbers and protocol feature fields are extracted, and the head features of each flow are extracted. As above, the traffic is separated into TCP and UDP flows by the five-tuple {source address, destination address, source port, destination port, transport layer protocol type}. The head of a TCP flow is determined from, but not limited to, the TCP SYN, SYN/ACK, and ACK packets, and the packets in a flow must be ordered by their arrival at the observation point. UDP flows have no obvious start packet, so they are usually delimited by a time limit and are likewise divided into separate UDP flows by five-tuple. The service protocol types of the TCP and UDP flows, such as WWW, MAIL, FTP, P2P, Service, and IM, are obtained by protocol analysis. Then, for the TCP and UDP flows, the statistical features of the packets in each flow are extracted and a feature sequence is constructed in packet order. That is, for the first several packets of each flow, the values of attributes such as packet length, corrected inter-packet interval, and transmission direction are extracted. The number of packets can be adjusted according to test conditions; experiments indicate that 5 to 10 is ideal. Because packet lengths and corrected intervals span a wide range and must be discretized, the data may need to be normalized, for example by applying a log or arctan function to the raw values, and a suitable discretization scale is then selected.
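The head-feature sequence described above might look like the following sketch. It uses the plain inter-arrival time in place of the "corrected" interval (the source does not define the correction), log scaling for normalization, and hypothetical bounds (`max_len`, `max_gap`) and bin count.

```python
import math

def quantize(value, max_value, bins):
    """Log-scale a non-negative value into [0, 1], then map it to one of `bins` levels."""
    clipped = min(max(value, 0.0), max_value)
    scaled = math.log1p(clipped) / math.log1p(max_value)
    return min(int(scaled * bins), bins - 1)

def head_features(packets, k=5, bins=8, max_len=1500, max_gap=10.0):
    """Feature sequence for the first k packets of one flow.

    packets: list of (timestamp_seconds, length_bytes, direction) in arrival order.
    Returns a list of (length_level, interval_level, direction) tuples.
    """
    sequence = []
    prev_ts = None
    for ts, length, direction in packets[:k]:
        gap = 0.0 if prev_ts is None else ts - prev_ts
        prev_ts = ts
        sequence.append((quantize(length, max_len, bins), quantize(gap, max_gap, bins), direction))
    return sequence
```

Log scaling spends more of the discrete levels on small packets and short gaps, where most values fall, which is the usual motivation for normalizing before choosing a discretization scale.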
The network traffic is adaptively decomposed by EMD into single-frequency IMFs. All IMF components are then clustered with an improved K-means algorithm so that IMFs of similar complexity are grouped together, and each cluster of IMF components is predicted with a Kalman model, which has few parameters and is fast to compute. This yields low complexity and high prediction accuracy.
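To make the Kalman prediction step concrete, here is a minimal scalar sketch. It assumes a random-walk state model and hypothetical noise variances `q` and `r`; the source does not give the state model or the parameters used for the clustered IMF components.

```python
def kalman_forecast(series, q=1e-3, r=1e-1):
    """One-step-ahead forecast of a scalar series with a random-walk Kalman filter.

    q: process noise variance, r: measurement noise variance (both assumed values).
    """
    if not series:
        raise ValueError("series must not be empty")
    x, p = series[0], 1.0          # initial state estimate and its variance
    for z in series[1:]:
        p = p + q                  # predict: state unchanged, uncertainty grows
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update with the new observation
        p = (1.0 - k) * p
    return x                       # under a random walk, the forecast equals the last estimate
```

Each update is a handful of arithmetic operations, which is the sense in which such a model is cheap to run on every IMF cluster.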
Data preprocessing: the raw traffic data are acquired and preprocessed in a way that preserves the basic characteristics of the network traffic and makes subsequent processing convenient.
Modal decomposition: the preprocessed, normalized traffic time series is decomposed by the EEMD algorithm into a group of subsequences (IMFs) with simple components, ordered from high frequency to low frequency. White noise is added before each decomposition, and all IMFs are finally obtained by computing the unique residue. The complex original time series is thus decomposed into several subsequences that are easy to analyze and model, which resolves the mode-mixing problem of the EMD algorithm.
Component denoising: the IMF components that contain more noise are selected from the IMF subsequences by a constructed autocorrelation-function energy criterion and are denoised with the soft threshold function used in wavelet denoising. Both the denoised IMFs and the IMFs that needed no denoising are used in the subsequent steps.
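The soft threshold function referred to in the denoising step is the standard shrinkage rule: values within the threshold are set to zero and the rest are pulled toward zero by the threshold amount. A minimal sketch follows; how the threshold `t` is chosen is left to the caller, since the source does not state it.

```python
import math

def soft_threshold(values, t):
    """Shrink each value toward zero by t; values with magnitude <= t become zero."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in values]
```

For example, with `t = 1.0` the inputs `3.0` and `-3.0` become `2.0` and `-2.0`, while `0.5` is suppressed to zero.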
Component prediction: the stationarity of each IMF component is tested, and the stationary and non-stationary IMFs are predicted separately. The non-stationary IMF components are predicted with an Elman neural network whose weights are updated by gradient descent; training stops when the error E between the predicted and true public opinion traffic values meets a set threshold or the number of iterations reaches its upper limit, and the best predicted value is then output. For the stationary IMF components, the model order is determined by the AIC criterion, the parameters are estimated by least squares, and the sequence is predicted with the resulting ARMA model.
Modal reconstruction: the prediction results of all IMF subsequences are summed and passed through the inverse normalization formula to obtain the final predicted value of the overall public opinion traffic.
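The reconstruction step can be sketched as below, assuming min-max normalization was used during preprocessing; the source mentions an inverse normalization formula without saying which normalization it inverts.

```python
def normalize(series):
    """Min-max normalize a series to [0, 1]; also return the bounds needed to invert it."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series], lo, hi
    return [(v - lo) / (hi - lo) for v in series], lo, hi

def reconstruct(component_forecasts, lo, hi):
    """Sum the per-IMF forecasts and map the total back to the original traffic scale."""
    total = sum(component_forecasts)
    return total * (hi - lo) + lo
```

Summing before inverting matters: the IMFs add up to the normalized series, so the offset `lo` must be added once to the total rather than once per component.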
(1) Existing news about enterprise funding problems is crawled using keyword combinations and labeled as positive samples for funding problems. Meanwhile, sound enterprises without funding problems, such as Tencent and Alibaba, are used as keywords to crawl related news as unlabeled samples. Because this news may still include some items about funding problems, it cannot be called negative samples; such unknown samples are called unlabeled samples. The result is a small number of positive samples (crawled from the web and partly confirmed by manual labeling) and a large number of unlabeled samples. (2) Using positive-unlabeled learning (PU-learning), samples whose cosine distance from the positive samples is as large as possible are found iteratively in the large unlabeled sample set from step (1). These are regarded as more reliable negative samples and, together with the positive samples, form the training set.
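A single selection pass of step (2) can be sketched as follows. It assumes each news item has already been turned into a numeric vector and measures distance to the centroid of the positive samples; the source describes an iterative procedure and does not specify the vectorization or the reference point, so this shows only one round.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def reliable_negatives(positives, unlabeled, n_neg):
    """Pick the n_neg unlabeled vectors least similar to the positive centroid."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    ranked = sorted(unlabeled, key=lambda u: cosine_similarity(u, centroid))
    return ranked[:n_neg]
```

In an iterative variant, a classifier trained on the positives and these negatives would rescore the remaining unlabeled samples, and the selection would repeat until it stabilizes.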
Preferably, the step of cleaning the feature set by removing the noise samples identified by an isolation forest, and dividing the denoised feature set into a training set and a test set by random sampling without replacement, comprises:
training and testing a Spark-based Bagging learning algorithm model with the training set and the test set, and obtaining a test result;
for the Spark-based Bagging learning algorithm, starting a Spark computing cluster and submitting the Bagging ensemble learning algorithm that is to be executed in parallel to the Spark driver;
running the submitted Bagging ensemble learning algorithm on the Spark driver, which triggers an action in the parallel program;
the SparkContext module computing and generating the corresponding RDD for the data set of the Bagging application, and generating the corresponding job at the same time;
after the jobs are submitted to the master node, the DAG scheduler computing a DAG and a corresponding task set for each job and submitting them to the task scheduler;
the task scheduler handling the physical scheduling and execution of each task in the task set, using the scheduler backend to compute the resources each task requires, assigning tasks to executors according to the scheduler backend's result, and starting the corresponding executors.
Here, using a stratified random forest algorithm, a non-negative function is applied to each attribute Ai
Figure BDA0002858478980000081
to calculate the information value of Ai
Figure BDA0002858478980000082
and the resulting value is then normalized to obtain θi. A decision tree classifier hi(Xi) is constructed from each training data subset Xi. At each node, attributes are randomly drawn from As and Aw in proportion to form an attribute subspace of size p (> 1). A Boolean test function τ is obtained from the attribute values of the subspace and divides the training data between the left and right child nodes. The process iterates until one of the following stop conditions is met: all data belong to the same category, every attribute has the same value, or the number of training data falls below a minimum value. The K unpruned decision tree classifiers h1(X1), h2(X2), ..., hK(XK) are combined into a random forest, and a voting strategy over the decision trees serves as the forest's classification decision method.
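Two pieces of this procedure lend themselves to a short sketch: drawing the attribute subspace from the two groups and combining trees by vote. The sketch reads As and Aw as two non-empty attribute groups (the source does not define them), draws from each in proportion to its size with at least one attribute per group, and treats each tree as a plain callable.

```python
import random
from collections import Counter

def sample_subspace(group_s, group_w, p, rng):
    """Draw p attributes from two groups in proportion to their sizes, at least one from each."""
    total = len(group_s) + len(group_w)
    if not group_s or not group_w or not (1 < p <= total):
        raise ValueError("need two non-empty groups and 1 < p <= total attribute count")
    n_s = round(p * len(group_s) / total)
    n_s = max(1, min(n_s, p - 1, len(group_s)))   # keep room for at least one from group_w
    n_s = max(n_s, p - len(group_w))              # group_w cannot supply more than it has
    return rng.sample(group_s, n_s) + rng.sample(group_w, p - n_s)

def forest_vote(trees, x):
    """Majority vote over the predictions of the individual trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]
```

Sampling from both groups at every node keeps informative attributes available to each split even when they are a small minority of all attributes.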
Preparing target training data for the first classifier: in the training feature data, each quoted result greater than 0 is converted to 1 and each quoted result equal to 0 is left unchanged, which yields the target training data of the first classifier.
Preparing target training data for the second classifier: in the training feature data, each quoted result greater than or equal to a first preset threshold is converted to 1 and each quoted result less than the first preset threshold is converted to 0, which yields the target training data of the second classifier.
Preparing target training data for the first regressor: training public opinion items whose quoted result is greater than or equal to the first preset threshold are removed, together with their training feature data and quoted results, and a log2(1 + x) transformation is applied to the quoted results of the remaining items, which yields the target training data of the first regressor.
Preparing target training data for the second regressor: training public opinion items whose quoted result is less than the first preset threshold are removed, together with their training feature data and quoted results, and a log2(1 + x) transformation is applied to the quoted results of the remaining items, which yields the target training data of the second regressor.
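The four target preparations above can be sketched in a few lines. This is an illustrative sketch only; the function name `prepare_targets` and the representation of the training data as parallel lists of feature rows and non-negative quoted results are assumptions.

```python
import math


def prepare_targets(features, quoted, threshold):
    """Build targets for the two classifiers and the two regressors.

    features:  list of feature rows (any objects)
    quoted:    list of non-negative quoted results, one per row
    threshold: the first preset threshold
    """
    rows = list(zip(features, quoted))
    # First classifier: results > 0 become 1, results equal to 0 stay 0.
    clf1 = [(x, 1 if y > 0 else y) for x, y in rows]
    # Second classifier: results >= threshold become 1, all others become 0.
    clf2 = [(x, 1 if y >= threshold else 0) for x, y in rows]
    # First regressor: keep rows below the threshold, then apply log2(1 + x).
    reg1 = [(x, math.log2(1 + y)) for x, y in rows if y < threshold]
    # Second regressor: keep rows at or above the threshold, then apply log2(1 + x).
    reg2 = [(x, math.log2(1 + y)) for x, y in rows if y >= threshold]
    return clf1, clf2, reg1, reg2


clf1, clf2, reg1, reg2 = prepare_targets(["a", "b", "c", "d"], [0, 3, 7, 15], threshold=7)
```

The two regressors see disjoint parts of the data, split at the same threshold that the second classifier learns to predict.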
A preset initial emotion analysis model is trained with the training set, the trained model is validated with the verification set and then tested with the testing set, and an emotion analysis model and a dictionary corresponding to the target field are generated.
The training process of the base learners of the Bagging algorithm and the feature data calculations within each base learner are both parallelized. Each base learner is trained independently on a sub-training set generated by bootstrap sampling, so there is no logical dependence or data dependence between base learners. Within a base learner, splitting a decision tree node requires calculating the information gain ratio of every feature in the current feature subset and selecting the feature with the highest information gain ratio; tree nodes at the same level do not depend on one another. The naive Bayes algorithm only needs to calculate the conditional probability of each feature separately, so that model likewise has no logical or data dependence. Consequently, when the decision tree algorithm and the naive Bayes algorithm are used in the construction process, the calculation tasks are free of logical and data dependence.
Outer-layer training: all base learners are trained in parallel. First, k training subsets are constructed from the training data set by bootstrap sampling, where k is the number of base learners; then each base learner is trained on its own training subset with its learning algorithm, giving k trained base learners in total.
Inner-layer training: the calculation tasks within the training of each base learner are performed in parallel. Tree nodes at the same level are split in parallel by computing the feature variables of the current training subset simultaneously, and the naive Bayes model computes the conditional probabilities of all features simultaneously.
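The outer-layer scheme can be illustrated with a small sketch. It assumes nothing about Spark: a thread pool stands in for the cluster purely to show that the k training jobs are independent, and `train_majority_stub` is a hypothetical placeholder for a real decision tree or naive Bayes learner.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def bootstrap_subsets(data, k, seed=0):
    """Draw k bootstrap samples (with replacement), each as large as the data set."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(k)]


def train_bagging(data, k, train_fn, max_workers=4):
    """Outer-layer training: the k base learners are trained independently."""
    subsets = bootstrap_subsets(data, k)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_fn, subsets))


def predict(models, x):
    """Combine the base learners by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


def train_majority_stub(subset):
    """Hypothetical stand-in learner: always predicts the subset's most common label."""
    label = Counter(y for _, y in subset).most_common(1)[0][0]
    return lambda x: label


data = [(i, "pos") for i in range(10)]
models = train_bagging(data, k=5, train_fn=train_majority_stub)
label = predict(models, 3)
```

Because no base learner reads another learner's subset or result, the same structure maps onto one distributed task per learner; threads are used here only to keep the sketch self-contained.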
The training task of the decision tree model comprises several operation stages, one for each tree node level of the decision tree model. In the first stage, m information gain ratio calculation tasks (TGR1.1-TGR1.m) are generated. An information gain ratio calculation task handles the calculations for the feature variables, such as their information entropy and information gain ratio, and submits its result to the corresponding tree node splitting task (TNS1) when the calculation is finished. The tree node splitting task selects the best feature variable for the split from the results it receives and completes the split of the first tree node of the current decision tree model. Suppose the best splitting feature variable of the current tree node is ff1 and the value range of ff1 is {v1, v2, v3}; the current tree node is then formed from ff1, and 3 child nodes are split off below it. After the split, the tree node splitting task distributes a result containing the splitting feature information, the value range, the corresponding index table, and related content to the relevant computing nodes, which use it to compute the subsequent node splits of the tree.
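A single-machine sketch of one such stage is given below. It assumes categorical feature variables and the usual C4.5 definition of the information gain ratio (information gain divided by split information); the patent text does not spell out the formula, so this definition and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain-ratio task: information gain of one feature divided by its split information."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - conditional
    return gain / split_info if split_info > 0 else 0.0


def best_split(rows, labels, features):
    """Node-splitting task: pick the feature with the highest gain ratio."""
    return max(features, key=lambda f: gain_ratio(rows, labels, f))


rows = [
    {"ff1": "v1", "ff2": "a"},
    {"ff1": "v1", "ff2": "b"},
    {"ff1": "v2", "ff2": "a"},
    {"ff1": "v2", "ff2": "b"},
]
labels = ["x", "x", "y", "y"]
chosen = best_split(rows, labels, ["ff1", "ff2"])
```

In the parallel design described above, each call to `gain_ratio` would correspond to one TGR task and `best_split` to the TNS task that collects their results.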
Once the emotion analysis model and the dictionary corresponding to the target field have been generated, the emotion analysis model can be customized. Large-scale emotion corpora from the general field, the cloud platforms of service providers, and deep learning framework resources can be fully used, and the resulting personalized emotion analysis model is low in cost, high in precision, and able to meet the personalized requirements of users.
Preferably, the determining the preliminarily predicted public opinion traffic according to the training set and the testing set and the determining the public opinion traffic type of the public opinion traffic based on the influence of the public opinion traffic comprise:
constructing an information interaction relationship network for the HPV topic on Baidu Tieba by using a web crawler and the social network analysis tool Gephi;
determining an accurate public opinion flow predicted value according to the flow components, the external influence factors and the output result of the decision tree flow model;
correcting model parameters of the decision tree flow model according to the accurate public opinion flow prediction value;
analyzing the actual influence of the specific key nodes, and mining the factors that constrain that influence, both direct and indirect;
importing the cleaned user data into Gephi for social network analysis, constructing an information interaction relationship network as required, obtaining the network node centrality index data for the HPV topic on Baidu Tieba, and then calculating a key node influence function;
the time sequence characteristic data comprises at least one of time sequence characteristic data global flow direction matrix information and associated network performance indexes, wherein the associated network performance indexes comprise at least one of packet loss rate, time delay and jitter rate.
A first source data set is crawled from the webpages in a preset first website list according to a preset public opinion news title; the first source data set comprises financial data of the enterprise corresponding to the public opinion news title. First source data and a corresponding first influence weight are obtained from the first source data set and a preset first influence conversion strategy.
A second source data set is crawled from the webpages in a preset second website list according to the preset public opinion news title; the second source data set comprises academic degree information corresponding to the public opinion news title. Second source data and a corresponding second influence weight are obtained from the second source data set and a preset second influence conversion strategy.
A third source data set is crawled from the webpages in a preset third website list according to the preset public opinion news title; the third source data set comprises public opinion information corresponding to the public opinion news title. A third influence weight corresponding to the third source data set is obtained from the text data included in the third source data set and a preset activeness model.
The influence value corresponding to the public opinion news title is calculated from the first influence weight, the second influence weight, and the third influence weight. The public opinion news titles are classified by industry field to obtain a classification result, which comprises sub-classification results for a plurality of industry fields.
Within each sub-classification result, the influence values of the public opinion news titles are sorted in descending order to obtain a sorted sub-classification result for each industry field, and each sorted sub-classification result is sent to the corresponding target terminal for display.
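The combination and ranking steps can be sketched as follows. The text does not state how the three influence weights are combined, so the weighted sum in `influence_value`, its default coefficients, and the record fields `title`, `industry`, and `weights` are all assumptions made for illustration.

```python
from collections import defaultdict


def influence_value(weights, coeffs=(1.0, 1.0, 1.0)):
    """Combine the three influence weights; a weighted sum is assumed here."""
    return sum(c * w for c, w in zip(coeffs, weights))


def rank_by_industry(headlines):
    """Group headlines by industry field and sort each group by influence, descending."""
    groups = defaultdict(list)
    for h in headlines:
        groups[h["industry"]].append((influence_value(h["weights"]), h["title"]))
    return {
        industry: [title for _, title in sorted(items, key=lambda it: it[0], reverse=True)]
        for industry, items in groups.items()
    }


headlines = [
    {"title": "A", "industry": "energy", "weights": (1, 2, 3)},
    {"title": "B", "industry": "energy", "weights": (4, 4, 4)},
    {"title": "C", "industry": "finance", "weights": (0, 1, 1)},
]
ranked = rank_by_industry(headlines)
```

Each value in `ranked` corresponds to one sorted sub-classification result that would be sent to its target terminal.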
The first, second and third influence weights are the following indexes:
(1) Number of microblogs: reflects the popularity of the network event.
(2) Number of comments: reflects how actively the event is being discussed.
(3) Number of forwards: more forwards indicate that more users are involved in the network event and that it continues to spread.
(4) Number of independent participating users: reflects how many users actually take part in the discussion of the network event.
(5) User activity: serves the same purpose as the average influence, and reflects the difference in user composition between a cyberspace group event and a common network event.
(7) Proportion of verified users: better reflects the user composition of the event and indirectly indicates whether an Internet water army (paid posters) is participating in the network event.
(8) Proportion of paid users: serves the same purpose as the proportion of verified users; it better reflects the user composition of the event and indicates whether an Internet water army is participating in the network event.
Part of the data is normalized, which improves the convergence rate of the classification model and the timeliness of the early warning system. The processed data are then classified by a pre-constructed classification model. If the identification result is a common network event that has not developed into a cyberspace group event (for example, a team winning the championship of a sports event), the identification result and the early warning index values do not need to be passed to the early warning module. This improves the efficiency of the early warning system and prevents common network events that need no early warning from occupying resources.
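As an illustration of how such indexes might be computed and gated, the sketch below derives the count and ratio indexes from a list of post records, applies min-max normalization, and forwards only events labeled as group events. The record fields (`user`, `comments`, `forwards`, `verified`, `paid`), the label string `"group_event"`, and the choice of min-max normalization are assumptions; the text does not specify them.

```python
def event_indicators(posts):
    """Compute count-type and ratio-type indexes from post records of one event."""
    verified = {p["user"]: p["verified"] for p in posts}  # one entry per independent user
    paid = {p["user"]: p["paid"] for p in posts}
    n_users = len(verified)
    return {
        "microblogs": len(posts),
        "comments": sum(p["comments"] for p in posts),
        "forwards": sum(p["forwards"] for p in posts),
        "independent_users": n_users,
        "verified_ratio": sum(verified.values()) / n_users if n_users else 0.0,
        "paid_ratio": sum(paid.values()) / n_users if n_users else 0.0,
    }


def min_max(values):
    """Scale a list of values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def forward_to_warning(label, indicators):
    """Pass indexes on only when the classifier labels the event a group event."""
    return indicators if label == "group_event" else None


posts = [
    {"user": "u1", "comments": 2, "forwards": 1, "verified": True, "paid": False},
    {"user": "u1", "comments": 0, "forwards": 3, "verified": True, "paid": False},
    {"user": "u2", "comments": 5, "forwards": 0, "verified": False, "paid": False},
]
indicators = event_indicators(posts)
```

Returning `None` for common events mirrors the point made above: events that need no early warning never reach the warning module.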
Finally, the network flow detection method based on public opinion analysis of the invention can visualize the detected public opinion data as follows:
Data normalization: the value of each centrality index is mapped to a real number in $[0, 1]$. The observed value of the $j$-th index of the $i$-th key node is $x_j(i)$, where $j = 1, 2, 3$.
[Formula image BDA0002858478980000121]
Determine the key node influence evaluation matrix $X = (X_j(i))_{3 \times n}$.
Determine the initial weight $w_j(t-1)$. This weight always equals the final weight obtained in the previous evaluation, which ensures that influence carries over from one evaluation to the next. For the first evaluation, the weight may be set to 0 or assigned according to specific criteria.
Determine the entropy value $H_j$ of each index by the information entropy method, where
[Formula image BDA0002858478980000131]
is the information entropy coefficient,
[Formula image BDA0002858478980000132]
and, when $f_j(i) = 0$, $f_j(i) \cdot \ln f_j(i) = 0$.
[Formula image BDA0002858478980000133]
Calculate the entropy weight $w_j(t)$ of each index. If this evaluation is the first one and all initial weights are 0, proceed directly to the step of calculating the comprehensive weight of each index.
[Formula image BDA0002858478980000134]
Calculate the entropy weight increment $\Delta w_j$ of each index:
$$\Delta w_j = w_j(t) - w_j(t-1) \tag{9}$$
Assign a weight to the entropy weight increment of each index (the coefficient $\mu_j$ used in formula (11)), where
[Formula image BDA0002858478980000135]
[Formula image BDA0002858478980000136]
Calculate the comprehensive weight $W_j$ of each index. This weight damps the sensitivity of the entropy weight to changes in key node influence, so that the influence of a key node remains continuous across evaluations.
$$W_j = w_j(t) - \mu_j \Delta w_j \tag{11}$$
Calculate the evaluation results of each index for the different online visual network nodes. The evaluation value of each index cannot exceed the comprehensive weight of that index, that is, $0 \le Y_j(i) \le W_j$.
$$Y_j(i) = W_j \cdot X_j(i) \tag{12}$$
$$Y(i) = \sum_j Y_j(i) \tag{13}$$
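The evaluation procedure can be sketched end to end as below. Formulas (9), (11), (12), and (13) are taken from the text. The formulas that appear only as images are not recoverable from it, so the sketch substitutes the standard entropy-weight method as an assumption: min-max normalization, $f_j(i) = X_j(i) / \sum_i X_j(i)$, entropy coefficient $k = 1/\ln n$, and $w_j = (1 - H_j) / \sum_j (1 - H_j)$. The coefficients $\mu_j$ are passed in by the caller because their definition is also not available.

```python
import math


def min_max_rows(x):
    """Normalize each index (one row per index) to [0, 1]."""
    out = []
    for row in x:
        lo, hi = min(row), max(row)
        out.append([0.0 if hi == lo else (v - lo) / (hi - lo) for v in row])
    return out


def entropy_weights(xn):
    """Entropy weight of each index (assumed standard entropy-weight method)."""
    n = len(xn[0])
    if n < 2:
        raise ValueError("need at least two key nodes")
    k = 1.0 / math.log(n)  # assumed information entropy coefficient
    d = []
    for row in xn:
        total = sum(row)
        f = [v / total if total else 0.0 for v in row]
        h = -k * sum(p * math.log(p) for p in f if p > 0)  # f*ln(f) is 0 when f = 0
        d.append(1.0 - h)
    s = sum(d)
    return [dj / s for dj in d] if s else [1.0 / len(d)] * len(d)


def evaluate(x, w_prev, mu):
    """Return the comprehensive weights W_j and the node scores Y(i)."""
    xn = min_max_rows(x)
    w_t = entropy_weights(xn)
    if all(w == 0 for w in w_prev):
        big_w = w_t  # first evaluation: skip the increment step
    else:
        delta = [a - b for a, b in zip(w_t, w_prev)]             # formula (9)
        big_w = [a - m * d for a, m, d in zip(w_t, mu, delta)]   # formula (11)
    y = [[wj * v for v in row] for wj, row in zip(big_w, xn)]    # formula (12)
    scores = [sum(col) for col in zip(*y)]                       # formula (13)
    return big_w, scores


x = [[1, 2, 3], [10, 10, 40], [5, 1, 3]]  # 3 indexes, 3 key nodes (made-up values)
w_first, scores_first = evaluate(x, [0, 0, 0], mu=[0.5, 0.5, 0.5])
```

With $\mu_j = 1$ the comprehensive weight falls back to the previous weight, and with $\mu_j = 0$ it equals the new entropy weight, which shows how $\mu_j$ controls the damping described above.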
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (5)

1. A network flow detection method based on public opinion analysis is characterized by comprising the following steps:
reading in original flow data, acquiring time sequence characteristic data, and dividing all the flow data according to data flow according to quintuple information;
extracting statistical characteristics and load characteristics of the data stream, and preliminarily predicting public opinion flow by combining time sequence characteristic data;
carrying out data cleaning on the feature set, removing the noise samples identified by the isolation forest, and dividing the de-noised feature set into a training set and a testing set by random sampling without replacement;
and confirming the preliminarily predicted public opinion flow according to a training set and a testing set, and determining the public opinion flow type of the public opinion flow based on the influence of the public opinion flow.
2. The public opinion traffic detection method according to claim 1, wherein the reading in of original traffic data to obtain time sequence feature data, and the dividing of all traffic data according to five-tuple information and data stream comprises:
preprocessing the acquired time sequence characteristic data;
extracting quintuple information < source IP address, destination IP address, source port, destination port and transport layer protocol > at the head part of the data packet, comparing the extracted port number with the information in the table to complete the identification of the data packet flow, counting the occurrence time of the data flow, judging whether the occurrence time of each data flow exceeds a set time threshold value, and if the occurrence time is greater than the time threshold value, carrying out characteristic weighted calculation on the data flow;
and preliminarily predicting public opinion flow according to the preprocessed time sequence characteristic data.
3. The method as claimed in claim 2, wherein the extracting of statistical features and load features of data streams and the preliminary prediction of public opinion traffic by combining time series feature data includes:
extracting specific fields, characters and character strings in the application layer load content of the data packet;
the method comprises the step of detecting the small-range load of the first N data packets of each flow direction in the data flow.
4. The public opinion flow detection method according to claim 3, wherein the data cleaning of the feature set to remove the noise samples identified by the isolation forest and the dividing of the de-noised feature set into a training set and a testing set by random sampling without replacement comprises:
training and testing a Bagging learning algorithm model based on Spark by using a training set and a testing set, and obtaining a testing result;
the Bagging learning algorithm based on Spark starts a Spark computing cluster and submits the Bagging integrated learning algorithm needing to be executed in parallel to a Spark driver;
submitting the Bagging integrated learning algorithm to be executed in parallel to a Spark driver for operation, and triggering an Action function in the parallel program;
the Spark Context module calculates and generates a corresponding RDD for the Bagging Application data set, and generates a corresponding job at the same time;
after the jobs are submitted to the system main node, the DAG scheduler generates a DAG graph and a corresponding task set for each job through calculation and submits the DAG graph and the corresponding task set to the task scheduler;
the task scheduler allocates the actual physical scheduling and execution of each task in the task set, calculates the resources required by the task of the computing resources to be allocated through the rear end of the scheduler, completes the task allocation of the task execution unit according to the calculation result of the rear end of the scheduler, and starts the corresponding task execution unit.
5. The method as claimed in claim 4, wherein the determining the preliminarily predicted public opinion traffic according to the training set and the testing set and determining the public opinion traffic type of the public opinion traffic based on the influence of the public opinion traffic comprises:
constructing an information interaction relationship network for the HPV topic on Baidu Tieba by using a web crawler and the social network analysis tool Gephi;
determining an accurate public opinion flow predicted value according to the flow components, the external influence factors and the output result of the decision tree flow model;
correcting model parameters of the decision tree flow model according to the accurate public opinion flow prediction value;
analyzing the actual influence according to the specific key nodes, and specifically mining factors for restraining the influence in the aspects of directness and indirection;
importing the cleaned user data into Gephi for social network analysis, constructing an information interaction relationship network as required, obtaining the network node centrality index data for the HPV topic on Baidu Tieba, and then calculating a key node influence function;
the time sequence characteristic data comprises at least one of time sequence characteristic data global flow direction matrix information and associated network performance indexes, wherein the associated network performance indexes comprise at least one of packet loss rate, time delay and jitter rate.
CN202011554194.1A 2020-12-24 2020-12-24 Network flow detection method based on public opinion analysis Active CN112597141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011554194.1A CN112597141B (en) 2020-12-24 2020-12-24 Network flow detection method based on public opinion analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011554194.1A CN112597141B (en) 2020-12-24 2020-12-24 Network flow detection method based on public opinion analysis

Publications (2)

Publication Number Publication Date
CN112597141A true CN112597141A (en) 2021-04-02
CN112597141B CN112597141B (en) 2022-07-15

Family

ID=75202032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011554194.1A Active CN112597141B (en) 2020-12-24 2020-12-24 Network flow detection method based on public opinion analysis

Country Status (1)

Country Link
CN (1) CN112597141B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568954A (en) * 2021-08-02 2021-10-29 湖北工业大学 Parameter optimization method and system for network flow prediction data preprocessing stage
CN113672500A (en) * 2021-07-27 2021-11-19 浙江大华技术股份有限公司 Deep learning algorithm testing method and device, electronic device and storage medium
CN114828030A (en) * 2022-03-30 2022-07-29 华中科技大学 Traffic-based WIFI coverage condition identification method, device and system
CN116208512A (en) * 2023-03-07 2023-06-02 武汉精阅数字传媒科技有限公司 Flow forward influence analysis method for implicit interaction behavior
CN117914733A (en) * 2024-03-15 2024-04-19 深圳尚米网络技术有限公司 Flow analysis and prediction method based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239529A (en) * 2017-05-27 2017-10-10 中国矿业大学 A kind of public sentiment hot category classification method based on deep learning
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN111859074A (en) * 2020-07-29 2020-10-30 东北大学 Internet public opinion information source influence assessment method and system based on deep learning
CN112104570A (en) * 2020-09-11 2020-12-18 南方电网科学研究院有限责任公司 Traffic classification method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
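Liu Zhaolu et al.: "Research on a Spark-based network traffic classification method" (基于Spark的网络流量分类方法研究), Journal on Communications (《通信学报》) *
Wu Dongyang: "Research on machine-learning-based encrypted traffic identification algorithms" (基于机器学习的加密流量识别算法研究), Wanfang Data Knowledge Service Platform (《万方数据知识服务平台》) *
Shang Liwei et al.: "Research on the influence mechanism of key nodes in the information interaction relationship network of online medical communities" (在线医疗社区信息交互关系网络关键节点影响力机理研究), Information Studies: Theory & Application (《情报理论与实践》) *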

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672500A (en) * 2021-07-27 2021-11-19 浙江大华技术股份有限公司 Deep learning algorithm testing method and device, electronic device and storage medium
CN113672500B (en) * 2021-07-27 2024-05-07 浙江大华技术股份有限公司 Deep learning algorithm testing method and device, electronic device and storage medium
CN113568954A (en) * 2021-08-02 2021-10-29 湖北工业大学 Parameter optimization method and system for network flow prediction data preprocessing stage
CN113568954B (en) * 2021-08-02 2024-03-19 湖北工业大学 Parameter optimization method and system for preprocessing stage of network flow prediction data
CN114828030A (en) * 2022-03-30 2022-07-29 华中科技大学 Traffic-based WIFI coverage condition identification method, device and system
CN114828030B (en) * 2022-03-30 2024-05-24 华中科技大学 WIFI coverage condition identification method, device and system based on traffic
CN116208512A (en) * 2023-03-07 2023-06-02 武汉精阅数字传媒科技有限公司 Flow forward influence analysis method for implicit interaction behavior
CN116208512B (en) * 2023-03-07 2023-10-17 杭州元媒科技有限公司 Flow forward influence analysis method for implicit interaction behavior
CN117914733A (en) * 2024-03-15 2024-04-19 深圳尚米网络技术有限公司 Flow analysis and prediction method based on big data
CN117914733B (en) * 2024-03-15 2024-05-28 深圳尚米网络技术有限公司 Flow analysis and prediction method based on big data

Also Published As

Publication number Publication date
CN112597141B (en) 2022-07-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant