CN113505223B - Network water army identification method and system - Google Patents

Info

Publication number
CN113505223B
Authority
CN
China
Prior art keywords
result
network
cart
water army
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110760492.4A
Other languages
Chinese (zh)
Other versions
CN113505223A (en
Inventor
肖玉芝
冶忠林
李明原
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qinghai Normal University
Original Assignee
Qinghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qinghai Normal University filed Critical Qinghai Normal University
Priority to CN202110760492.4A priority Critical patent/CN113505223B/en
Publication of CN113505223A publication Critical patent/CN113505223A/en
Application granted granted Critical
Publication of CN113505223B publication Critical patent/CN113505223B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a network water army identification method. First, a data set is trained with a support vector machine algorithm and with a logistic regression algorithm to obtain a first network water army recognition result and a second network water army recognition result. A CART tree classification result is then obtained from the emotional features, forwarding number, reply number and praise number of the comment text, together with the first and second network water army recognition results. Finally, the classification features of the first network water army recognition result, of the second network water army recognition result and of the CART tree classification result are extracted and fused by weighting to obtain the network water army recognition result. Because the three results are fused by weighting, the behavior characteristics of each network water army can be combined, which improves the recognition precision. The invention also provides a network water army identification system.

Description

Network water army identification method and system
Technical Field
The invention belongs to the technical field of network water army detection, and particularly relates to a network water army identification method and system.
Background
With the advent of the big data age, the popularity of social networks is self-evident. Users can voice their opinions on social platforms, but truth and falsehood are hard to tell apart, and public opinion is complex, changeable and subject to many interfering factors. For example, a network water army uses malicious hype to turn an individual demand into a group demand and a small-scale event into a hot event, thereby misleading the public. If water armies are allowed to stir up malicious hype unchecked, netizens lose trust in online media and building a sound network infrastructure becomes harder. Network water armies have a large influence on public opinion and can even steer its direction, so water army identification has important social significance for curbing malicious online behavior and promoting harmonious development.
At present, relatively little research has been done on identifying and analyzing water armies, so their underlying distribution characteristics and patterns are not yet known. Because few network water army data sets are publicly available, traditional network water army identification algorithms have high data costs and poor results. Current research on water army identification falls mainly into the following three types:
The first method takes hot events as the research object and analyzes the comment text of the most popular event in a given period. Shunhe et al. propose identifying water armies at a technical level by examining the text produced by user posts and the values generated on the server side. Wangbaobo et al. propose building a topic model through semantic analysis and clustering of comment content, then measuring how far a user's comments deviate from the topic in order to identify water armies. Li Jianchao et al. compute the similarity between each comment and the historical comment documents, and identify water armies according to the maximum number of comments posted on the same day.
The second method takes user characteristics as the research object and identifies water armies by analyzing the differences between normal users and water army users. Zhanmei et al. construct a microblog water army classifier from six dimensions, such as the number of mutual follows among users, the follow-to-fan ratio and the average number of microblogs posted in a fixed time. SHEN Huang et al. identify water armies with supervised learning methods after mining the microblog features, behavior features and attribute features of users. Suxiujia et al. derive indicators of the factors that influence comment usefulness from four aspects (the user who posts the comment, the comment content, the posting time of the topic comment and the reader of the comment) and use them to design a water army identification model. Hao qing et al. analyze user characteristics along five dimensions (user information features, question-answer pair features, user social network features, content features and linguistic features) to identify water armies.
The existing water army identification methods therefore consider only a few factors, so they cannot converge to a global optimum and their identification effect is poor.
Disclosure of Invention
The invention aims to provide a network water army identification method and a network water army identification system, and aims to solve the problem that the existing water army identification method is poor in identification effect.
To achieve this purpose, the invention adopts the following technical scheme: a network water army identification method comprising the following steps:
Step 1: acquiring microblog comment information; the microblog comment information comprises the comment text, forwarding number, reply number and praise number;
Step 2: performing feature extraction on the comment text to generate a data set;
Step 3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result;
Step 4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;
Step 5: performing sentiment analysis on the data set to obtain emotional features of the comment text;
Step 6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army identification result and the second network water army identification result of the comment text;
Step 7: respectively extracting the classification features of the first network water army identification result, the classification features of the second network water army identification result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
Step 8: performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army identification result.
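As a minimal sketch of the weighted fusion in step 8, the following Python example combines three per-model scores into one label. The function name `weighted_fusion`, the example scores, the weights and the 0.5 threshold are all illustrative assumptions; the source does not state how the weights are chosen.

```python
from typing import Sequence


def weighted_fusion(scores: Sequence[float],
                    weights: Sequence[float],
                    threshold: float = 0.5) -> int:
    """Fuse per-model water-army scores in [0, 1] into one label.

    Returns 1 (water army) or 0 (normal user). The threshold and the
    weights are assumptions made for illustration only.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average of the three prediction result features.
    fused = sum(s * w for s, w in zip(scores, weights)) / total
    return 1 if fused >= threshold else 0


# Hypothetical scores from the SVM, logistic regression and CART models.
label = weighted_fusion([0.9, 0.4, 0.8], weights=[0.3, 0.3, 0.4])
```

Normalizing by the sum of the weights keeps the fused score in [0, 1] even when the weights do not sum to one.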
Preferably, step 3 (training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result) comprises:
Step 3.1: adopting the formula:
Figure GDA0003420408160000041
to classify the data set and obtain a classification result; where $w^T x_i + b$, with parameters $(w, b)$, denotes the hyperplane, $w$ denotes the normal vector of the hyperplane, $b$ denotes the offset that determines the distance from the hyperplane to the origin, and $y_i$ indicates the category of the sample: when $y_i = +1$, the comment text corresponding to $x_i$ belongs to a normal user; when $y_i = -1$, the comment text corresponding to $x_i$ belongs to a water army user;
Step 3.2: establishing a first network water army identification model according to the classification result;
Step 3.3: dividing the data set into a first training set and a first test set at a ratio of 6:4;
Step 3.4: training the first network water army identification model with the first training set to obtain a trained first network water army identification model;
Step 3.5: performing water army identification on the first test set with the trained first network water army identification model to obtain the first network water army identification result.
Preferably, the first network water army identification model is:
Figure GDA0003420408160000042
where $y'_i$ denotes the label category and $m$ denotes the data set length.
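The formula images for the support vector machine are not reproduced in this text, so the following Python sketch assumes the standard linear form: a sample is classified by the sign of $w^T x_i + b$, and the quality of a hyperplane is measured by the smallest geometric margin $y_i (w^T x_i + b) / \lVert w \rVert$. The helper names and toy values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Compute w^T x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return +1 (normal user) or -1 (water army user), as in step 3.1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def geometric_margin(w: Sequence[float], b: float,
                     samples: List[Tuple[Sequence[float], int]]) -> float:
    """Smallest signed distance of any labelled sample to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, x, b) / norm for x, y in samples)


# Toy hyperplane and two labelled samples (illustrative values only).
w, b = [1.0, -1.0], 0.0
samples = [([2.0, 0.0], 1), ([0.0, 3.0], -1)]
margin = geometric_margin(w, b, samples)
```

A positive margin means every sample lies on the correct side of the hyperplane; training then amounts to choosing $(w, b)$ so that this margin is as large as possible.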
Preferably, step 4 (training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result) comprises:
Step 4.1: dividing the data set to obtain a division result; the division result is $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i = (x_1, x_2, \dots, x_n, 1)$ represents a feature vector with dimension n whose last entry, 1, represents the bias term; the label $y_i \in \{1, 0\}$: when $y_i = 1$, the comment text corresponding to $x_i$ belongs to a water army user, and when $y_i = 0$, the comment text corresponding to $x_i$ belongs to a normal user;
Step 4.2: constructing a prediction model according to the division result; the prediction model is:
Figure GDA0003420408160000051
where $w$ represents the weight vector;
Step 4.3: establishing a likelihood function according to the prediction model; the likelihood function is:
Figure GDA0003420408160000052
Step 4.4: dividing the data set into a second training set and a second test set at a ratio of 8:2;
Step 4.5: optimizing the likelihood function with the second training set to obtain a trained prediction model;
Step 4.6: classifying the second test set with the trained prediction model to obtain the second network water army identification result.
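Steps 4.1 to 4.6 can be sketched in plain Python as follows. Because the formula images are not reproduced here, the sketch assumes the usual logistic regression form: a sigmoid of $w^T x_i$ as the prediction model and the Bernoulli log-likelihood as the objective, maximized by gradient ascent. The function names, learning rate, epoch count and toy data are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features ending in 1 for bias, label)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Probability that x belongs to class 1 (water army user)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def log_likelihood(w: Sequence[float], data: List[Sample]) -> float:
    """Sum of y*log(p) + (1-y)*log(1-p) over the data set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, x), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def train(data: List[Sample], lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Maximize the log-likelihood by batch gradient ascent."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = y - predict(w, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w


# Toy one-feature data; the trailing 1.0 is the bias entry from step 4.1.
toy = [([0.0, 1.0], 0), ([1.0, 1.0], 0), ([3.0, 1.0], 1), ([4.0, 1.0], 1)]
weights = train(toy)
```

Thresholding `predict(weights, x)` at 0.5 then yields the class label used as the second identification result in this sketch.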
Preferably, the step 6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text, wherein the CART tree classification result comprises the following steps:
step 6.1: dividing the emotional features, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text to obtain a CART data set; wherein the CART dataset is:
$\{(a_1, b_1, c_1, \mathrm{Sentiment}_1, d_1, e_1, y_1), \dots, (a_n, b_n, c_n, \mathrm{Sentiment}_n, d_n, e_n, y_n)\}$, containing n samples, where a represents the forwarding number, b represents the reply number, c represents the praise number, Sentiment represents the emotional features of the comment text, d represents the data features of the first network water army identification result, e represents the data features of the second network water army identification result, and y represents the label category;
step 6.2: dividing n samples in the CART data set according to the number of the samples to obtain a first CART data set and a second CART data set;
step 6.3: constructing a Gini coefficient calculation formula according to the first CART data set and the second CART data set;
step 6.4: dividing the CART data set into a third training set and a third testing set according to the ratio of 8: 2;
step 6.5: obtaining a CART tree according to the Gini coefficient calculation formula and the third training set;
step 6.6: pruning the CART tree to obtain a pruned CART tree;
step 6.7: and classifying the third test set according to the pruned CART tree to obtain a CART tree classification result.
Preferably, the calculation formula of the Gini coefficient is as follows:
Figure GDA0003420408160000061
where $D_s$ represents the CART data set, $D_{s1}$ represents the first CART data set, $n_1$ represents the number of samples in the first CART data set, $D_{s2}$ represents the second CART data set, and $n_2$ represents the number of samples in the second CART data set.
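The Gini formula image is not reproduced in this text. As a hedged illustration, the Python sketch below assumes the standard CART definition, in which the Gini coefficient of a split is the sample-weighted average $\frac{n_1}{n}\,\mathrm{Gini}(D_{s1}) + \frac{n_2}{n}\,\mathrm{Gini}(D_{s2})$ with $n = n_1 + n_2$. The function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of one data set: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(first: Sequence[Hashable], second: Sequence[Hashable]) -> float:
    """Weighted Gini of a split into a first and a second CART data set."""
    n1, n2 = len(first), len(second)
    if n1 + n2 == 0:
        raise ValueError("at least one subset must be non-empty")
    n = n1 + n2
    return n1 / n * gini(first) + n2 / n * gini(second)


# Labels: 1 = water army user, 0 = normal user (toy values).
pure = gini_split([1, 1], [0, 0])    # perfectly separating split
mixed = gini_split([1, 0], [1, 0])   # split that separates nothing
```

When building the tree, the candidate split with the smallest weighted Gini value would be chosen at each node.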
Preferably, the step 6.6: pruning the CART tree to obtain a pruned CART tree, comprising:
pruning the CART tree by adopting a penalty function to obtain a pruned CART tree; wherein the penalty function is:
Figure GDA0003420408160000071
Figure GDA0003420408160000072
where $T$ is the number of leaf nodes, $\alpha$ is the penalty parameter, $N_t$ is the number of training samples at leaf node $t$, $H_t$ is the empirical entropy of leaf node $t$, $k$ is the number of classes, and $N_{tk}$ counts the sample points of each class at leaf node $t$.
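The two penalty-function images are not reproduced here. The symbol list is consistent with the common cost-complexity form $C_\alpha = \sum_t N_t H_t + \alpha T$ with $H_t = -\sum_k \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}$, and the Python sketch below assumes exactly that form with natural logarithms. The function names and the toy class counts are hypothetical.

```python
import math
from typing import List, Sequence


def empirical_entropy(class_counts: Sequence[int]) -> float:
    """H_t for one leaf, given the per-class sample counts N_tk."""
    n_t = sum(class_counts)
    if n_t == 0:
        return 0.0
    return -sum((c / n_t) * math.log(c / n_t) for c in class_counts if c > 0)


def pruning_cost(leaves: List[Sequence[int]], alpha: float) -> float:
    """Assumed penalty: sum over leaves of N_t * H_t, plus alpha * leaf count."""
    fit = sum(sum(counts) * empirical_entropy(counts) for counts in leaves)
    return fit + alpha * len(leaves)


# Toy subtree with two leaves, and the single leaf obtained by pruning it.
split_leaves = [[4, 0], [2, 2]]
merged_leaf = [[6, 2]]
```

With a small penalty such as `alpha=0.5` the two-leaf subtree has the lower cost and is kept; with a larger penalty such as `alpha=3.0` the merged leaf costs less, so the subtree would be pruned.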
The invention also provides a network water army identification system, which comprises:
the microblog comment information acquisition module is used for acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding quantity, reply quantity and praise quantity;
the comment text feature extraction module is used for performing feature extraction on the comment text to generate a data set;
the support vector machine algorithm training module is used for training the data set by adopting a support vector machine algorithm to obtain a first network water army recognition result;
the logistic regression algorithm training module is used for training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;
the emotion analysis module is used for carrying out emotion analysis on the data set to obtain emotion characteristics of the comment text;
the CART tree training module is used for obtaining a CART tree classification result according to the emotional features of the comment text, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result;
the result feature extraction module is used for respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
and the feature weighting and fusing module is used for weighting and fusing the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army identification result.
The network water army identification method and system have the following advantages over the prior art. First, a data set is trained with a support vector machine algorithm and with a logistic regression algorithm to obtain a first network water army recognition result and a second network water army recognition result. A CART tree classification result is then obtained from the emotional features, forwarding number, reply number and praise number of the comment text, together with the first and second network water army recognition results. Finally, the classification features of the first network water army recognition result, of the second network water army recognition result and of the CART tree classification result are extracted and fused by weighting to obtain the network water army recognition result. Because the three results are fused by weighting, the behavior characteristics of each network water army can be combined, which improves the recognition precision.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of a network water army identification method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a network water army identification method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a result of a training part of the fusion model according to the embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to provide a network water army identification method and a network water army identification system, and aims to solve the problem that the existing water army identification method is poor in identification effect.
To achieve this purpose, the invention adopts the following technical scheme: a network water army identification method comprising the following steps:
Based on the characteristics of microblog comment information, the network water army is identified using two kinds of attribute features: the text features of the microblog comments, and the user behavior features in the microblog comment information. Classification based on the comment text is performed by integrating several classifiers, following the ensemble idea, and the classification results are vectorized. The forwarding number, reply number, praise number, comment text sentiment value, first network water army recognition result and second network water army recognition result are then used as multiple features and classified with a tree. Finally, the models are fused by weighting into a strong classifier that identifies microblog water armies. The algorithm structure is shown in fig. 1.
Fig. 2 is a flowchart of the network water army identification method according to the present invention. Referring to fig. 2:
S1: acquiring microblog comment information; the microblog comment information comprises the comment text, forwarding number, reply number and praise number.
S2: performing feature extraction on the comment text to generate a data set;
The present invention combines the PV-DM and PV-DBOW sentence vector models, treating each sentence vector in the text data set as the combination of a vector trained with PV-DM and a vector trained with PV-DBOW. The resulting vectors are concatenated to obtain a 400-dimensional sentence vector. The effect is illustrated below with a simple example.
The following is a simple text chosen arbitrarily from the data set:
Text 1: the morning and New day begins with a foolish smile of the lungs of commiphora guidotti.
Partial results of training the fused PV-DM and PV-DBOW vector model are shown in fig. 3.
S3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result;
S3 specifically includes:
S3.1: adopting the formula:
Figure GDA0003420408160000101
to classify the data set and obtain a classification result; where $w^T x_i + b$, with parameters $(w, b)$, denotes the hyperplane, $w$ denotes the normal vector of the hyperplane, $b$ denotes the offset that determines the distance from the hyperplane to the origin, and $y_i$ indicates the category of the sample: when $y_i = +1$, the comment text corresponding to $x_i$ belongs to a normal user; when $y_i = -1$, the comment text corresponding to $x_i$ belongs to a water army user;
S3.2: establishing a first network water army identification model according to the classification result; the first network water army identification model is:
Figure GDA0003420408160000102
where s.t. means "subject to", $y'_i$ denotes the label category and $m$ denotes the data set length.
S3.3: dividing the data set into a first training set and a first test set at a ratio of 6:4;
S3.4: training the first network water army identification model with the first training set to obtain a trained first network water army identification model;
S3.5: performing water army identification on the first test set with the trained first network water army identification model to obtain the first network water army identification result.
The support vector machine algorithm is further described below:
Let data set $D_{m1} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $y_i$ is the category of the sample: when node data $x_i$ is a normal user, $y_i = +1$; when node data $x_i$ is a water army user, $y_i = -1$. Given the constraint $y_i(w^T x_i + b) > 0$, for each $(x_i, y_i)$ in the data set it is desirable to have:
Figure GDA0003420408160000111
For a data set sample $x_i$, substituting it into the hyperplane expression gives $w^T x_i + b$. If $w^T x_i + b > 0$, the output is $y_i = +1$ and sample $x_i$ is a normal user; if $w^T x_i + b < 0$, the output is $y_i = -1$ and sample $x_i$ is a water army user. Clearly, any hyperplane that classifies the samples correctly satisfies this condition. For the model to be sufficiently robust, a rule is needed to select the optimal decision plane. Expressing the binary classification problem mathematically under this rule gives the first network water army identification model:
Figure GDA0003420408160000112
The distance is made as large as possible by adjusting $w$ and $b$, where the distance is measured to the point $x_i$ closest to the hyperplane. With this definition, the algorithm can divide comment texts into water army users and normal users. The present invention divides 62554 pieces of data into a training set and a test set at a ratio of 6:4. Because random sampling gives an uncertain distribution, the more rigorous stratified sampling method is adopted, so that key features in each subset are distributed in basically the same way as in the overall data set. The data set distribution is shown in Table 1.
TABLE 1 data set distribution
Figure GDA0003420408160000121
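The stratified sampling described above can be sketched with the Python standard library as follows. The function name `stratified_split`, the fixed seed and the toy label counts are assumptions for illustration; the source does not say which tool performed the split.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(samples: Sequence[object],
                     labels: Sequence[Hashable],
                     train_ratio: float,
                     seed: int = 0) -> Tuple[List[tuple], List[tuple]]:
    """Split (sample, label) pairs so each class keeps the same train share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)                      # shuffle within each class
        cut = round(len(group) * train_ratio)   # per-class training size
        train += [(s, label) for s in group[:cut]]
        test += [(s, label) for s in group[cut:]]
    return train, test


# Toy data: 60 normal users (label 0) and 40 water army users (label 1).
toy_samples = list(range(100))
toy_labels = [0] * 60 + [1] * 40
train_set, test_set = stratified_split(toy_samples, toy_labels, train_ratio=0.6)
```

Splitting inside each class keeps the class proportions of the training and test sets equal to those of the whole data set, which is the property the text attributes to stratified sampling.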
The model is trained on the 50843 pieces of data in the training set. When both the bias and the variance are small, the model has converged, the error is small, and a good training effect is achieved.
The trained model then predicts the 12711 pieces of data in the test set, and a confusion matrix is constructed from the results.
TABLE 2 confusion matrix
Figure GDA0003420408160000122
Analyzing the classification results against the algorithm's evaluation indexes shows that FN is 2680 water army users, while the test set contains 6333 water army users. The evaluation indexes of the algorithm model are shown in Table 3.
TABLE 3 Algorithm model evaluation index
Figure GDA0003420408160000123
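Since Tables 2 and 3 are not reproduced in this text, only the two counts quoted above are available. As a rough check, the Python sketch below derives the recall for the water army class from them, assuming that water army users are the positive class and that FN counts water army users predicted as normal. Whether Table 3 reports the same quantity is not confirmed by the source.

```python
def recall_from_fn(false_negatives: int, actual_positives: int) -> float:
    """Recall = TP / (TP + FN), with TP = actual positives - FN."""
    if actual_positives <= 0:
        raise ValueError("actual_positives must be positive")
    if not 0 <= false_negatives <= actual_positives:
        raise ValueError("false_negatives must lie in [0, actual_positives]")
    true_positives = actual_positives - false_negatives
    return true_positives / actual_positives


# Counts quoted in the text for the support vector machine test set.
svm_recall = recall_from_fn(false_negatives=2680, actual_positives=6333)
```

Under these assumptions the support vector machine alone recovers a little under 58% of the water army users in its test set. Precision and accuracy cannot be derived in the same way, because the false positive and true negative counts are not given in the text.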
S4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result.
S4 specifically includes:
S4.1: dividing the data set to obtain a division result; the division result is $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i = (x_1, x_2, \dots, x_n, 1)$ represents a feature vector with dimension n whose last entry, 1, represents the bias term; the label $y_i \in \{1, 0\}$: when $y_i = 1$, the comment text corresponding to $x_i$ belongs to a water army user, and when $y_i = 0$, the comment text corresponding to $x_i$ belongs to a normal user;
S4.2: constructing a prediction model according to the division result; the prediction model is:
Figure GDA0003420408160000131
where $w$ represents the weight vector;
S4.3: establishing a likelihood function according to the prediction model; the likelihood function is:
Figure GDA0003420408160000132
S4.4: dividing the data set into a second training set and a second test set at a ratio of 8:2;
S4.5: optimizing the likelihood function with the second training set to obtain a trained prediction model;
S4.6: classifying the second test set with the trained prediction model to obtain the second network water army identification result.
The logistic regression algorithm is further described below:
Let data set $D_{m2} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i = (x_1, x_2, \dots, x_n, 1)$ is a feature vector with dimension n whose last entry, 1, represents the bias term. The label $y_i \in \{1, 0\}$ represents one of the two classes of the data set: $y_i = 1$ is a water army user and $y_i = 0$ is a normal user. Assume that the weight vector of the model is $w = (w_1, w_2, \dots, w_n)$, and define the output of the model as the probability that a sample belongs to class 1, that is, the probability of being a water army user. Then, for the feature vector $x_i$, the predicted value output by the model,
Figure GDA0003420408160000145
is expressed as:
Figure GDA0003420408160000141
The closer the model output under weight vector $w$ is to the given label on the training set, the better: if the label is a water army user, the output should be close to 1, and if the label is a normal user, the output should be close to 0. The loss function can therefore be derived by maximum likelihood estimation: a likelihood function L is established and maximized.
Figure GDA0003420408160000142
The present invention divides 62554 pieces of data into a training set and a test set at a ratio of 8:2. Because random sampling gives an uncertain distribution, the more rigorous stratified sampling method is adopted, so that key features in each subset are distributed in basically the same way as in the overall data set. Table 4 below shows the data set distribution.
TABLE 4 data set distribution
Figure GDA0003420408160000143
The trained model predicts the 11440 pieces of data in the test set, and the results are used to construct the confusion matrix shown in Table 5.
TABLE 5 confusion matrix
Figure GDA0003420408160000144
Figure GDA0003420408160000151
Analyzing the classification results against the algorithm's evaluation indexes shows that FN is 2351 water army users, while the test set contains 5738 water army users. The evaluation indexes of the algorithm model are shown in Table 6.
TABLE 6 evaluation index of algorithm model
Figure GDA0003420408160000152
S5: performing sentiment analysis on the data set to obtain sentiment characteristics of the comment text;
S6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army identification result and the second network water army identification result of the comment text;
S6 specifically includes:
S6.1: dividing the emotional features, the forwarding number, the reply number, the praise number, the first network water army identification result and the second network water army identification result of the comment text to obtain a CART data set; wherein the CART data set is:
$\{(a_1, b_1, c_1, \mathit{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathit{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, wherein a represents the forwarding number, b represents the reply number, c represents the praise number, Setiment represents the emotional feature of the comment text, d represents the data feature of the first network water army identification result, e represents the data feature of the second network water army identification result, and y represents the label category;
S6.2: dividing the n samples in the CART data set according to the number of samples to obtain a first CART data set and a second CART data set;
S6.3: constructing a Gini coefficient calculation formula according to the first CART data set and the second CART data set; wherein the Gini coefficient calculation formula is as follows:
Figure GDA0003420408160000161
wherein $D_s$ represents the CART data set, $D_{s1}$ represents the first CART data set, $n_1$ represents the number of samples in the first CART data set, $D_{s2}$ represents the second CART data set, and $n_2$ represents the number of samples in the second CART data set.
S6.4: dividing the CART data set into a third training set and a third testing set according to the ratio of 8: 2;
S6.5: obtaining a CART tree according to the Gini coefficient calculation formula and the third training set;
S6.6: pruning the CART tree to obtain a pruned CART tree; specifically, a penalty function is adopted to prune the CART tree; wherein the penalty function is:
Figure GDA0003420408160000162
Figure GDA0003420408160000163
wherein $T$ is the number of leaf nodes, $\alpha$ is the penalty parameter, $N_t$ is the number of training samples at leaf node $t$, $H_t$ is the empirical entropy of leaf node $t$, $k$ denotes the class, and $N_{tk}$ is the number of sample points of class $k$ at leaf node $t$.
S6.7: and classifying the third test set according to the pruned CART tree to obtain a CART tree classification result.
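The CART data set of step S6.1 can be pictured as a list of seven-field records. The sketch below is one possible representation, assuming that the two recognition results d and e are encoded as 0 (normal user) or 1 (water army user) and that the emotional feature is a single number; the patent does not fix these encodings, and the names `CartSample` and `build_cart_dataset` are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple

class CartSample(NamedTuple):
    a: int           # forwarding number
    b: int           # reply number
    c: int           # praise number
    setiment: float  # emotional feature of the comment text
    d: int           # first (support vector machine) recognition result
    e: int           # second (logistic regression) recognition result
    y: int           # label category

def build_cart_dataset(
    counts: Sequence[Tuple[int, int, int]],
    sentiments: Sequence[float],
    first_results: Sequence[int],
    second_results: Sequence[int],
    labels: Sequence[int],
) -> List[CartSample]:
    """Combine per-comment features and the two earlier predictions into n samples."""
    columns = (counts, sentiments, first_results, second_results, labels)
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all inputs must describe the same n samples")
    return [
        CartSample(a, b, c, s, d, e, y)
        for (a, b, c), s, d, e, y in zip(*columns)
    ]

# Two toy comments (hypothetical values).
dataset = build_cart_dataset(
    counts=[(12, 3, 40), (0, 0, 1)],
    sentiments=[0.8, -0.2],
    first_results=[0, 1],
    second_results=[0, 1],
    labels=[0, 1],
)
print(dataset[1])
```

Feeding d and e into the tree as ordinary features is what lets the CART model build on the two text-based classifiers instead of replacing them.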
This process is further described below:
The present invention divides the 62554 pieces of data into a training set and a test set at a ratio of 8:2. Because purely random sampling can yield subsets whose distribution differs from that of the whole, a more rigorous stratified sampling method is adopted, so that the key features of each subset have a distribution basically consistent with that of the overall data set. Table 7 below shows the data set distribution in the present invention.
TABLE 7 data set distribution
Figure GDA0003420408160000171
A CART tree is constructed from the data features a, b, c and time of the microblog comment information, together with the water army recognition results d and e of the two algorithms based on the microblog comment text. The CART tree differs from other decision trees in how it selects features. The ID3 tree selects features by information gain, preferring features with higher gain. The C4.5 tree selects features by information gain ratio, which avoids the inflated information gain of features that take many distinct values. The CART classification tree algorithm selects features by the Gini coefficient and determines the optimal binary split point of each feature.
The CART tree algorithm is described as follows:
In a classification problem with K classes, let the probability that a sample point belongs to the k-th class be $P_k$. For the binary text classification problem, K = 2 (normal users and water army users), so the Gini index simplifies to:
Gini(p)=2P(1-P)
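For reference, this simplification follows from the general definition of the Gini index. The derivation below is the textbook one, with P denoting the probability of one of the two classes and 1 - P the probability of the other:

```latex
\begin{aligned}
\mathrm{Gini}(p) &= \sum_{k=1}^{K} P_k (1 - P_k) = 1 - \sum_{k=1}^{K} P_k^{2} \\
                 &= 1 - P^{2} - (1 - P)^{2} \qquad (K = 2,\ P_1 = P,\ P_2 = 1 - P) \\
                 &= 2P(1 - P)
\end{aligned}
```

The value is 0 when all samples belong to one class and reaches its maximum of 0.5 at P = 0.5, so a smaller Gini index means a purer set of samples.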
Let the data set $D_s$ be:
$\{(a_1, b_1, c_1, \mathit{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathit{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, where a, b, c, Setiment, d and e are the data features of each sample: a is the forwarding number, b is the reply number, c is the praise number, Setiment is the emotional feature of the comment text, and d and e are the water army recognition results of the two algorithms based on the microblog comment text. According to the i-th attribute of the data set, namely $(a_i, b_i, c_i, \mathit{Setiment}_i, d_i, e_i, y_i)$, the data set is divided into two parts $D_{s1}$ and $D_{s2}$, and the Gini coefficient is then calculated as follows:
Figure GDA0003420408160000181
where $n_1$ and $n_2$ are the numbers of samples in the data sets $D_{s1}$ and $D_{s2}$ respectively. The four Gini coefficients are compared, the smallest one is selected, and the corresponding i-th attribute and its attribute value are taken as the optimal splitting attribute of the samples.
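The split selection can be sketched as follows, assuming the usual weighted form of the split criterion, (n_1/n)·Gini(D_s1) + (n_2/n)·Gini(D_s2), and binary 0/1 labels; the formula in the figure is not reproduced here, and the function names and toy rows are hypothetical.

```python
def gini(labels):
    """Gini index 2P(1-P) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def weighted_gini(left, right):
    """Gini coefficient of a split into two label lists, weighted by size."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)

def best_split(rows, labels):
    """Try every feature and threshold; return (gini, feature_index, threshold)."""
    best = None
    for i in range(len(rows[0])):
        for threshold in sorted({row[i] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[i] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[i] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = weighted_gini(left, right)
            if best is None or score < best[0]:
                best = (score, i, threshold)
    return best

# Toy rows (hypothetical): feature 0 is an earlier 0/1 prediction, feature 1 a count.
rows = [(0, 5), (1, 3), (0, 4), (1, 9)]
labels = [0, 1, 0, 1]
print(best_split(rows, labels))
```

In this toy example the earlier prediction separates the two classes completely, so the split on feature 0 reaches a Gini coefficient of 0 and is selected.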
The 11440 pieces of data in the test set are predicted on the basis of the above experimental results, and the predictions are used to construct the confusion matrix shown in Table 8.
TABLE 8 confusion matrix
Figure GDA0003420408160000182
Because CART trees are prone to overfitting, pruning is required to improve generalization ability. The present invention uses a penalty function to measure the degree of overfitting.
In the pruning process, the CART tree is traversed from bottom to top and branches are pruned successively up to the root node, generating a sequence of subtrees. The pruning criterion compares the penalty function of the subtree before and after pruning; if the value after pruning is smaller than the value before pruning, the branch is pruned. Pruning thereby reduces the complexity of the tree.
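The comparison of penalty values before and after pruning can be sketched as below. The sketch assumes the commonly used form of the penalty, the sum over leaf nodes of N_t·H_t plus α times the number of leaf nodes, with H_t the empirical entropy of the labels at leaf t (natural logarithm). This matches the symbols defined for the penalty function, but the figures themselves are not reproduced here; the function names and toy leaves are hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(leaf_labels):
    """H_t = -sum_k (N_tk / N_t) * log(N_tk / N_t) for one leaf."""
    n = len(leaf_labels)
    return -sum(c / n * math.log(c / n) for c in Counter(leaf_labels).values())

def penalty(leaves, alpha):
    """Assumed penalty: sum over leaves of N_t * H_t, plus alpha per leaf."""
    fit = sum(len(leaf) * empirical_entropy(leaf) for leaf in leaves)
    return fit + alpha * len(leaves)

def should_prune(child_leaves, alpha):
    """Prune when merging the children into one leaf lowers the penalty."""
    merged = [label for leaf in child_leaves for label in leaf]
    return penalty([merged], alpha) < penalty(child_leaves, alpha)

# Toy branch (hypothetical): two leaves holding the labels of their samples.
children = [[1, 1, 1, 0], [1, 1]]
print(should_prune(children, alpha=1.0), should_prune(children, alpha=0.1))
```

A larger α charges more for every leaf, so the same branch is pruned under α = 1.0 and kept under α = 0.1; α therefore controls the trade-off between fit and tree size.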
The classification results are analyzed according to the evaluation indexes of the algorithm: FN is 707 water army users, out of 5677 water army users in the test set. This shows that using the output d of the water army recognition algorithm based on the microblog comment text as an input to the algorithm of this section, so that multiple features (Setiment, a, b, c, d, e) serve as the input of the CART tree, works well. Table 9 shows the evaluation indexes of the CART tree algorithm model.
TABLE 9 Algorithm model evaluation index
Figure GDA0003420408160000191
S7: respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
S8: performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army recognition result.
In practical application, the water army identification model based on the microblog comment text and the water army identification model based on the microblog comment information are fused: following the Boosting idea, the two classifiers are weighted to obtain a strong classifier. The water army identification algorithm is described as follows.
Figure GDA0003420408160000192
Figure GDA0003420408160000201
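The weighting of two classifiers can be illustrated with an AdaBoost-style scheme. This is a sketch under assumptions and not the algorithm in the figures above, which are not reproduced here: each classifier receives the weight 0.5·ln((1 - err)/err) computed from its error rate, and the fused label is the sign of the weighted vote. The function names and toy predictions are hypothetical, and the sketch omits the iterative re-weighting of samples that a full Boosting procedure performs.

```python
import math

def error_rate(predictions, labels):
    """Fraction of samples on which a classifier is wrong."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

def classifier_weight(err):
    """AdaBoost-style weight: larger for classifiers with a lower error rate."""
    eps = 1e-10  # avoids division by zero and log(0)
    err = min(max(err, eps), 1 - eps)
    return 0.5 * math.log((1 - err) / err)

def fuse(predictions_per_model, weights):
    """Weighted vote over 0/1 predictions; returns the fused 0/1 labels."""
    fused = []
    for sample_preds in zip(*predictions_per_model):
        score = sum(w * (1 if p == 1 else -1) for w, p in zip(weights, sample_preds))
        fused.append(1 if score > 0 else 0)
    return fused

# Toy predictions (hypothetical): 1 = water army user, 0 = normal user.
labels     = [1, 1, 0, 0, 1, 0]
text_model = [1, 0, 0, 0, 1, 1]  # wrong on two samples
info_model = [1, 1, 0, 1, 1, 0]  # wrong on one sample

models = [text_model, info_model]
weights = [classifier_weight(error_rate(m, labels)) for m in models]
fused = fuse(models, weights)
print(weights, fused)
```

In practice the weights would be estimated on training data and the fused model evaluated on a held-out test set; the toy example computes both on the same six samples only to keep the sketch short.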
The above describes the process of the water army recognition algorithm based on microblog comments. Using the Boosting idea, the water army recognition model based on the microblog comment text and the water army recognition model based on the microblog comment information are fused and given different weights, and the algorithm is then trained iteratively to recognize water army users. The confusion matrix obtained by predicting the test set with the algorithm described above is shown in Table 10.
TABLE 10 confusion matrix
Figure GDA0003420408160000202
Figure GDA0003420408160000211
By comparison, the fusion algorithm performs better. The evaluation indexes of the algorithm are shown in Table 11.
TABLE 11 evaluation index of the algorithm
Figure GDA0003420408160000212
According to the invention, the first network water army recognition result, the second network water army recognition result and the CART tree classification result are subjected to weighted fusion, so that the behavior characteristics of each network water army can be fused, and the recognition precision of the network water army is greatly improved.
The invention also provides a network water army identification system, which comprises:
the microblog comment information acquisition module is used for acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding quantity, reply quantity and praise quantity;
the comment text feature extraction module is used for performing feature extraction on the comment text to generate a data set;
the support vector machine algorithm training module is used for training the data set by adopting a support vector machine algorithm to obtain a first network water army recognition result;
the logistic regression algorithm training module is used for training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;
the emotion analysis module is used for carrying out emotion analysis on the data set to obtain emotion characteristics of the comment text;
the CART tree training module is used for obtaining a CART tree classification result according to the emotional characteristics, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text;
the result feature extraction module is used for respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
and the characteristic weighting and fusing module is used for weighting and fusing the first prediction result characteristic, the second prediction result characteristic and the third prediction result characteristic to obtain a network water army recognition result.
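The way the modules feed one another can be sketched as a small pipeline object. This is only a structural illustration: every field name, dictionary key and stub function below is hypothetical, and the stubs stand in for the trained support vector machine, logistic regression, sentiment and CART models, which are not implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class WaterArmyPipeline:
    extract_features: Callable  # comment texts -> data set
    svm_predict: Callable       # data set -> first recognition result d
    lr_predict: Callable        # data set -> second recognition result e
    sentiment: Callable         # data set -> emotional features
    cart_predict: Callable      # CART rows -> CART tree classification result
    fuse: Callable              # (d, e, cart) -> final recognition result

    def run(self, comments: List[Dict]) -> List[int]:
        dataset = self.extract_features([c["text"] for c in comments])
        d = self.svm_predict(dataset)
        e = self.lr_predict(dataset)
        s = self.sentiment(dataset)
        # CART input: forwarding, reply and praise numbers, sentiment, d and e.
        rows = [
            (c["forwards"], c["replies"], c["likes"], si, di, ei)
            for c, si, di, ei in zip(comments, s, d, e)
        ]
        cart = self.cart_predict(rows)
        return self.fuse(d, e, cart)

# Stub models (placeholders only, not the trained classifiers).
pipeline = WaterArmyPipeline(
    extract_features=lambda texts: [len(t) for t in texts],
    svm_predict=lambda ds: [1 if v > 5 else 0 for v in ds],
    lr_predict=lambda ds: [1 if v > 3 else 0 for v in ds],
    sentiment=lambda ds: [0.0 for _ in ds],
    cart_predict=lambda rows: [1 if r[4] and r[5] else 0 for r in rows],
    fuse=lambda d, e, t: [1 if di + ei + ti >= 2 else 0 for di, ei, ti in zip(d, e, t)],
)

comments = [
    {"text": "buy now!!!", "forwards": 0, "replies": 0, "likes": 0},
    {"text": "ok", "forwards": 3, "replies": 1, "likes": 7},
]
result = pipeline.run(comments)
print(result)
```

Keeping each module behind a plain callable makes it possible to replace a stub with a trained model without changing the order in which the results flow from one module to the next.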
The invention discloses a network water army recognition method and a system, and the network water army recognition method provided by the invention comprises the steps of firstly, training a data set by adopting a support vector machine algorithm and a logistic regression algorithm to obtain a first network water army recognition result and a second network water army recognition result, and then obtaining a CART tree classification result according to the emotional characteristics, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of a comment text; and finally, respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result, and performing weighted fusion to obtain the network water army recognition result. According to the invention, the first network water army recognition result, the second network water army recognition result and the CART tree classification result are subjected to weighted fusion, so that the behavior characteristics of each network water army can be fused, and the recognition precision of the network water army is greatly improved.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A network water army identification method, characterized by comprising the following steps:
step 1: acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding quantity, reply quantity and praise quantity;
step 2: performing feature extraction on the comment text to generate a data set;
step 3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result;
step 4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;
step 5: performing sentiment analysis on the data set to obtain sentiment features of the comment text;
step 6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army identification result and the second network water army identification result of the comment text;
step 7: respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
step 8: performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army recognition result.
2. The network water army identification method according to claim 1, wherein the step 3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result, comprises:
step 3.1: the formula is adopted:
Figure FDA0003420408150000021
classifying the data set to obtain a classification result; wherein $(w, b)$, i.e. $w^{T}x_i + b$, denotes the hyperplane, $w$ denotes the normal vector of the hyperplane, $b$ denotes the distance from the hyperplane to the origin, $x_i$ represents the node data, and $y_i$ indicates the category of the sample: when $y_i = +1$, the comment text corresponding to $x_i$ belongs to a normal user; when $y_i = -1$, the comment text corresponding to $x_i$ belongs to a water army user;
step 3.2: establishing a first network water army identification model according to the classification result;
step 3.3: dividing the data set into a first training set and a first testing set according to a ratio of 6: 4;
step 3.4: training the first network water army identification model by using the first training set to obtain a trained first network water army identification model;
step 3.5: carrying out water army recognition on the first test set by utilizing the trained first network water army identification model to obtain a first network water army recognition result.
3. The network water army identification method of claim 2, wherein the first network water army identification model is:
Figure FDA0003420408150000022
wherein $y'_i$ indicates the label category and $m$ indicates the data set length.
4. The network water army identification method according to claim 1, wherein the step 4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result, comprises:
step 4.1: dividing the data set to obtain a division result; wherein the division result is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, in which $x_i = (x_1, x_2, \ldots, x_n, 1)$ represents a feature vector of dimension n with a trailing 1 that represents the bias term; the label $y_i \in \{1, 0\}$, where $y_i = 1$ means the comment text corresponding to $x_i$ belongs to a water army user, and $y_i = 0$ means the comment text corresponding to $x_i$ belongs to a normal user;
step 4.2: constructing a prediction model according to the division result; wherein the prediction model is:
Figure FDA0003420408150000031
wherein w represents a weight vector;
step 4.3: establishing a likelihood function according to the prediction model; wherein the likelihood function is:
Figure FDA0003420408150000032
step 4.4: dividing the data set into a second training set and a second testing set according to the ratio of 8: 2;
step 4.5: carrying out optimization training on the likelihood function by using the second training set to obtain a trained prediction model;
step 4.6: classifying the second test set by using the trained prediction model to obtain a second network water army identification result.
5. The network water army identification method according to claim 1, wherein the step 6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text, comprises:
step 6.1: dividing the emotional features, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text to obtain a CART data set; wherein the CART data set is:
$\{(a_1, b_1, c_1, \mathit{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathit{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, wherein a represents the forwarding number, b represents the reply number, c represents the praise number, Setiment represents the emotional feature of the comment text, d represents the data feature of the first network water army identification result, e represents the data feature of the second network water army identification result, and y represents the label category;
step 6.2: dividing n samples in the CART data set according to the number of the samples to obtain a first CART data set and a second CART data set;
step 6.3: constructing a Gini coefficient calculation formula according to the first CART data set and the second CART data set;
step 6.4: dividing the CART data set into a third training set and a third testing set according to the ratio of 8: 2;
step 6.5: obtaining a CART tree according to the Gini coefficient calculation formula and the third training set;
step 6.6: pruning the CART tree to obtain a pruned CART tree;
step 6.7: and classifying the third test set according to the pruned CART tree to obtain a CART tree classification result.
6. The network water army identification method of claim 5, wherein the Gini coefficient calculation formula is as follows:
Figure FDA0003420408150000041
wherein $D_s$ represents the CART data set, $D_{s1}$ represents the first CART data set, $n_1$ represents the number of samples in the first CART data set, $D_{s2}$ represents the second CART data set, and $n_2$ represents the number of samples in the second CART data set.
7. The network water army identification method according to claim 5, wherein the step 6.6: pruning the CART tree to obtain a pruned CART tree, comprising:
pruning the CART tree by adopting a penalty function to obtain a pruned CART tree; wherein the penalty function is:
Figure FDA0003420408150000051
Figure FDA0003420408150000052
wherein $T$ is the number of leaf nodes, $\alpha$ is the penalty parameter, $N_t$ is the number of training samples at leaf node $t$, $H_t$ is the empirical entropy of leaf node $t$, $k$ denotes the class, and $N_{tk}$ is the number of sample points of class $k$ at leaf node $t$.
8. A network water army identification system, comprising:
the microblog comment information acquisition module is used for acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding quantity, reply quantity and praise quantity;
the comment text feature extraction module is used for performing feature extraction on the comment text to generate a data set;
the support vector machine algorithm training module is used for training the data set by adopting a support vector machine algorithm to obtain a first network water army recognition result;
the logistic regression algorithm training module is used for training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;
the emotion analysis module is used for carrying out emotion analysis on the data set to obtain emotion characteristics of the comment text;
the CART tree training module is used for obtaining a CART tree classification result according to the emotional features of the comment text, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result;
the result feature extraction module is used for respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;
and the characteristic weighting and fusing module is used for weighting and fusing the first prediction result characteristic, the second prediction result characteristic and the third prediction result characteristic to obtain a network water army identification result.
CN202110760492.4A 2021-07-06 2021-07-06 Network water army identification method and system Active CN113505223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110760492.4A CN113505223B (en) 2021-07-06 2021-07-06 Network water army identification method and system

Publications (2)

Publication Number Publication Date
CN113505223A CN113505223A (en) 2021-10-15
CN113505223B true CN113505223B (en) 2022-01-28

Family

ID=78011266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110760492.4A Active CN113505223B (en) 2021-07-06 2021-07-06 Network water army identification method and system

Country Status (1)

Country Link
CN (1) CN113505223B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905548B (en) * 2023-03-03 2024-05-10 美云智数科技有限公司 Water army recognition method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10063582B1 (en) * 2017-05-31 2018-08-28 Symantec Corporation Securing compromised network devices in a network
CN109241518A (en) * 2017-07-11 2019-01-18 北京交通大学 A kind of detection network navy method based on sentiment analysis

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020035B (en) * 2017-09-06 2023-05-12 腾讯科技(北京)有限公司 Data identification method and device, storage medium and electronic device
CN108228853A (en) * 2018-01-11 2018-06-29 北京信息科技大学 A kind of microblogging rumour recognition methods and system
US20200372400A1 (en) * 2019-05-22 2020-11-26 The Regents Of The University Of California Tree alternating optimization for learning classification trees
CN110990683B (en) * 2019-11-29 2022-08-23 重庆邮电大学 Microblog rumor integrated identification method and device based on region and emotional characteristics
CN112200638A (en) * 2020-10-30 2021-01-08 福州大学 Water army comment detection system and method based on attention mechanism and bidirectional GRU network

Also Published As

Publication number Publication date
CN113505223A (en) 2021-10-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant