CN113505223B

CN113505223B - Network water army identification method and system

Info

Publication number: CN113505223B
Application number: CN202110760492.4A
Authority: CN
Inventors: 肖玉芝; 冶忠林; 李明原; 张伟
Original assignee: Qinghai Normal University
Current assignee: Qinghai Normal University
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2022-01-28
Anticipated expiration: 2041-07-06
Also published as: CN113505223A

Abstract

The invention provides a network water army identification method. First, a data set is trained with a support vector machine algorithm and a logistic regression algorithm to obtain a first network water army identification result and a second network water army identification result. A CART tree classification result is then obtained from the emotional features, forwarding number, reply number and praise number of a comment text, together with the first and second network water army identification results. Finally, the classification features of the first network water army identification result, of the second network water army identification result and of the CART tree classification result are extracted respectively and fused by weighting to obtain the network water army identification result. By weighted fusion of the first network water army identification result, the second network water army identification result and the CART tree classification result, the invention fuses the behavior characteristics of each network water army and greatly improves the identification precision. The invention also provides a network water army identification system.

Description

Network water army identification method and system

Technical Field

The invention belongs to the technical field of network water army detection, and particularly relates to a network water army identification method and system.

Background

With the advent of the big data era, the popularity of social networks is self-evident. Users can speak freely on social platforms, but true and false information is hard to tell apart, public opinion is complex and changeable, and it is subject to many interfering factors. For example, the network water army uses malicious hype to turn individual demands into group demands and small-scale events into hot events, thereby misleading the public. If the network water army is allowed to hype events maliciously, netizens will no longer trust online media, and building a sound basic network system will become more difficult. The network water army has a huge influence on public opinion and can even drive its direction, so water army identification is of great social significance for curbing malicious online behavior and promoting harmonious development.

At present, there is relatively little research on the identification and analysis of the water army, and the underlying distribution characteristics and patterns of the water army have not been obtained. Because few network water army data sets are publicly available, traditional network water army identification algorithms have high data costs and poor results. Current research on water army identification is mainly divided into the following three types:

The first method takes a hot event as the research object and analyzes the comment text of the most popular event within a given time period. Shunhe et al. propose identifying the water army at the technical level by examining the text generated by user posts and the values generated on the server side. Wangbaobo et al. propose building a topic model through semantic analysis, clustering and similar processing of the comment content, and then analyzing how far each user comment deviates from the topic in order to identify the water army. Li Jianchao et al. compute the similarity between each comment and a document of historical comments, and identify the water army according to the maximum number of comments made on the same day.

The second method takes user characteristics as the research object and identifies the water army by analyzing the differences between normal users and water army users. Zhanmei et al. construct a microblog water army classifier from six dimensions, including the number of mutual follows among users, the follower-to-following ratio and the average number of microblogs posted in a fixed period. SHEN Huang et al. identify the water army with supervised learning methods, based on mining the microblog features, behavior features and attribute features of users. Suxiujia et al. describe the factors that influence comment usefulness from four aspects (the user who posts the comment, the comment content, the posting time of the comment and the reader of the comment) and design a water army identification model on that basis. Hao qing et al. analyze user characteristics comprehensively along five dimensions (user information features, question-answer pair features, user social network features, content features and linguistic features) to identify the water army.

Therefore, existing water army identification methods consider only a few factors, so they cannot converge to the global optimum and their identification performance is poor.

Disclosure of Invention

The invention aims to provide a network water army identification method and a network water army identification system, and aims to solve the problem that the existing water army identification method is poor in identification effect.

In order to achieve the above purpose, the invention adopts the following technical solution: a network water army identification method, comprising the following steps:

step 1: acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding numbers, reply numbers and praise numbers;

step 2: performing feature extraction on the comment text to generate a data set;

step 3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result;

step 4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;

step 5: performing sentiment analysis on the data set to obtain the emotional features of the comment text;

step 6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result;

step 7: respectively extracting the classification features of the first network water army identification result, the classification features of the second network water army identification result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;

step 8: performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army identification result.

Preferably, step 3 (training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result) comprises the following steps:

step 3.1: the formula is adopted:

classifying the data set to obtain a classification result; wherein $w^{T}x_i + b$ denotes the hyperplane determined by $(w, b)$, $w$ denotes the normal vector of the hyperplane, $b$ denotes the offset of the hyperplane from the origin, and $y_i$ denotes the category of the sample: when $y_i = +1$, the comment text corresponding to $x_i$ belongs to a normal user; when $y_i = -1$, the comment text corresponding to $x_i$ belongs to a water army user;

step 3.2: establishing a first network water army identification model according to the classification result;

step 3.3: dividing the data set into a first training set and a first test set at a ratio of 6:4;

step 3.4: training the first network water army identification model by using the first training set to obtain a trained first network water army identification model;

step 3.5: performing water army identification on the first test set by using the trained first network water army identification model to obtain the first network water army identification result.

Preferably, the first network water army identification model is:

wherein $y'_i$ denotes the label category and $m$ denotes the data set length.
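The model formula itself is not reproduced in this text. As an assumption only, a standard hard-margin form that is consistent with the symbols defined here ($w$, $b$, $y'_i$, $m$) and with the later description of "making the distance to the closest point the farthest" would be:

```latex
\max_{w,\,b}\ \min_{x_i}\ \frac{1}{\lVert w \rVert}\,\bigl| w^{T}x_i + b \bigr|
\qquad \text{s.t.}\quad y'_i\,\bigl(w^{T}x_i + b\bigr) > 0,\quad i = 1,\dots,m
```

This is the textbook maximum-margin objective and may differ in detail from the formula in the original publication.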

Preferably, step 4 (training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result) comprises the following steps:

step 4.1: dividing the data set to obtain a division result; wherein the division result is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, in which $x_i = (x_1, x_2, \ldots, x_n, 1)$ represents a feature vector of dimension n whose final entry 1 represents the bias term; the label $y_i \in \{1, 0\}$, where, when $y_i = 1$, the comment text corresponding to $x_i$ belongs to a water army user, and when $y_i = 0$, the comment text corresponding to $x_i$ belongs to a normal user;

step 4.2: constructing a prediction model according to the division result; wherein the prediction model is:

wherein $w$ represents a weight vector;

step 4.3: establishing a likelihood function according to the prediction model; wherein the likelihood function is:

step 4.4: dividing the data set into a second training set and a second test set at a ratio of 8:2;

step 4.5: performing optimization training on the likelihood function by using the second training set to obtain a trained prediction model;

step 4.6: classifying the second test set by using the trained prediction model to obtain the second network water army identification result.
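The formulas referred to in steps 4.2 and 4.3 are not reproduced in this text. The standard logistic regression forms, which match the definitions given here (weight vector $w$, labels $y_i \in \{1, 0\}$, bias folded into $x_i$), are shown below as an assumption about what the missing formulas contain; $\hat{y}_i$ is a symbol introduced here for the predicted value.

```latex
\hat{y}_i = \frac{1}{1 + e^{-w^{T}x_i}},
\qquad
L(w) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i}\,\bigl(1 - \hat{y}_i\bigr)^{1 - y_i}
```

Maximizing $L(w)$ is equivalent to maximizing its logarithm, which is the form usually optimized in practice.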

Preferably, step 6 (obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result) comprises the following steps:

step 6.1: dividing the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result to obtain a CART data set; wherein the CART data set is:

$\{(a_1, b_1, c_1, \mathrm{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathrm{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, wherein a represents the forwarding number, b represents the reply number, c represents the praise number, Setiment represents the emotional feature of the comment text, d represents the data feature of the first network water army identification result, e represents the data feature of the second network water army identification result, and y represents the data category;

step 6.2: dividing the n samples in the CART data set according to the number of samples to obtain a first CART data set and a second CART data set;

step 6.3: constructing a Gini coefficient calculation formula according to the first CART data set and the second CART data set;

step 6.4: dividing the CART data set into a third training set and a third test set at a ratio of 8:2;

step 6.5: obtaining a CART tree according to the Gini coefficient calculation formula and the third training set;

step 6.6: pruning the CART tree to obtain a pruned CART tree;

step 6.7: classifying the third test set according to the pruned CART tree to obtain the CART tree classification result.

Preferably, the Gini coefficient calculation formula is as follows:

wherein $D_s$ represents the CART data set, $D_{s1}$ represents the first CART data set, $n_1$ represents the number of samples in the first CART data set, $D_{s2}$ represents the second CART data set, and $n_2$ represents the number of samples in the second CART data set.
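The Gini coefficient formula is not reproduced in this text. Assuming it is the usual CART split criterion, written with the symbols defined above and with $n = n_1 + n_2$ introduced here for the total sample count, it would read:

```latex
\mathrm{Gini}(D_s) = \frac{n_1}{n}\,\mathrm{Gini}(D_{s1}) + \frac{n_2}{n}\,\mathrm{Gini}(D_{s2}),
\qquad n = n_1 + n_2
```

Each term weights the impurity of one subset by its share of the samples, so a split that produces two nearly pure subsets yields a small value.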

Preferably, step 6.6 (pruning the CART tree to obtain a pruned CART tree) comprises:

pruning the CART tree by adopting a penalty function to obtain the pruned CART tree; wherein the penalty function is:

wherein $T$ is the number of leaf nodes, $\alpha$ is the penalty parameter, $N_t$ is the number of training samples at leaf node t, $H_t$ is the empirical entropy, k is the number of classes, and $N_{tk}$ is the number of sample points of each class at leaf node t.
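The penalty function is not reproduced in this text. The symbols listed above match the standard cost-complexity form used for decision tree pruning, which is given here as an assumption rather than as the formula of the original publication:

```latex
C_{\alpha} = \sum_{t=1}^{T} N_t\,H_t + \alpha\,T,
\qquad
H_t = -\sum_{k} \frac{N_{tk}}{N_t}\,\log\frac{N_{tk}}{N_t}
```

The first term measures how well the leaves fit the training data and the second term charges a cost $\alpha$ per leaf, so a larger $\alpha$ favors smaller trees.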

The invention also provides a network water army identification system, which comprises:

the microblog comment information acquisition module is used for acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding quantity, reply quantity and praise quantity;

the comment text feature extraction module is used for performing feature extraction on the comment text to generate a data set;

the support vector machine algorithm training module is used for training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result;

the logistic regression algorithm training module is used for training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result;

the emotion analysis module is used for performing sentiment analysis on the data set to obtain the emotional features of the comment text;

the CART tree training module is used for obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result;

the result feature extraction module is used for respectively extracting the classification features of the first network water army identification result, the classification features of the second network water army identification result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;

and the feature weighting and fusion module is used for performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army identification result.

The network water army identification method and system provided by the invention have the following advantages compared with the prior art. First, a data set is trained with a support vector machine algorithm and a logistic regression algorithm to obtain a first network water army identification result and a second network water army identification result. A CART tree classification result is then obtained from the emotional features, forwarding number, reply number and praise number of a comment text, together with the first and second network water army identification results. Finally, the classification features of the first network water army identification result, of the second network water army identification result and of the CART tree classification result are extracted respectively and fused by weighting to obtain the network water army identification result. By weighted fusion of the first network water army identification result, the second network water army identification result and the CART tree classification result, the invention fuses the behavior characteristics of each network water army and greatly improves the identification precision.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a structural diagram of a network water army identification method according to an embodiment of the present invention.

Fig. 2 is a flowchart of a network water army identification method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a result of a training part of the fusion model according to the embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

According to the characteristics of microblog comment information, the invention identifies the network water army by considering two kinds of attribute features: the first is the text features of the microblog comments; the second is the user behavior features in the microblog comment information. Classification based on the microblog comment text is completed by combining several classifiers following the ensemble idea, and the classification results are vectorized. The forwarding number, the reply number, the praise number, the sentiment value of the comment text, the first network water army identification result and the second network water army identification result are then used as multiple features and classified with a CART tree. Finally, the models are fused by weighting into a strong classifier, thereby identifying the microblog water army. The algorithm structure is shown in Fig. 1.

Fig. 2 is a flowchart of the network water army identification method provided by the invention. Referring to Fig. 2:

S1: acquiring microblog comment information; the microblog comment information comprises comment texts, forwarding numbers, reply numbers and praise numbers.

S2: performing feature extraction on the comment text to generate a data set.

The invention combines the PV-DM and PV-DBOW sentence vector models, treating each sentence vector in the text data set as the combination of a vector trained with PV-DM and a vector trained with PV-DBOW. The resulting vectors are concatenated to obtain a 400-dimensional sentence vector. The effect is illustrated below with a simple example.

The following is a simple text chosen arbitrarily from the data set:

Text 1: the morning and New day begins with a foolish smile of the lungs of commiphora guidotti. Partial results of training the fused PV-DM and PV-DBOW vector model are shown in Fig. 3.
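The concatenation step can be illustrated with a minimal Python sketch. The text only states that the final sentence vector has 400 dimensions; the even 200 + 200 split, the function name and the placeholder vectors below are assumptions made for illustration, and no Doc2Vec training is performed here.

```python
from typing import List

# Assumption: each model contributes 200 dimensions. The source only states
# that the concatenated sentence vector has 400 dimensions.
PV_DM_DIM = 200
PV_DBOW_DIM = 200


def concat_sentence_vector(pv_dm: List[float], pv_dbow: List[float]) -> List[float]:
    """Concatenate the PV-DM and PV-DBOW vectors of one comment text."""
    if len(pv_dm) != PV_DM_DIM or len(pv_dbow) != PV_DBOW_DIM:
        raise ValueError("unexpected vector size")
    return list(pv_dm) + list(pv_dbow)


# Placeholder vectors standing in for the output of two trained models.
dm_vec = [0.1] * PV_DM_DIM
dbow_vec = [-0.2] * PV_DBOW_DIM
sentence_vec = concat_sentence_vector(dm_vec, dbow_vec)
print(len(sentence_vec))
```

Keeping the two halves in a fixed order matters, because the downstream classifiers learn a weight per position.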

S3: training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result.

S3 specifically includes:

S3.1: the formula is adopted:

S3.2: establishing a first network water army identification model according to the classification result; the first network water army identification model is as follows:

wherein s.t. denotes "subject to", $y'_i$ denotes the label category, and $m$ denotes the data set length.

S3.3: dividing the data set into a first training set and a first test set at a ratio of 6:4;

S3.4: training the first network water army identification model by using the first training set to obtain a trained first network water army identification model;

S3.5: performing water army identification on the first test set by using the trained first network water army identification model to obtain the first network water army identification result.

The support vector machine algorithm is further described below:

Let the data set $D_{m1}$ be $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i$ is the class of the sample: $y_i = +1$ when the node data $x_i$ is a normal user, and $y_i = -1$ when the node data $x_i$ is a water army user. Given the constraint $y_i(w^{T}x_i + b) > 0$, for each $(x_i, y_i)$ in the data set it is desirable to have:

For a data set sample $x_i$, substituting it into the hyperplane expression gives $w^{T}x_i + b$. If $w^{T}x_i + b > 0$, then $y_i = +1$ is output, that is, sample $x_i$ is a normal user; if $w^{T}x_i + b < 0$, then $y_i = -1$ is output, that is, sample $x_i$ is a water army user. Obviously, this hyperplane can be arbitrary as long as it classifies the samples correctly. For the model to be sufficiently robust, a rule is needed to select the optimal decision plane. Converting the binary classification problem into a mathematical formula according to this rule gives the first network water army identification model as follows:

By adjusting $w$ and $b$, the distance is made as large as possible; by ranging over $x_i$, the point closest to the hyperplane is selected. With this definition, the algorithm can divide the comment texts into water army users and normal users. The invention divides 62554 pieces of data into a training set and a test set at a ratio of 6:4. Considering the uncertainty in the distribution produced by random sampling, the more rigorous stratified sampling method is adopted, so that the key features in each subset have a distribution basically consistent with that of the overall data set. The data set distribution is shown in Table 1.
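As a hedged illustration of the decision rule and of the "farthest distance to the closest point" criterion described above, the sketch below evaluates $w^{T}x + b$ for a sample and compares two candidate hyperplanes by their smallest margin. The toy samples, the two candidate hyperplanes and the function names are assumptions made for illustration; no training is performed, and this is not presented as the implementation used in the invention.

```python
import math
from typing import List, Sequence, Tuple


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Return w^T x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """+1 = normal user, -1 = water army user (label convention of the text).

    A sample lying exactly on the hyperplane is assigned -1 here; that tie
    rule is an arbitrary choice of this sketch.
    """
    return 1 if decision_value(w, b, x) > 0 else -1


def min_margin(
    w: Sequence[float], b: float, samples: List[Tuple[Sequence[float], int]]
) -> float:
    """Smallest signed distance y_i (w^T x_i + b) / ||w|| over all samples.

    A positive value means every sample satisfies y_i (w^T x_i + b) > 0.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in samples)


# Toy 2-D samples (hypothetical, not taken from the patent's data set).
samples = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([0.0, 0.0], -1), ([-1.0, 1.0], -1)]

# Two hyperplanes that both separate the samples correctly.
candidate_a = ([1.0, 1.0], -2.0)  # x1 + x2 - 2 = 0
candidate_b = ([1.0, 0.0], -1.0)  # x1 - 1 = 0

margin_a = min_margin(candidate_a[0], candidate_a[1], samples)
margin_b = min_margin(candidate_b[0], candidate_b[1], samples)
best = candidate_a if margin_a > margin_b else candidate_b
print(margin_a, margin_b)
```

Both candidates classify every toy sample correctly, which is why a further rule is needed: the hyperplane whose closest sample is farther away is preferred.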

TABLE 1 data set distribution

The model is trained on the 50843 pieces of data in the training set; when both the bias and the variance are small, training converges with a small error and a good training effect is achieved.
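The stratified split mentioned above can be sketched in plain Python as follows. The function name, the fixed random seed and the toy label list are assumptions for illustration; the sketch only shows how sampling per class keeps the class proportions of the training and test sets close to those of the whole data set.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], train_ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy labels: 60 normal users (+1) and 40 water army users (-1).
labels = [1] * 60 + [-1] * 40
train_idx, test_idx = stratified_split(labels, train_ratio=0.6)

train_water_army = sum(1 for i in train_idx if labels[i] == -1)
print(len(train_idx), len(test_idx), train_water_army)
```

With a 6:4 ratio, each class contributes 60% of its samples to the training set, so the share of water army users is the same in both subsets.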

The 12711 pieces of data in the test set are predicted using the above experimental results, and a confusion matrix is constructed from the obtained results.

TABLE 2 confusion matrix

Analyzing the classification results with the evaluation indexes of the algorithm shows that FN is 2680 water army users, while the number of water army users in the test set is 6333. The evaluation indexes of the algorithm model are shown in Table 3.

TABLE 3 Algorithm model evaluation index

S4: training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result.

S4 specifically includes:

S4.1: dividing the data set to obtain a division result; wherein the division result is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, in which $x_i = (x_1, x_2, \ldots, x_n, 1)$ represents a feature vector of dimension n whose final entry 1 represents the bias term; the label $y_i \in \{1, 0\}$, where, when $y_i = 1$, the comment text corresponding to $x_i$ belongs to a water army user, and when $y_i = 0$, the comment text corresponding to $x_i$ belongs to a normal user;

S4.2: constructing a prediction model according to the division result; wherein the prediction model is:

wherein $w$ represents a weight vector;

S4.3: establishing a likelihood function according to the prediction model; wherein the likelihood function is:

S4.4: dividing the data set into a second training set and a second test set at a ratio of 8:2;

S4.5: performing optimization training on the likelihood function by using the second training set to obtain a trained prediction model;

S4.6: classifying the second test set by using the trained prediction model to obtain the second network water army identification result.

The logistic regression algorithm is further described below:

Let the data set $D_{m2}$ be $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i = (x_1, x_2, \ldots, x_n, 1)$ is a feature vector of dimension n whose final entry 1 represents the bias term, and the label $y_i \in \{1, 0\}$ denotes one of the two classes of the data set: $y_i = 1$ is a water army user and $y_i = 0$ is a normal user. Assume that the weight vector of the model is $w = (w_1, w_2, \ldots, w_n)$. The output of the model is defined as the probability that a sample belongs to class 1, namely the probability that it is a water army user; then, for the feature vector $x_i$, the predicted value output by the model

is expressed as:

For the weight vector $w$, the closer the model output on the training set is to the given label, the better: if the label is a water army user, the output value of the model should be close to 1, and if the label is a normal user, the output value should be close to 0. The loss function can therefore be obtained by maximum likelihood estimation, that is, by establishing a likelihood function L and maximizing it.
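The maximum likelihood idea described above can be sketched as follows. The sigmoid prediction and the log-likelihood are the standard logistic regression forms; the plain gradient ascent loop, the learning rate, the number of epochs and the toy data are assumptions for illustration, since the text does not state which optimizer is used.

```python
import math
from typing import List, Sequence


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Sigmoid of w^T x; x already carries the trailing 1 for the bias term."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(
    w: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]
) -> float:
    """Logarithm of the likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(w, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(
    xs: Sequence[Sequence[float]], ys: Sequence[int], lr: float = 0.1, epochs: int = 500
) -> List[float]:
    """Maximize the log-likelihood by plain gradient ascent (assumed optimizer)."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = y - predict_proba(w, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * gj / len(xs) for wj, gj in zip(w, grad)]
    return w


# Toy one-feature data with the bias entry appended (hypothetical values).
# Label 1 = water army user, label 0 = normal user.
xs = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
ys = [0, 0, 1, 1]
w = fit(xs, ys)
print(predict_proba(w, [4.0, 1.0]), predict_proba(w, [0.0, 1.0]))
```

After training, samples labeled as water army users receive outputs closer to 1 and normal users receive outputs closer to 0, which is the behavior the likelihood function rewards.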

The invention divides 62554 pieces of data into a training set and a test set at a ratio of 8:2. Considering the uncertainty in the distribution produced by random sampling, the more rigorous stratified sampling method is adopted, so that the key features in each subset have a distribution basically consistent with that of the overall data set. Table 4 below shows the data set distribution.

TABLE 4 data set distribution

The 11440 pieces of data in the test set are predicted using the above experimental results, and the confusion matrix constructed from the obtained results is shown in Table 5.

TABLE 5 confusion matrix

Analyzing the classification results with the evaluation indexes of the algorithm shows that FN is 2351 water army users, while the number of water army users in the test set is 5738. The evaluation indexes of the algorithm model are shown in Table 6.

TABLE 6 evaluation index of algorithm model

S5: performing sentiment analysis on the data set to obtain the emotional features of the comment text.

S6: obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result.

S6 specifically includes:

S6.1: dividing the emotional features, the forwarding number, the reply number and the praise number of the comment text, the first network water army identification result and the second network water army identification result to obtain a CART data set; wherein the CART data set is:

$\{(a_1, b_1, c_1, \mathrm{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathrm{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, wherein a represents the forwarding number, b represents the reply number, c represents the praise number, Setiment represents the emotional feature of the comment text, d represents the data feature of the first network water army identification result, e represents the data feature of the second network water army identification result, and y represents the label category;

S6.2: dividing the n samples in the CART data set according to the number of samples to obtain a first CART data set and a second CART data set;

S6.3: constructing a Gini coefficient calculation formula according to the first CART data set and the second CART data set; wherein the Gini coefficient calculation formula is as follows:

S6.4: dividing the CART data set into a third training set and a third test set at a ratio of 8:2;

S6.5: obtaining a CART tree according to the Gini coefficient calculation formula and the third training set;

S6.6: pruning the CART tree to obtain a pruned CART tree; specifically, a penalty function is adopted to prune the CART tree to obtain the pruned CART tree; wherein the penalty function is:

wherein $T$ is the number of leaf nodes, $\alpha$ is the penalty parameter, $N_t$ is the number of training samples at leaf node t, $H_t$ is the empirical entropy, k is the number of classes, and $N_{tk}$ is the number of sample points of each class at leaf node t.

S6.7: classifying the third test set according to the pruned CART tree to obtain the CART tree classification result.

This process is further described below:

The invention divides 62554 pieces of data into a training set and a test set at a ratio of 8:2. Considering the uncertainty in the distribution produced by random sampling, the more rigorous stratified sampling method is adopted, so that the key features in each subset have a distribution basically consistent with that of the overall data set. Table 7 below shows the data set distribution in the invention.

TABLE 7 data set distribution

A CART tree is constructed from the data features a, b, c and Setiment of the microblog comment information and from the water army identification results d and e of the two algorithms based on the microblog comment text. The CART tree differs from other decision trees as follows. The ID3 tree selects features by information gain, preferring the feature with the higher gain. The C4.5 tree selects features by information gain ratio, which avoids the problem of a large information gain caused by features with many values. The CART classification tree algorithm selects features by the Gini coefficient and determines the optimal binary split point of each feature.

The CART tree algorithm is described as follows:

In the classification problem, K classes are assumed, and the probability that a sample point belongs to the k-th class is $P_k$. For the binary problem of the text, $K = 2$, namely normal users and water army users, so the formula of the Gini index can be simplified to:

$\mathrm{Gini}(p) = 2P(1 - P)$

Let the data set $D_s$ be:

$\{(a_1, b_1, c_1, \mathrm{Setiment}_1, d_1, e_1, y_1), \ldots, (a_n, b_n, c_n, \mathrm{Setiment}_n, d_n, e_n, y_n)\}$, n samples in total, wherein a, b, c, Setiment, d and e are the data features of each sample: a is the forwarding number, b is the reply number, c is the praise number, Setiment is the emotional feature of the comment text, and d and e are the water army identification results of the two algorithms based on the microblog comment text. According to the i-th attribute of the data set, namely $(a_i, b_i, c_i, \mathrm{Setiment}_i, d_i, e_i, y_i)$, the data set is divided into two parts $D_{s1}$ and $D_{s2}$, and the Gini coefficient is then calculated as follows:

wherein $n_1$ and $n_2$ are the numbers of samples in the data sets $D_{s1}$ and $D_{s2}$ respectively. The four Gini coefficients are compared and the smallest one is selected; the corresponding attribute and attribute value are taken as the optimal splitting attribute of the sample.
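The split selection described above can be sketched as follows. The Gini impurity and the sample-weighted split score are the standard CART definitions; the midpoint search over a single numeric feature, the function names and the toy forwarding numbers are assumptions for illustration only.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_k p_k^2, which equals 2p(1-p) for two classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left: Sequence[int], right: Sequence[int]) -> float:
    """Weighted impurity n1/n * Gini(D_s1) + n2/n * Gini(D_s2)."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)


def best_threshold(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try each midpoint between sorted distinct values; return (threshold, score)."""
    distinct = sorted(set(values))
    best = (float("nan"), float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = gini_split(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy feature: forwarding number a (hypothetical values and labels).
# Label 1 = water army user, label 0 = normal user.
forwarding = [0, 1, 2, 50, 60, 80]
labels = [0, 0, 0, 1, 1, 1]
threshold, score = best_threshold(forwarding, labels)
print(threshold, score)
```

In a full CART implementation the same search is repeated for every feature (a, b, c, Setiment, d, e) at every node, and the feature and threshold with the smallest score become the split.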

The 11440 pieces of data in the test set are predicted using the above experimental results, and the confusion matrix constructed from the obtained results is shown in Table 8.

TABLE 8 confusion matrix

Because CART trees are prone to overfitting, pruning is required to improve generalization capability. The invention uses a penalty function to measure the degree of overfitting.

In the pruning process, the CART tree is traversed from bottom to top and branches are pruned successively until the root node is reached, which generates a sequence of subtrees. The pruning criterion compares the penalty function of the subtree before and after pruning: if the value after pruning is less than the value before pruning, the branch is pruned. Pruning thus reduces the complexity of the tree.
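The before-and-after comparison can be sketched as follows, assuming the penalty function has the standard cost-complexity form (sum over leaves of $N_t H_t$ plus $\alpha$ times the number of leaves). The function names and the per-leaf class counts are hypothetical; the strict "less than" test follows the wording of the text.

```python
import math
from typing import List, Sequence


def empirical_entropy(class_counts: Sequence[int]) -> float:
    """H_t = -sum_k (N_tk / N_t) * log(N_tk / N_t) for one leaf."""
    n_t = sum(class_counts)
    return -sum((c / n_t) * math.log(c / n_t) for c in class_counts if c > 0)


def penalty(leaves: List[Sequence[int]], alpha: float) -> float:
    """Sum over leaves of N_t * H_t, plus alpha times the number of leaves."""
    fit = sum(sum(counts) * empirical_entropy(counts) for counts in leaves)
    return fit + alpha * len(leaves)


def should_prune(children: List[Sequence[int]], alpha: float) -> bool:
    """Prune when merging the sibling leaves into one leaf lowers the penalty."""
    merged = [sum(col) for col in zip(*children)]
    return penalty([merged], alpha) < penalty(children, alpha)


# Per-leaf class counts as [normal users, water army users] (hypothetical).
children = [[8, 2], [7, 3]]
print(should_prune(children, alpha=0.05), should_prune(children, alpha=1.0))
```

With a small penalty parameter the split is kept because it fits the training data slightly better; with a larger one the saving of one leaf outweighs that gain and the branch is pruned.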

Analyzing the classification results with the evaluation indexes of the algorithm shows that FN is 707 water army users, while the number of water army users in the test set is 5677. This indicates that using the output d of the water army identification algorithm based on the microblog comment text as an input of the algorithm in this section, so that multiple features (Setiment, a, b, c, d, e) serve as the input of the CART tree, works well. Table 9 shows the evaluation indexes of the CART tree algorithm model.

TABLE 9 Algorithm model evaluation index

S7: respectively extracting the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result to generate a first prediction result feature, a second prediction result feature and a third prediction result feature;

S8: performing weighted fusion on the first prediction result feature, the second prediction result feature and the third prediction result feature to obtain a network water army recognition result.

In practical application, the water army identification model based on the microblog comment text and the water army identification model based on the microblog comment information are fused: following the Boosting idea, the two classifiers are weighted to obtain a strong classifier. The water army identification algorithm is described as follows.

The above describes the process of the water army recognition algorithm based on microblog comments: using the Boosting idea, the water army recognition model based on the microblog comment text and the water army recognition model based on the microblog comment information are fused and given different weights, and the algorithm is then trained iteratively so that it can recognize water army users. The confusion matrix obtained by predicting the test set with the algorithm described above is shown in Table 10.

TABLE 10 confusion matrix

By comparison, the fusion algorithm performs better. The evaluation indexes of the algorithm are shown in Table 11:

TABLE 11 Evaluation indexes of the algorithm

According to the invention, the first network water army recognition result, the second network water army recognition result and the CART tree classification result are subjected to weighted fusion, so that the behavior characteristics of each network water army can be fused, and the recognition precision of the network water army is greatly improved.
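The weighted fusion can be illustrated with a minimal sketch. The text does not state how the weights are computed, so the AdaBoost-style weight 0.5·ln((1 − err)/err) used here is an assumption, as are the function names and the error rates. Labels follow the convention of claim 2 (+1 for a normal user, -1 for a water army user).

```python
import math


def classifier_weight(error_rate):
    """AdaBoost-style weight 0.5 * ln((1 - err) / err).
    Assumed for illustration; the patent does not give its weight formula."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


def fuse(predictions, weights):
    """Weighted vote over labels in {+1, -1}: sign of the weighted sum."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1


# Hypothetical error rates for the SVM, logistic regression and CART tree results.
errors = [0.20, 0.25, 0.10]
weights = [classifier_weight(e) for e in errors]

# One sample: the first two classifiers predict a normal user (+1),
# the CART tree predicts a water army user (-1).
sample_predictions = [+1, +1, -1]
fused = fuse(sample_predictions, weights)
print(weights, fused)
```

A classifier with a lower error rate receives a larger weight, so the CART tree result counts most here, but the two agreeing classifiers together still outweigh it in this toy case.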

the CART tree training module is used for obtaining a CART tree classification result according to the emotional characteristics, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text;

and the characteristic weighting and fusing module is used for weighting and fusing the first prediction result characteristic, the second prediction result characteristic and the third prediction result characteristic to obtain a network water army recognition result.

The invention discloses a network water army recognition method and system. The method first trains a data set with a support vector machine algorithm and a logistic regression algorithm to obtain a first network water army recognition result and a second network water army recognition result. It then obtains a CART tree classification result from the emotional features, the forwarding number, the reply number and the praise number of the comment text, together with the first network water army recognition result and the second network water army recognition result. Finally, it extracts the classification features of the first network water army recognition result, the classification features of the second network water army recognition result and the classification features of the CART tree classification result, and performs weighted fusion to obtain the network water army recognition result. Because the first network water army recognition result, the second network water army recognition result and the CART tree classification result are subjected to weighted fusion, the behavior characteristics of each network water army can be fused, and the recognition precision of the network water army is greatly improved.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A network water army identification method, characterized by comprising the following steps:

2. The network water army identification method according to claim 1, wherein step 3, training the data set by adopting a support vector machine algorithm to obtain a first network water army identification result, comprises the following steps:

step 3.1: the formula is adopted:

classifying the data set to obtain a classification result; wherein (w, b) denotes the hyperplane w^T x_i + b, w denotes the normal vector of the hyperplane, b denotes the offset that determines the distance from the hyperplane to the origin, x_i denotes the node data, and y_i denotes the category of the sample: when y_i is +1, the comment text corresponding to x_i belongs to a normal user; when y_i is -1, the comment text corresponding to x_i belongs to a water army user;

3. The network water army identification method according to claim 2, wherein the first network water army identification model is:

wherein y'_i denotes the label category and m denotes the length of the data set.

4. The network water army identification method according to claim 1, wherein step 4, training the data set by adopting a logistic regression algorithm to obtain a second network water army identification result, comprises the following steps:

wherein w represents a weight vector;

5. The network water army identification method according to claim 1, wherein step 6, obtaining a CART tree classification result according to the emotional features, the forwarding number, the reply number, the praise number, the first network water army recognition result and the second network water army recognition result of the comment text, comprises the following steps:

step 6.6: pruning the CART tree to obtain a pruned CART tree;

6. The network water army identification method according to claim 5, wherein the calculation formula of the Gini coefficient is as follows:

7. The network water army identification method according to claim 5, wherein step 6.6, pruning the CART tree to obtain a pruned CART tree, comprises:

wherein T is the number of leaf nodes, α is the penalty parameter, N_t is the number of training samples at leaf node t, H_t is the empirical entropy of leaf node t, k denotes the class, and N_tk is the number of sample points of class k at leaf node t.

8. A network water army identification system, comprising: