CN113821706A

CN113821706A - Social network user reliability evaluation method based on soft interval support vector machine

Info

Publication number: CN113821706A
Application number: CN202111119250.3A
Authority: CN
Inventors: 邢玲; 高建平; 吴红海; 赵康; 姚景龙
Original assignee: Henan University of Science and Technology
Current assignee: Henan University of Science and Technology
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2021-12-21
Anticipated expiration: 2041-09-24
Also published as: CN113821706B

Abstract

The invention discloses a social network user credibility evaluation method based on a soft-margin (soft interval) support vector machine. User profile information and user-generated content information are crawled from a social network, and each user is labeled. For each user, a user profile information credibility is calculated from the profile information, and a user-generated content information credibility is calculated from the generated content information. The vector formed by these two credibility values serves as the input of a training sample and the user's label serves as the sample label, and the soft-margin support vector machine is trained on these samples. When the credibility of a user in the social network needs to be evaluated, the user's profile information credibility and user-generated content information credibility are obtained and input into the trained soft-margin support vector machine to obtain the user credibility evaluation result. By using a soft-margin support vector machine, the invention improves the accuracy of user credibility evaluation.

Description

Social network user credibility evaluation method based on a soft-margin support vector machine

Technical Field

The invention belongs to the technical field of social network user credibility evaluation, and particularly relates to a social network user credibility evaluation method based on a soft-margin support vector machine.

Background

In the big data era, the number of social network platforms and users has grown explosively. Social network platforms have become indispensable media for information exchange and dissemination in daily life, and they now host huge and complex user groups. Users are the key nodes through which information spreads on a social platform, so a flood of malicious users undermines the smooth and healthy dissemination of information. Evaluating the credibility of social network users is therefore of significant research value in fields such as information screening, public opinion governance, network security and user identification, and quantifying and evaluating user credibility has become an important research topic.

Social networks make information exchange and emotional expression more convenient, but their openness also allows large numbers of malicious users into the network. Malicious users generate large amounts of false and malicious behavior or information, and raise their apparent credibility through fictitious profile information. To better distinguish malicious users from trusted users, user credibility must be evaluated reasonably and accurately. Social network user credibility evaluation mainly consists of quantitatively analyzing user information in the network and representing user credibility by values computed from that information. To ensure that the evaluation is accurate and reasonable, the processing and quantification of each feature item in the user profile information and the user-generated content information need to be strengthened, and the precision of the evaluation algorithm needs to be improved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a social network user credibility evaluation method based on a soft-margin support vector machine.

To achieve this purpose, the social network user credibility evaluation method based on the soft-margin support vector machine comprises the following steps:

S1: crawling the profile information and generated content information of N users from a social network, wherein the profile information of a user comprises the user nickname, education level, user bio (self-introduction) and mutual-follower count, and the generated content information of a user comprises the like count, forwarding count and comment count of the user's posts; then labeling the users, where label flag(i) = 1 indicates that user i is trusted and flag(i) = 0 indicates that user i is untrusted, i = 1, 2, …, N;

S2: extracting feature attribute data from the profile information of each user, and calculating the user profile information credibility U_P(i);

S3: extracting feature attribute data from the generated content information of each user, and calculating the user-generated content information credibility U_ucg(i);

S4: taking the vector (U_P(i), U_ucg(i)) formed by the user profile information credibility U_P(i) and the user-generated content information credibility U_ucg(i) of each user as the input of a training sample, and taking the user's label flag(i) as the label of the training sample;

S5: adopting a soft-margin support vector machine as the social network user credibility evaluation model, and training it with the training samples obtained in step S4;

S6: when the credibility of a user in the social network needs to be evaluated, calculating the user's profile information credibility by the method of step S2 and the user's generated content information credibility by the method of step S3, forming the two values into a vector, and inputting the vector into the soft-margin support vector machine trained in step S5 to obtain the user credibility evaluation result.

In the social network user credibility evaluation method based on a soft-margin support vector machine according to the invention, user profile information and user-generated content information are crawled from a social network and the users are labeled. The user profile information credibility is calculated from each user's profile information, and the user-generated content information credibility is calculated from each user's generated content information. The vector formed by these two credibility values is used as the input of a training sample and the user's label as the sample label, and the soft-margin support vector machine is trained on these samples. When the credibility of a user in the social network needs to be evaluated, the user's profile information credibility and user-generated content information credibility are obtained and input into the trained soft-margin support vector machine to obtain the user credibility evaluation result.

By using the soft-margin support vector machine, the method moves user credibility evaluation from the one-dimensional space of a linear sum into a two-dimensional coordinate space, which resolves the aliasing of evaluation results near the classification threshold and improves the accuracy of user credibility evaluation.

Drawings

FIG. 1 is a flowchart of an embodiment of the social network user credibility evaluation method based on a soft-margin support vector machine according to the present invention;

FIG. 2 is a graph comparing the accuracy of the credibility evaluation results of the present invention and three comparison methods;

FIG. 3 is a graph comparing the precision of the credibility evaluation results of the present invention and three comparison methods;

FIG. 4 is a graph comparing the recall of the credibility evaluation results of the present invention and three comparison methods;

FIG. 5 is a graph comparing the F1 scores of the credibility evaluation results of the present invention and three comparison methods.

Detailed Description

Embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that detailed descriptions of known functions and designs are omitted below where they would obscure the subject matter of the present invention.

Examples

FIG. 1 is a flowchart of a specific embodiment of the social network user credibility evaluation method based on a soft-margin support vector machine according to the present invention. As shown in FIG. 1, the method specifically comprises the following steps:

S101: acquiring user data:

The profile information and generated content information of N users are crawled from a social network. The profile information of a user comprises the user nickname, education level, user bio (self-introduction) and mutual-follower count, and the generated content information of a user comprises the like count, forwarding count and comment count of the user's posts. The users are then labeled: flag(i) = 1 indicates that user i is trusted and flag(i) = 0 indicates that user i is untrusted, i = 1, 2, …, N.

S102: calculating the user profile information credibility:

User profile information in a social network reflects the authenticity of the user and is highly credible, so the credibility of the user profile information can be used to evaluate the credibility of the user in the social network. For example, the Sina Weibo platform has a complete personal information system and applies strict format checks when personal information is filled in, which helps ensure that the information is authentic and valid. The user information involved comprises 20 types: 14 types of user profile information and 6 types of user-generated content information. The user profile information includes: user nickname, UID, gender, birthday, educational background, user bio, URL, occupation, company, hometown, follower count, following count, mutual-follower count and interest tags. The user-generated content information includes: number of posts, post like count, post forwarding count, post comment count, post tags and special characters in posts.

Feature attribute data are extracted from the profile information of each user, and the user profile information credibility U_P(i) is calculated.

In this embodiment, the user profile information credibility is divided into an overall credibility and a local credibility. The overall credibility of the user profile information is represented by the user profile information integrity, and the local credibility is represented by the user profile information influence index. The user profile information credibility is obtained by linearly summing the user profile information integrity and the user profile information influence index.

The user profile information integrity is the ratio of the number of personal information tags that a user is willing to disclose to other users in the social network to the total number of tags in the user information integrity evaluation system. The user profile information integrity UI(i) is therefore calculated as follows:

UI(i) = A(i)/n

where A(i) represents the number of personal information tags actually disclosed by user i, and n represents the total number of personal information tags of users in the social network.

The user profile information influence index is the weighted sum of a limited number of quantified feature items that contribute strongly to the credibility calculation. The feature items in the user profile information are numerous, and selecting too many of them introduces calculation errors and increases computational overhead, so in this embodiment only the user bio, user nickname, education level and mutual-follower count are selected to represent the user profile information influence index. The user profile information influence index G(i) of each user is therefore calculated as follows:

G(i) = λ₁F(i) + λ₂E(i) + λ₃P(i) + λ₄H(i)

where F(i) represents the nickname category number of user i, F(i) = 1, 2, …, K_F, and K_F represents the number of nickname categories; E(i) represents the education level of user i, E(i) = 1, 2, …, K_E, and K_E represents the number of education levels; P(i) represents the bio status of user i, where P(i) = 0 indicates that user i has no bio and P(i) = 1 indicates that user i has a bio; H(i) represents the mutual-follower count level of user i, H(i) = 1, 2, …, K_H, and K_H represents the number of mutual-follower count levels; λ₁, λ₂, λ₃ and λ₄ represent the preset weights of the nickname category F(i), education level E(i), bio status P(i) and mutual-follower count level H(i), respectively.

Then the user profile information credibility U_P(i) of each user is calculated by the following formula:

U_P(i) = UI(i) + G(i)
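The two-part calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the `Profile` class, the function name and the default `n_tags=14` (taken from the 14 profile information types listed earlier) are assumptions, and the default weights are simply the example values reported later in this embodiment.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical container for the quantified profile features of one user."""
    disclosed_tags: int   # A(i): number of personal information tags disclosed
    nickname_type: int    # F(i): nickname category number, 1..K_F
    education_level: int  # E(i): education level, 1..K_E
    has_bio: int          # P(i): 1 if the user has a bio, otherwise 0
    mutual_level: int     # H(i): mutual-follower count level, 1..K_H


def profile_credibility(p, n_tags=14, weights=(0.375, 0.25, 0.25, 0.125)):
    """Return U_P(i) = UI(i) + G(i) for one user.

    n_tags (n) and the weights are assumed example values, not fixed by the method.
    """
    l1, l2, l3, l4 = weights
    ui = p.disclosed_tags / n_tags                      # UI(i) = A(i)/n
    g = (l1 * p.nickname_type + l2 * p.education_level  # G(i), weighted sum
         + l3 * p.has_bio + l4 * p.mutual_level)
    return ui + g


user = Profile(disclosed_tags=7, nickname_type=2, education_level=4,
               has_bio=1, mutual_level=6)
u_p = profile_credibility(user)
print(u_p)
```

Note that UI(i) lies in [0, 1] while G(i) scales with the level counts, so in this sketch the influence index dominates the sum unless the feature levels are rescaled.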

The weights λ₁, λ₂, λ₃ and λ₄ required by the user profile information influence index must account for the differences in type and magnitude between the different pieces of information. In this embodiment the weights are calculated by an entropy weight distribution method, with the following specific steps:

According to the number of values that each of the four feature items in the user profile information influence index can take, a weight distribution judgment matrix A_G of the feature items is constructed:

where K_P = 2 represents the number of values that the user's bio status can take.

The judgment matrix A_G represents the ratios between the four feature items, and each ratio expresses the relative importance of two feature items in the influence index. Eigendecomposition is performed on the judgment matrix A_G to obtain the maximum eigenvalue λ_max; the corresponding eigenvector is normalized, and the normalized vector is taken as the weight vector (λ₄, λ₃, λ₂, λ₁).

In this embodiment, K_F = 2, K_E = 4 and K_H = 6, and the judgment matrix A_G is:

The maximum eigenvalue obtained by eigendecomposition is λ_max = 4, and the consistency ratio CR = 0.0006 is much smaller than 0.1, which satisfies the consistency test and indicates that the judgment matrix is reasonable. Normalizing the eigenvector corresponding to λ_max = 4 gives the weights of the feature items: λ₁ = 0.375, λ₂ = 0.25, λ₃ = 0.25 and λ₄ = 0.125.
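The eigenvector step can be illustrated with a short Python sketch. The judgment matrix of the embodiment is not reproduced in this text, so the matrix below is a hypothetical, perfectly consistent one built from the reported weights (entry j,k equals w_j/w_k); it only demonstrates how a principal eigenvector is extracted and normalized. Power iteration is used here as a simple stand-in for a full eigendecomposition, and the random index 0.90 is the commonly tabulated value for a 4x4 matrix in consistency testing, which the source does not state.

```python
def principal_eigen(matrix, iterations=100):
    """Power iteration: return (largest eigenvalue, eigenvector normalized to sum 1)."""
    n = len(matrix)
    vec = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        prod = [sum(matrix[r][c] * vec[c] for c in range(n)) for r in range(n)]
        lam = sum(prod)              # vec sums to 1, so sum(A @ vec) estimates lambda
        vec = [p / lam for p in prod]
    return lam, vec


def consistency_ratio(lam_max, n, random_index=0.90):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1); RI value is an assumption."""
    return ((lam_max - n) / (n - 1)) / random_index


# Hypothetical judgment matrix, NOT the matrix of the embodiment.
target = [0.375, 0.25, 0.25, 0.125]
A_G = [[wj / wk for wk in target] for wj in target]

lam_max, weights = principal_eigen(A_G)
cr = consistency_ratio(lam_max, len(A_G))
print(lam_max, weights, cr)
```

For a perfectly consistent matrix the maximum eigenvalue equals the matrix order and CR is zero; the small nonzero CR reported in the embodiment suggests its matrix is only approximately consistent.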

S103: calculating the user-generated content information credibility:

The invention defines the user-generated content information credibility in terms of two aspects of the posts published by a user, the influence breadth and the propagation breadth, which characterize user credibility from the two different angles of the influence and the propagation of user-generated content in the social network. The user-generated content information credibility is obtained by linearly summing the results of the two parts.

The influence breadth of a user's posts is the degree to which the posts published by the user affect other users, and is mainly reflected in how often other users like and comment on the target user's posts. The influence breadth IU(i) of the posts published by each user is calculated as follows:

where M_i represents the number of posts published by user i, D_{i,m} represents the like count of the m-th post published by user i, and C_{i,m} represents the comment count of the m-th post published by user i, m = 1, 2, …, M_i. The 1 added to the denominator in the formula prevents the denominator from being zero. The larger the value of the influence breadth IU(i), the wider the influence of the user-generated content.

The propagation breadth of a user's posts is how often the posts published by the user are viewed by other users, and is mainly measured by the length of the forwarding chain of the user's posts: the longer the forwarding chain of a post, the wider the propagation of the user-generated content. The propagation breadth CU(i) of the posts published by each user is calculated as follows:

where RT_{i,m} represents the forwarding chain length of the m-th post published by user i.

Then the user-generated content information credibility U_ucg(i) of each user is calculated by the following formula:

U_ucg(i) = IU(i) + CU(i)
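The exact formulas for IU(i) and CU(i) are not reproduced in this text, so the following Python sketch rests on an explicit assumption: both quantities are taken to be per-post averages with M_i + 1 in the denominator, which is one form consistent with the remark that 1 is added to the denominator to avoid division by zero. The function name and the sample numbers are hypothetical.

```python
def content_credibility(likes, comments, forward_chain_lengths):
    """Return U_ucg(i) = IU(i) + CU(i) for one user.

    ASSUMED forms (not confirmed by the source text):
      IU(i) = sum_m (D_{i,m} + C_{i,m}) / (M_i + 1)
      CU(i) = sum_m RT_{i,m} / (M_i + 1)
    """
    m_i = len(likes)                                   # M_i: number of posts
    iu = sum(d + c for d, c in zip(likes, comments)) / (m_i + 1)
    cu = sum(forward_chain_lengths) / (m_i + 1)
    return iu + cu


# Hypothetical user with three posts.
u_ucg = content_credibility(likes=[10, 20, 30],
                            comments=[1, 2, 3],
                            forward_chain_lengths=[2, 4, 6])
print(u_ucg)
```

Under this assumed form, a user with no posts receives a credibility of 0 instead of raising a division error.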

S104: determining the training samples:

The vector (U_P(i), U_ucg(i)) formed by the user profile information credibility U_P(i) and the user-generated content information credibility U_ucg(i) of each user is taken as the input of a training sample, and the user's label flag(i) is taken as the label of the training sample.

S105: training the soft-margin support vector machine:

A soft-margin support vector machine is adopted as the social network user credibility evaluation model and is trained with the training samples obtained in step S104.

The input data of the social network user credibility evaluation model in the invention are two-dimensional data (U_P(i), U_ucg(i)). With the linear discriminant function f(x) = W^T x + b in the two-dimensional space, the samples can be separated by the hyperplane W^T x + b = 0, where x denotes the input, W the weight vector, b the classification threshold, and the superscript T denotes transposition. For the classification line to classify all samples correctly, it must satisfy the following formula:

y_i(W^T x_i + b) - 1 ≥ 0, i = 1, 2, …, N, y_i = ±1

where W = {ω₁, ω₂, …, ω_d} is the normal vector, which determines the direction of the hyperplane, d is the number of features, and b determines the distance between the hyperplane and the origin. Once W and b are determined, a separating hyperplane is uniquely determined. The distance between the two boundary hyperplanes on either side of the separating hyperplane is 2/||W||.

Specifically, the constraints that all sample points in the support vector machine need to satisfy are expressed algebraically as follows:

W^T x_i + b ≥ +1 for y_i = +1; W^T x_i + b ≤ -1 for y_i = -1

The soft-margin support vector machine allows some samples not to satisfy these constraints. Linear inseparability means that some sample points cannot satisfy the condition that the functional margin is greater than or equal to 1, that is, 1 - y_i(W^T x_i + b) > 0 for those points. The solution is to introduce a slack variable ζ_i for each sample point so that, for the sample points that do not satisfy the constraint, the functional margin plus the slack variable is greater than or equal to 1. The constraint then becomes the following:

where ζ_0/1 denotes the 0/1 loss function, as follows:

On the one hand, to optimize the soft-margin support vector machine and improve evaluation accuracy, slack variables are introduced into the constraint and a balance coefficient C is added to the objective function. On the other hand, the separating hyperplane must maximize the margin while keeping the number of samples that violate the constraint as small as possible, so the balance coefficient C is used to trade off margin maximization against the constraint y_i(W^T x_i + b) + ζ_i ≥ 1. The soft-margin support vector machine can then be written as the following optimization problem:

min_{W,b,ζ} (1/2)||W||^2 + C Σ_{i=1}^{N} ζ_i

s.t. y_i(W^T x_i + b) ≥ 1 - ζ_i, ζ_i ≥ 0, i = 1, 2, …, N

where C > 0 is called the balance coefficient: a large C penalizes misclassification more heavily and a small C penalizes it less. The balance coefficient takes values C = 10^k, k = -3, -2, -1, 0, 1, 2, 3.

By adopting the soft-margin support vector machine, user credibility evaluation is moved from the one-dimensional space of a linear sum into a two-dimensional coordinate space, which improves the accuracy of user credibility evaluation.
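As a rough illustration of training and applying such a classifier on two-dimensional inputs (U_P, U_ucg), the Python sketch below minimizes the equivalent unconstrained hinge-loss form, 0.5*||W||^2 + C * Σ max(0, 1 - y_i(W^T x_i + b)), by plain subgradient descent. The solver, the learning rate, the epoch count and the six sample points are all assumptions chosen for brevity; the source does not specify how the optimization problem is solved, and a library solver would normally be used in practice.

```python
def train_soft_margin_svm(samples, flags, C=1.0, lr=0.01, epochs=5000):
    """Fit (W, b) for 2-D inputs by subgradient descent on the hinge-loss objective.

    flags use the labeling flag(i) in {1, 0}; they are mapped to y_i in {+1, -1}.
    """
    y = [1 if f == 1 else -1 for f in flags]
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)                 # gradient of 0.5 * ||W||^2
        grad_b = 0.0
        for x, yi in zip(samples, y):
            # Samples with functional margin < 1 have positive slack and contribute.
            if yi * (w[0] * x[0] + w[1] * x[1] + b) < 1:
                grad_w[0] -= C * yi * x[0]
                grad_w[1] -= C * yi * x[1]
                grad_b -= C * yi
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def predict_flag(w, b, x):
    """Return 1 (trusted) if W^T x + b >= 0, otherwise 0 (untrusted)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0


# Hypothetical training samples (U_P(i), U_ucg(i)) with labels flag(i).
samples = [(1.6, 3.0), (1.8, 2.5), (1.5, 2.8), (0.4, 0.3), (0.6, 0.5), (0.3, 0.8)]
flags = [1, 1, 1, 0, 0, 0]

w, b = train_soft_margin_svm(samples, flags, C=1.0)
predictions = [predict_flag(w, b, x) for x in samples]
print(w, b, predictions)
```

To follow the stated range of the balance coefficient, the same function could be called for each C = 10^k, k = -3, …, 3, keeping the value that performs best on held-out users.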

S106: evaluating user credibility:

When the credibility of a user in the social network needs to be evaluated, the user's profile information credibility is calculated by the method of step S102 and the user's generated content information credibility is calculated by the method of step S103. The two values are formed into a vector and input into the soft-margin support vector machine trained in step S105 to obtain the user credibility evaluation result.

To better illustrate the technical effects of the invention, it is experimentally verified with a specific example. In the experiment, user data are collected from the social network Sina Weibo, and accuracy, precision, recall and F-measure (F1) are used as evaluation metrics for the credibility evaluation results.
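For reference, the four metrics can be computed from predicted and true labels as in the following Python sketch. It uses the standard definitions and assumes that the trusted class (flag = 1) is treated as the positive class, which the text does not state explicitly; the label lists are hypothetical.

```python
def evaluation_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F1), treating flag 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels for eight users.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = evaluation_metrics(y_true, y_pred)
print(metrics)
```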

To compare the effectiveness and rationality of the invention in user credibility evaluation, this embodiment selects three user credibility evaluation methods as comparison methods and compares their results with those of the invention. Comparison method 1 adopts the method in "A. Narayanan, A. Garg, I. Arora, T. Sureka et al., 'IronSense: Towards the Identification of Fake User-Profiles on Twitter Using Machine Learning,' 2018 Fourteenth International Conference on Information Processing (ICINPRO), pp. 1-7, Bangalore, India, 2018", which applies machine learning algorithms to quantify user profile information for user credibility evaluation. Comparison method 2 adopts the method in "H. Slimi, I. Bounhas and Y. Slimani, 'URL-Based Tweet Credibility Evaluation,' 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pp. 1-6, Abu Dhabi, United Arab Emirates, Nov. 2019", which represents user credibility by quantifying user-generated content information. Comparison method 3 adopts the method in "'Identification of Influential Online Social Network Users Based on Multi-Features', International Journal of Pattern Recognition and Artificial Intelligence, vol. 30, no. 6, pp. 1659015.1-1659015.15, 2016", which comprehensively considers multiple types of user information and quantifies and processes the user information with the PageRank algorithm to evaluate user credibility.

FIG. 2 is a graph comparing the accuracy of the credibility evaluation results of the present invention and the three comparison methods. FIG. 3 is a graph comparing the precision of the credibility evaluation results of the present invention and the three comparison methods. FIG. 4 is a graph comparing the recall of the credibility evaluation results of the present invention and the three comparison methods. FIG. 5 is a graph comparing the F1 scores of the credibility evaluation results of the present invention and the three comparison methods. As can be seen from FIGS. 2 to 5, the method of the invention evaluates user credibility in a two-dimensional plane, which avoids the aliasing of trusted and malicious users at the classification threshold that linear summation easily causes. The evaluation metrics are negatively correlated with the number of users: as the number of users increases, more users fall inside the margin, so the slack variables grow, the tolerance to noisy data decreases, the error of the evaluation results increases, and the metrics trend downward. The accuracy of the three comparison methods and of the invention declines by 0.13, 0.14, 0.08 and 0.07, respectively; the method of the invention has the smallest decline and therefore better robustness.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the invention as defined by the appended claims, all inventions and creations that make use of the inventive concept are protected.

Claims

1. A social network user credibility assessment method based on a soft interval support vector machine is characterized by comprising the following steps:

2. The social network user credibility evaluation method according to claim 1, wherein the user profile information credibility U_P(i) in step S2 is calculated as follows:

calculating the user profile information integrity UI(i) by the following formula:

UI(i) = A(i)/n

where A(i) represents the number of personal information tags actually disclosed by user i, and n represents the total number of personal information tags of users in the social network;

calculating the user profile information influence index G(i) by the following formula:

G(i) = λ₁F(i) + λ₂E(i) + λ₃P(i) + λ₄H(i)

where F(i) represents the nickname category number of user i, F(i) = 1, 2, …, K_F, and K_F represents the number of nickname categories; E(i) represents the education level of user i, E(i) = 1, 2, …, K_E, and K_E represents the number of education levels; P(i) represents the bio status of user i, where P(i) = 0 indicates that user i has no bio and P(i) = 1 indicates that user i has a bio; H(i) represents the mutual-follower count level of user i, H(i) = 1, 2, …, K_H, and K_H represents the number of mutual-follower count levels; λ₁, λ₂, λ₃ and λ₄ represent the preset weights of the nickname category F(i), education level E(i), bio status P(i) and mutual-follower count level H(i), respectively;

U_P(i) = UI(i) + G(i).

3. The social network user credibility evaluation method according to claim 1, wherein the weights λ₁, λ₂, λ₃ and λ₄ are calculated by an information entropy weight distribution method, specifically as follows:

where K_P = 2 represents the number of bio states of the user;

eigendecomposition is performed on the judgment matrix A_G to obtain the maximum eigenvalue λ_max, the corresponding eigenvector is normalized, and the normalized vector is taken as the weight vector (λ₄, λ₃, λ₂, λ₁).

4. The social network user credibility evaluation method according to claim 1, wherein the user-generated content information credibility U_ucg(i) in step S3 is calculated as follows:

calculating the influence breadth IU(i) of the posts published by each user by the following formula:

where M_i represents the number of posts published by user i, D_{i,m} represents the like count of the m-th post published by user i, and C_{i,m} represents the comment count of the m-th post published by user i, m = 1, 2, …, M_i;

calculating the propagation breadth CU(i) of the posts published by each user by the following formula:

where RT_{i,m} represents the forwarding chain length of the m-th post published by user i;

U_ucg(i) = IU(i) + CU(i).