CN113821706B

CN113821706B - Social network user credibility assessment method based on soft interval support vector machine

Info

Publication number: CN113821706B
Application number: CN202111119250.3A
Authority: CN
Inventors: 邢玲; 高建平; 吴红海; 赵康; 姚景龙
Original assignee: Henan University of Science and Technology
Current assignee: Henan University of Science and Technology
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2024-03-19
Anticipated expiration: 2041-09-24
Also published as: CN113821706A

Abstract

The invention discloses a social network user credibility assessment method based on a soft interval support vector machine. Profile information and generated content information of users are crawled from a social network, and the users are labeled. The user profile information credibility is calculated from the profile information of each user, and the user generated content information credibility is calculated from the generated content information of each user. The vector formed by these two credibility values is taken as the input of a training sample and the label of the user as the label of the training sample, and the soft interval support vector machine is trained. When the credibility of a user in the social network needs to be assessed, the user profile information credibility and the user generated content information credibility of that user are obtained and input into the trained soft interval support vector machine to obtain the user credibility assessment result. By using the soft interval support vector machine, the invention improves the accuracy of user credibility assessment.

Description

Social network user credibility assessment method based on soft interval support vector machine

Technical Field

The invention belongs to the technical field of credibility assessment of social network users, and particularly relates to a social network user credibility assessment method based on a soft interval support vector machine.

Background

In the big data era, the number of social network platforms and of their users has grown explosively. Social network platforms have become an indispensable medium for information exchange and dissemination in daily life, and they host a huge and complex user population. Users are the key nodes through which information spreads on a social platform, so a flood of malicious users impairs the smooth and healthy propagation of information on the platform. Credibility assessment of social network users is also of great research significance in fields such as information screening, public opinion governance, network security and user identification. Quantifying and assessing the credibility of users has therefore become an important topic in social network credibility research.

Social networks make it convenient for people to exchange information and express emotions, but their openness also allows a large number of malicious users to flood the network. Malicious users generate large amounts of false and malicious behavior or information in the social network, and inflate their own credibility through fabricated profile information. To better distinguish malicious users from trusted users in a social network, user credibility must be assessed reasonably and accurately. Social network user credibility assessment mainly consists of quantitatively analyzing the user information in the network and representing user credibility through calculations on that information. To ensure that the assessment is accurate and reasonable, the processing and quantification of each feature item in the user profile information and the user generated content information need to be strengthened, and the accuracy of the user credibility assessment algorithm needs to be improved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a social network user credibility assessment method based on a soft interval support vector machine. The soft interval support vector machine algorithm moves the user credibility assessment from the one-dimensional space produced by linear summation to a two-dimensional coordinate space, thereby improving the accuracy of user credibility assessment.

In order to achieve the above purpose, the social network user credibility assessment method based on the soft interval support vector machine comprises the following steps:

S1: crawling profile information and generated content information of N users from a social network, wherein the profile information of a user comprises the user nickname, the user education level, the user profile (self-introduction) and the user mutual follower count, and the generated content information of a user comprises the number of likes, the number of forwards and the number of comments of the user's blog posts; then labeling the users, wherein a label flag(i) = 1 indicates that user i is trusted and a label flag(i) = 0 indicates that user i is not trusted, i = 1, 2, …, N;

S2: extracting feature attribute data from the profile information of each user, and then calculating the user profile information credibility $U_P(i)$;

S3: extracting feature attribute data from the generated content information of each user, and then calculating the user generated content information credibility $U_{ucg}(i)$;

S4: taking the vector $(U_P(i), U_{ucg}(i))$ formed by the user profile information credibility $U_P(i)$ and the user generated content information credibility $U_{ucg}(i)$ of each user as the input of a training sample, and taking the label flag(i) of the user as the label of the training sample;

S5: adopting the soft interval support vector machine as the social network user credibility assessment model, and training it with the training samples obtained in step S4;

S6: when the credibility of a user in the social network needs to be assessed, calculating the user profile information credibility of the user by the method of step S2 and the user generated content information credibility of the user by the method of step S3, forming the two values into a vector, and inputting the vector into the soft interval support vector machine trained in step S5 to obtain the user credibility assessment result.

In the social network user credibility assessment method based on a soft interval support vector machine of the invention, profile information and generated content information of users are crawled from a social network, and the users are labeled. The user profile information credibility is calculated from the profile information of each user, and the user generated content information credibility is calculated from the generated content information of each user. The vector formed by these two credibility values is taken as the input of a training sample and the label of the user as the label of the training sample, and the soft interval support vector machine is trained. When the credibility of a user in the social network needs to be assessed, the user profile information credibility and the user generated content information credibility of that user are obtained and input into the trained soft interval support vector machine to obtain the user credibility assessment result.

According to the invention, the soft interval support vector machine moves the user credibility assessment from the one-dimensional space produced by linear summation to a two-dimensional coordinate space. This solves the problem that the assessment results of trusted and malicious users overlap (alias) at the classification threshold, and improves the accuracy of user credibility assessment.

Drawings

FIG. 1 is a flow chart of an embodiment of a social network user credibility assessment method based on a soft interval support vector machine of the present invention;

FIG. 2 is a graph comparing the accuracy of the credibility assessment results of the present invention with that of three comparison methods;

FIG. 3 is a graph comparing the precision of the credibility assessment results of the present invention with that of three comparison methods;

FIG. 4 is a graph comparing the recall of the credibility assessment results of the present invention with that of three comparison methods;

FIG. 5 is a graph comparing the F1 scores of the credibility assessment results of the present invention with those of three comparison methods.

Detailed Description

Embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.

Examples

FIG. 1 is a flowchart of an embodiment of a social network user credibility assessment method based on a soft interval support vector machine. As shown in FIG. 1, the method for evaluating the credibility of the social network user based on the soft interval support vector machine comprises the following specific steps:

S101: acquiring user data:

Profile information and generated content information of N users are crawled from a social network. The profile information of a user comprises the user nickname, the user education level, the user profile (self-introduction) and the user mutual follower count, and the generated content information of a user comprises the number of likes, the number of forwards and the number of comments of the user's blog posts. The users are then labeled: a label flag(i) = 1 indicates that user i is trusted and a label flag(i) = 0 indicates that user i is not trusted, i = 1, 2, …, N.

S102: calculating the credibility of the user profile information:

User profile information reflects the authenticity of a user in the social network and is itself highly credible, so the credibility of the user profile information can be used to assess the credibility of a social network user. For example, the Sina Weibo platform provides a complete personal information system, and each item of personal information is subject to strict format checking when it is filled in, which ensures that the information is authentic and valid. The user information involved comprises 20 types: 14 types of user profile information and 6 types of user generated content information. The user profile information includes: user nickname, UID, gender, birthday, educational background, user profile (self-introduction), URL, profession, company, hometown, follower count, following count, mutual follower count, and interest tags. The user generated content information includes: number of blog posts, number of blog post likes, number of blog post forwards, number of blog post comments, number of blog post tags, and special symbols in blog posts.

Feature attribute data is extracted from the profile information of each user, and then the user profile information credibility $U_P(i)$ is calculated.

The user profile information credibility is divided into an overall credibility and a local credibility. The user profile information integrity represents the overall credibility of the user profile information, the user profile information influence index represents its local credibility, and the user profile information credibility is obtained as the linear sum of the two.

The user profile information integrity is the ratio of the number of personal information tags that the user is willing to disclose to other users in the social network to the total number of tags in the user information integrity evaluation system. The calculation formula of the user profile information integrity UI(i) is thus as follows:

$$UI(i) = \frac{A(i)}{n}$$

wherein A(i) represents the number of personal information tags actually disclosed by user i, and n represents the total number of personal information tags of a user in the social network.

The user profile information influence index is the weighted sum of a limited number of feature items that contribute strongly to the credibility calculated from the user profile information. The feature items in the user profile information are numerous, and selecting too many of them introduces calculation errors and increases the calculation cost, so in this embodiment only the user profile (self-introduction), the user nickname, the user education level and the mutual follower count are selected to represent the user profile information influence index. The calculation formula of the user profile information influence index G(i) of each user is thus as follows:

$$G(i) = \lambda_1 F(i) + \lambda_2 E(i) + \lambda_3 P(i) + \lambda_4 H(i)$$

wherein F(i) represents the nickname type number of user i, $F(i) = 1, 2, \dots, K_F$, and $K_F$ represents the number of user nickname types; E(i) represents the education level of user i, $E(i) = 1, 2, \dots, K_E$, and $K_E$ represents the number of education levels; P(i) represents the profile status of user i, where P(i) = 0 indicates that user i has no profile and P(i) = 1 indicates that user i has a profile; H(i) represents the mutual follower count level of user i, $H(i) = 1, 2, \dots, K_H$, and $K_H$ represents the number of mutual follower count levels; $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weights of the nickname type F(i), the education level E(i), the profile status P(i) and the mutual follower count level H(i), respectively.

The user profile information credibility $U_P(i)$ of each user is then calculated by the following formula:

$$U_P(i) = UI(i) + G(i)$$
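The calculation of the user profile information credibility described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Profile` class, its field names and the sample user are hypothetical, the total tag count n = 14 is taken from the 14 profile information types listed for the example platform, and the weights are the values reported for this embodiment.

```python
from dataclasses import dataclass

# Weights (lambda_1..lambda_4) reported in the embodiment for K_F=2, K_E=4, K_H=6.
LAMBDAS = (0.375, 0.25, 0.25, 0.125)


@dataclass
class Profile:
    """Hypothetical container for the quantified profile features of one user."""
    disclosed_tags: int   # A(i): personal information tags actually disclosed
    nickname_type: int    # F(i) in 1..K_F
    education_level: int  # E(i) in 1..K_E
    has_profile: int      # P(i): 0 = no self-introduction, 1 = has one
    mutual_level: int     # H(i) in 1..K_H


def integrity(p: Profile, total_tags: int) -> float:
    """UI(i) = A(i) / n."""
    return p.disclosed_tags / total_tags


def influence_index(p: Profile, lambdas=LAMBDAS) -> float:
    """G(i) = l1*F(i) + l2*E(i) + l3*P(i) + l4*H(i)."""
    l1, l2, l3, l4 = lambdas
    return (l1 * p.nickname_type + l2 * p.education_level
            + l3 * p.has_profile + l4 * p.mutual_level)


def profile_credibility(p: Profile, total_tags: int = 14, lambdas=LAMBDAS) -> float:
    """U_P(i) = UI(i) + G(i); total_tags=14 is an assumption for illustration."""
    return integrity(p, total_tags) + influence_index(p, lambdas)


# Hypothetical user: 7 of 14 tags disclosed, F=2, E=3, P=1, H=4.
user = Profile(disclosed_tags=7, nickname_type=2, education_level=3,
               has_profile=1, mutual_level=4)
print(profile_credibility(user))  # UI = 0.5, G = 2.25, so U_P = 2.75
```

Note that UI(i) lies in [0, 1] while G(i) can exceed 1, so in this form the influence index dominates the sum unless the feature values are rescaled.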

The weights $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ required by the user profile information influence index must account for the differences in type and magnitude between the different items of information. This embodiment calculates the weights by an information entropy weight distribution method, as follows:

According to the number of values that each of the four feature items in the user profile information influence index can take, a weight distribution judgment matrix $A_G$ of the influence index feature items is constructed:

wherein $K_P = 2$ is the number of values of the user profile status.

The judgment matrix $A_G$ holds the pairwise ratios between the four feature items; each ratio expresses the relative importance of two feature items in the influence index. Eigendecomposition is performed on the judgment matrix $A_G$ to obtain the maximum eigenvalue $\lambda_{max}$, the corresponding eigenvector is normalized, and the normalized vector is used as the weight vector $(\lambda_4, \lambda_3, \lambda_2, \lambda_1)$.

In the present embodiment, let $K_F = 2$, $K_E = 4$ and $K_H = 6$, which gives the judgment matrix $A_G$:

The maximum eigenvalue obtained by eigendecomposition is $\lambda_{max} = 4$, and the consistency ratio CR = 0.0006 is much smaller than 0.1, which satisfies the requirement of the consistency test and indicates that the judgment matrix is reasonable. Normalizing the eigenvector corresponding to $\lambda_{max} = 4$ gives the weights of the feature items: $\lambda_1 = 0.375$, $\lambda_2 = 0.25$, $\lambda_3 = 0.25$, $\lambda_4 = 0.125$.
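The eigenvector step and the consistency test described above can be illustrated as follows. This is a sketch under stated assumptions: the judgment matrix used here is hypothetical, built as a perfectly consistent pairwise-ratio matrix from the ratios 3 : 2 : 2 : 1 of the reported weights, with rows and columns ordered (F, E, P, H). The principal eigenvector is found by power iteration, and the consistency ratio uses the conventional random index value of 0.90 for a 4 × 4 matrix, which is also an assumption about how CR was computed.

```python
def principal_eigen(matrix, iterations=100):
    """Power iteration: return (lambda_max, eigenvector normalized to sum to 1)."""
    n = len(matrix)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n)]
        lam = sum(w)              # v sums to 1, so sum(A v) estimates lambda_max
        v = [x / lam for x in w]  # renormalize so the entries sum to 1
    return lam, v


# Conventional random consistency index values (assumed, not from the source text).
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def consistency_ratio(lam_max, n):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    ci = (lam_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical judgment matrix: entry (j, k) is the importance ratio of item j to item k.
ratios = [3.0, 2.0, 2.0, 1.0]
A_G = [[a / b for b in ratios] for a in ratios]

lam_max, weights = principal_eigen(A_G)
cr = consistency_ratio(lam_max, len(A_G))
print(lam_max, weights, cr)
```

For a perfectly consistent matrix the maximum eigenvalue equals the matrix order (4 here) and CR is 0; a small nonzero CR such as the reported 0.0006 indicates a matrix that is only approximately consistent.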

S103: calculating the credibility of the user generated content information:

The invention defines the user generated content information credibility in terms of the influence breadth and the propagation breadth of the blog posts published by the user, and determines the calculation content and the feature items of the user credibility based on user generated content accordingly. The two quantities characterize user credibility from two different angles, namely the influence and the propagation of user generated content in the social network, so their results are linearly summed to obtain the user generated content information credibility.

The influence breadth of the blog posts published by a user is the degree to which those posts influence other users, and is mainly reflected in how often other users like and comment on the target user's posts. The calculation formula of the influence breadth IU(i) of the blog posts published by each user is as follows:

wherein $M_i$ represents the number of blog posts published by user i, $D_{i,m}$ represents the number of likes of the m-th blog post published by user i, and $C_{i,m}$ represents the number of comments on the m-th blog post published by user i, $m = 1, 2, \dots, M_i$. 1 is added to the denominator of the formula to prevent it from being zero. The larger the value of IU(i), the greater the influence breadth of the user generated content.

The propagation breadth of the blog posts published by a user is how often those posts are browsed by other users, and is mainly measured by the length of the forwarding chain of the user's posts: the longer the forwarding chain of a published post, the wider the propagation of the user generated content. The calculation formula of the propagation breadth CU(i) of the blog posts published by each user is as follows:

wherein $RT_{i,m}$ represents the forwarding chain length of the m-th blog post published by user i.

The user generated content information credibility $U_{ucg}(i)$ of each user is then calculated by the following formula:

$$U_{ucg}(i) = IU(i) + CU(i)$$
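The summation of influence breadth and propagation breadth can be sketched as below. The exact expressions for IU(i) and CU(i) are not available in this text, so the forms used here are assumptions chosen only to match the description: a per-post average of likes plus comments, and a per-post average of forwarding chain lengths, each with 1 added to the denominator. The function names and sample counts are hypothetical.

```python
def influence_breadth(likes, comments):
    """ASSUMED form of IU(i): (sum of D_{i,m} + C_{i,m}) / (M_i + 1)."""
    if len(likes) != len(comments):
        raise ValueError("likes and comments must have one entry per blog post")
    m = len(likes)  # M_i: number of blog posts published by the user
    return (sum(likes) + sum(comments)) / (m + 1)


def propagation_breadth(forward_chain_lengths):
    """ASSUMED form of CU(i): (sum of RT_{i,m}) / (M_i + 1)."""
    m = len(forward_chain_lengths)
    return sum(forward_chain_lengths) / (m + 1)


def ugc_credibility(likes, comments, forward_chain_lengths):
    """U_ucg(i) = IU(i) + CU(i), as stated in the description."""
    return (influence_breadth(likes, comments)
            + propagation_breadth(forward_chain_lengths))


# Hypothetical user with three blog posts.
likes = [3, 5, 0]
comments = [1, 2, 1]
chains = [2, 0, 2]
print(ugc_credibility(likes, comments, chains))  # (8 + 4) / 4 + 4 / 4 = 4.0

# A user with no posts gets 0 rather than a division-by-zero error,
# which is the purpose of adding 1 to the denominator.
print(ugc_credibility([], [], []))  # 0.0
```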

S104: determining a training sample:

The vector $(U_P(i), U_{ucg}(i))$ formed by the user profile information credibility $U_P(i)$ and the user generated content information credibility $U_{ucg}(i)$ of each user is taken as the input of a training sample, and the label flag(i) of the user is taken as the label of the training sample.

S105: training a soft interval support vector machine:

The soft interval support vector machine is adopted as the social network user credibility assessment model and is trained with the training samples obtained in step S104.

The input data of the social network user credibility assessment model of the invention is the two-dimensional data $(U_P(i), U_{ucg}(i))$. The linear discriminant function in the two-dimensional space is $f(x) = W^T x + b$, and the separating hyperplane can be written as $W^T x + b = 0$, where x represents the input, W is the weight vector, b is the classification threshold, and the superscript T denotes the transpose. The classification line is required to classify all samples correctly, i.e. to satisfy the following formula:

$$y_i (W^T x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, N, \quad y_i = \pm 1$$

wherein $W = \{\omega_1, \omega_2, \dots, \omega_d\}$ is the normal vector, which determines the direction of the hyperplane, d is the number of features, and b determines the distance between the hyperplane and the origin. Once W and b are determined, the separating hyperplane is uniquely determined. The distance from the separating hyperplane to any point on the margin hyperplanes on either side of it is $\frac{1}{\|W\|}$.

Specifically, the constraint that all sample points in the support vector machine need to satisfy is expressed algebraically as follows:

The soft interval (soft-margin) support vector machine allows some samples to violate the constraint. Linear inseparability means that some sample points cannot satisfy the condition that the functional margin is greater than or equal to 1, i.e. $1 - y_i (W^T x_i + b) > 0$ for those points. The solution is to introduce a slack (relaxation) variable $\zeta_i$ for each sample point, so that for the sample points that do not meet the constraint the functional margin plus the slack variable is greater than or equal to 1. The constraint then becomes the following:

wherein $\zeta_{0/1}$ denotes the 0/1 loss function, as follows:

On the one hand, to optimize the soft interval support vector machine and improve the assessment accuracy, slack variables are introduced into the constraints and a balance coefficient C is added to the objective function. On the other hand, the hyperplane should keep the margin as large as possible while making the samples that violate the constraint as few as possible, so the balance coefficient C is the coefficient that reconciles the two parts $\frac{1}{2}\|W\|^2$ and $y_i (W^T x_i + b) + \zeta_i \ge 1$. The soft interval support vector machine can then be written as the following formula:

$$\min_{W, b, \zeta} \; \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{N} \zeta_i$$

$$\text{s.t.} \quad y_i (W^T x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, 2, \dots, N$$

wherein C > 0 is called the balance coefficient. The penalty on misclassification increases when C is large and decreases when C is small. The balance coefficient takes the values $C = 10^k$, $k = -3, -2, -1, 0, 1, 2, 3$.

By adopting the soft interval support vector machine, the user credibility assessment is moved from the one-dimensional space produced by linear summation to a two-dimensional coordinate space, which improves the accuracy of user credibility assessment.
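The training step can be illustrated with a small, self-contained sketch. It rests on several assumptions that go beyond the description: the slack variables are eliminated so that the objective becomes $\frac{1}{2}\|W\|^2 + C \sum_i \max(0, 1 - y_i (W^T x_i + b))$, the solver is plain batch subgradient descent rather than whatever solver the inventors used, the labels flag(i) = 1/0 are mapped to $y_i = +1/-1$, and the six training vectors are invented for illustration.

```python
def train_soft_margin_svm(samples, flags, C=1.0, epochs=20000, lr=0.001):
    """Batch subgradient descent on 0.5*||W||^2 + C*sum(max(0, 1 - y*(W.x + b)))."""
    ys = [1.0 if f == 1 else -1.0 for f in flags]  # map flag 1/0 to y = +1/-1
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        g1, g2, gb = w1, w2, 0.0                   # gradient of the 0.5*||W||^2 term
        for (x1, x2), y in zip(samples, ys):
            if y * (w1 * x1 + w2 * x2 + b) < 1.0:  # margin violated: hinge term is active
                g1 -= C * y * x1
                g2 -= C * y * x2
                gb -= C * y
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return (w1, w2), b


def predict(model, x):
    """Return flag 1 (trusted) if W.x + b >= 0, otherwise flag 0 (not trusted)."""
    (w1, w2), b = model
    return 1 if w1 * x[0] + w2 * x[1] + b >= 0 else 0


# Hypothetical training vectors (U_P(i), U_ucg(i)) with labels flag(i).
samples = [(3.0, 4.0), (2.75, 3.5), (3.25, 5.0),   # trusted users
           (1.0, 0.5), (1.5, 1.0), (0.75, 0.25)]   # untrusted users
flags = [1, 1, 1, 0, 0, 0]

model = train_soft_margin_svm(samples, flags, C=1.0)

# Assessing new users: build the same two-dimensional vector and classify it.
print(predict(model, (4.0, 6.0)))    # far on the trusted side of the toy data
print(predict(model, (0.5, 0.25)))   # far on the untrusted side of the toy data
```

To try the balance coefficients $C = 10^k$ mentioned above, the same function could be called once per value and the results compared on held-out users; with this simple solver the learning rate would probably need to be reduced for large C.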

S106: user credibility assessment:

When the credibility of a user in the social network needs to be assessed, the user profile information credibility of the user is calculated by the method of step S102 and the user generated content information credibility of the user is calculated by the method of step S103. The two values are formed into a vector and input into the soft interval support vector machine trained in step S105 to obtain the user credibility assessment result.

To better illustrate the technical effects of the invention, it is experimentally verified with a specific example. In the experimental verification, user data is collected from the social network Sina Weibo, and the credibility assessment results are evaluated with accuracy, precision, recall and F-measure (F1) as evaluation indexes.
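For reference, the four evaluation indexes can be computed from true and predicted labels as sketched below. Treating flag = 1 (trusted) as the positive class is an assumption, and the label lists are invented for illustration; they are not experimental data.

```python
def classification_metrics(true_flags, predicted_flags):
    """Accuracy, precision, recall and F1, with flag 1 as the positive class."""
    pairs = list(zip(true_flags, predicted_flags))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight users: 3 TP, 3 TN, 1 FP, 1 FN.
true_flags = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_flags = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(true_flags, predicted_flags)
print(metrics)  # every index equals 0.75 for this example
```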

To compare the effectiveness and rationality of the invention in user credibility assessment, this embodiment selects three user credibility assessment methods as comparison methods, assesses user credibility with them and with the present invention, and compares the results. Comparison method 1 adopts the method in the document "A. Narayanan, A. Garg, I. Arora, T. Sureka, et al., "IronSense: Towards the Identification of Fake User-Profiles on Twitter Using Machine Learning," 2018 Fourteenth International Conference on Information Processing (ICINPRO), pp. 1-7, Bangalore, India, 2018", which uses machine learning algorithms to quantitatively learn the user profile information in order to assess user credibility. Comparison method 2 adopts the method in the document "H. Slimi, I. Bounhas and Y. Slimani, "URL-Based Tweet Credibility Evaluation," 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pp. 1-6, Abu Dhabi, United Arab Emirates, Nov. 2019", which characterizes user credibility by quantifying user generated content information. Comparison method 3 adopts the method in the document "Q. Sun, N. Wang, Y. Zhou, et al., "Identification of Influential Online Social Network Users Based on Multi-features," International Journal of Pattern Recognition & Artificial Intelligence, vol. 30, no. 6, pp. 1659015.1-1659015.15, 2016", which comprehensively considers multiple types of user information and uses the PageRank algorithm to quantitatively process the user information in order to assess user credibility.

FIG. 2 compares the accuracy of the credibility assessment results of the present invention with that of the three comparison methods, FIG. 3 compares the precision, FIG. 4 compares the recall, and FIG. 5 compares the F1 scores. As can be seen from FIGS. 2 to 5, the present invention assesses user credibility in a two-dimensional plane, which avoids the problem that linear summation easily causes trusted users and malicious users to overlap at the classification threshold. The evaluation indexes are negatively correlated with the number of users: as the number of users keeps increasing, more users fall inside the margin, so the slack becomes larger, the tolerance to noisy data becomes smaller and the error of the assessment results becomes larger, which makes the evaluation indexes show a downward trend. The decreases in accuracy of the three comparison methods and of the present invention are 0.13, 0.14, 0.08 and 0.07 respectively; the method of the present invention shows the smallest decrease in accuracy and therefore better robustness.

While illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments; various changes are protected insofar as they fall within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A social network user credibility assessment method based on a soft interval support vector machine is characterized by comprising the following steps:

2. The social network user credibility assessment method according to claim 1, wherein the user profile information credibility $U_P(i)$ in step S2 is calculated as follows:

the user profile information integrity UI(i) is calculated using the following formula:

$$UI(i) = \frac{A(i)}{n}$$

wherein A(i) represents the number of personal information tags actually disclosed by user i, and n represents the total number of personal information tags of a user in the social network;

the user profile information impact index G (i) is calculated using the following formula:

$$G(i) = \lambda_1 F(i) + \lambda_2 E(i) + \lambda_3 P(i) + \lambda_4 H(i)$$

wherein F(i) represents the nickname type number of user i, $F(i) = 1, 2, \dots, K_F$, and $K_F$ represents the number of user nickname types; E(i) represents the education level of user i, $E(i) = 1, 2, \dots, K_E$, and $K_E$ represents the number of education levels; P(i) represents the profile status of user i, where P(i) = 0 indicates that user i has no profile and P(i) = 1 indicates that user i has a profile; H(i) represents the mutual follower count level of user i, $H(i) = 1, 2, \dots, K_H$, and $K_H$ represents the number of mutual follower count levels; $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are preset weights of the nickname type F(i), the education level E(i), the profile status P(i) and the mutual follower count level H(i), respectively;

the user profile information credibility $U_P(i)$ of each user is then calculated using the following formula:

$$U_P(i) = UI(i) + G(i).$$

3. The social network user credibility assessment method according to claim 2, wherein the weights $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are calculated by an information entropy weight distribution method, as follows:

wherein $K_P = 2$ represents the number of values of the user profile status;

eigendecomposition is performed on the judgment matrix $A_G$ to obtain the maximum eigenvalue $\lambda_{max}$, the corresponding eigenvector is normalized, and the normalized vector is used as the weight vector $(\lambda_4, \lambda_3, \lambda_2, \lambda_1)$.

4. The social network user credibility assessment method according to claim 1, wherein the user generated content information credibility $U_{ucg}(i)$ in step S3 is calculated as follows:

the influence breadth IU(i) of the blog posts published by each user is calculated using the following formula:

wherein $M_i$ represents the number of blog posts published by user i, $D_{i,m}$ represents the number of likes of the m-th blog post published by user i, and $C_{i,m}$ represents the number of comments on the m-th blog post published by user i, $m = 1, 2, \dots, M_i$;

the propagation breadth CU(i) of the blog posts published by each user is calculated using the following formula:

wherein $RT_{i,m}$ represents the forwarding chain length of the m-th blog post published by user i; the user generated content information credibility $U_{ucg}(i)$ of each user is then calculated using the following formula:

$$U_{ucg}(i) = IU(i) + CU(i).$$