CN107085616B

CN107085616B - False comment suspicious site detection method based on multi-dimensional attribute mining in LBSN (location-based social network)

Info

Publication number: CN107085616B
Application number: CN201710397805.8A
Authority: CN
Inventors: 曹玖新; 郭一方; 马卓
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2017-05-31
Filing date: 2017-05-31
Publication date: 2021-03-16
Anticipated expiration: 2037-05-31
Also published as: CN107085616A

Abstract

The invention discloses a method for detecting suspicious places with false comment activity in an LBSN (location-based social network) based on multi-dimensional attribute mining. The method comprises the following steps: first, marking suspicious places that have false comment activity; second, extracting abnormal features, both of the overall comments of a place and of the malicious competition relationship between places, from the place scores, spatio-temporal attributes and comment text of the LBSN; training a logistic regression model to obtain the abnormality degree of each place and the competition degree between each pair of places; then constructing a Markov random field detection model on the competition relationships between places, which fuses the abnormal features of places, the abnormal features of competition relationships and the LBSN network topology; calculating from the detection model the probability that any place is a suspicious place; and finally marking whether the place is a suspicious place with false comment activity. The detection method improves the accuracy of detecting suspicious places with false comment activity.

Description

False comment suspicious site detection method based on multi-dimensional attribute mining in LBSN (location-based social network)

Technical Field

The invention relates to a method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN (location-based social network).

Background

In recent years, with the rapid development of mobile positioning technology and mobile internet technology, location-based social network (LBSN) platforms have achieved great success. An LBSN connects the virtual social space with the real behavior space through location, fusing online and offline relationships: users publish comments on physical places online, and rely on those comments offline to explore and discover new places and to decide which places to visit or patronize. However, the massive information on LBSN platforms contains many false comments, most of which come from organized false comment activities. These activities alter the reputation of a place by posting large numbers of false comments, thereby influencing users' visiting decisions, securing illegitimate benefits for place merchants, damaging the network environment, and seriously harming user experience and the credibility of the network. Identifying and detecting suspicious places with false comment activity is therefore of great practical significance.

Current techniques for detecting merchants with false comment activity mainly target traditional e-commerce websites. There is little research on detecting suspicious places with false comment activity in an LBSN, and none that considers false comment activity arising from competition among place merchants. In a real LBSN, whether a place has false comment activity can be detected from anomalies that its overall comments exhibit in dimensions such as time, space, score and text. Suspicious places whose false comment activity stems from malicious competition can be further uncovered through the competition relationships between places, which improves the accuracy of detecting suspicious places with false comment activity.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method, based on multi-dimensional attribute mining in an LBSN, that can identify and detect suspicious places with false comment activity.

To solve the above technical problem, the invention adopts the following technical scheme: a method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN, which uses the abnormal feature information of places in the LBSN and the competition relationships between places to detect suspicious places, and comprises the following steps:

1) According to the filtered comment information in the LBSN, manually identify false comment activity, mark suspicious places with false comment activity and credible places without it, and divide the places into a training set and a test set; likewise, mark place pairs with malicious competition activity and place pairs without competition, and divide the place pairs into a training set and a test set.

2) Analyze the places with false comment activity, extract abnormal features of the overall comments of a place from the place scores, spatio-temporal attributes and comment text of the LBSN, and construct the abnormal feature set of places.

3) Analyze the competition between places, extract abnormal features of the malicious competition relationship between two places from multiple dimensions of the LBSN, and construct the abnormal feature set of competition relationships between places.

4) Construct an abnormality degree function using the logistic regression machine learning method, and learn the feature weight parameters of the function from the positive and negative examples marked in step 1), to obtain the abnormality degree $\varepsilon_l$ of each place in the data set and the competition abnormality degree $\varepsilon_c$ between places.

5) Construct an LBSN-based Markov random field detection model consisting of nodes and edges, where nodes represent places and edges represent competition relationships between places. Nodes fall into two categories: suspicious places and credible places. Set the prior probability of each node belonging to each category, obtained from the abnormality degree of the place in step 4); set the association degree matrix between places under the different categories, obtained from the competition abnormality degree between the two places in step 4).

6) According to the Markov random field detection model obtained in step 5), set an information value (message) $m_{i \to j}(\sigma_j)$ passed from node $v_i$ to node $v_j$, iteratively propagate the information values over the model, and finally generate for each node $v_i$ a confidence $b_i(\sigma_i)$, which represents the probability that node $v_i$ belongs to class $\sigma_i$ and serves as the marginal probability of node $v_i$ belonging to class $\sigma_i$.

7) Finally, mark whether each place is a suspicious place with false comment activity according to the node confidence obtained in step 6).

The specific method for marking places with false comment activity in the data set in step 1) is as follows: from the comments automatically filtered by the LBSN, select some places with a high proportion of filtered comments and manually mark the false comments among them; mark places whose proportion of false comments exceeds a certain threshold as suspicious places with false comment activity; and randomly select places with no filtered comments and mark them as credible places.
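As an illustration only, the marking rule described above might be sketched in Python as follows. The input layout (`stats`), the threshold value of 0.3, the number of credible places drawn and the function name `label_places` are all assumptions made for this sketch; the text only speaks of "a certain threshold" and random selection.

```python
import random

def label_places(stats, threshold=0.3, n_credible=2, seed=0):
    """Label places from manually marked comments.

    stats maps a place id to a tuple
    (n_comments, n_filtered_comments, n_marked_false_comments).
    The 0.3 threshold and n_credible are illustrative assumptions.
    """
    # Suspicious: share of manually marked false comments exceeds the threshold.
    suspicious = sorted(
        p for p, (n, _, n_false) in stats.items()
        if n > 0 and n_false / n > threshold
    )
    # Credible candidates: places with no filtered comments at all.
    unfiltered = sorted(
        p for p, (n, n_filtered, _) in stats.items()
        if n > 0 and n_filtered == 0
    )
    rng = random.Random(seed)
    credible = rng.sample(unfiltered, min(n_credible, len(unfiltered)))
    return suspicious, sorted(credible)
```

A fixed seed is used only so that the random selection of credible places is reproducible.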

The specific method in step 2) for extracting, from different dimensions, the abnormal features of the overall comments of any place $l$ in the data set is as follows: extract the overall score difference $OSD(l)$ of the place from the score difference dimension, the review burstiness $MRD(l)$ from the time dimension, the check-in period distribution difference $D(r \,\|\, c)$ from the spatio-temporal dimension, and the content similarity $MCS(l)$ from the comment text dimension.

The specific method in step 3) for extracting, from different dimensions, the abnormal features of malicious competition between two competing places $l_m, l_n$ in the data set is as follows: extract the comment difference $URD(l_m, l_n)$ of the common users of the two competing places from the score difference dimension, the comment time cooperativity $ATI(l_m, l_n)$ of the common users from the time dimension, and the content similarity $ACS(l_m, l_n)$ of the common users from the comment text dimension.

The specific method in step 4) for training with the logistic regression machine learning method to obtain the abnormality degree of each place and the abnormality degree of the competition relationship between places comprises the following 3 steps:

a) Construct a feature vector $\vec{x}_L$ from the abnormal feature set of a place; based on the training set of places marked in step 1), learn by gradient descent the weight vector $\vec{\omega}_L$ corresponding to the abnormal feature vector of places.

b) Construct a feature vector $\vec{x}_C$ from the abnormal feature set of the competition relationship between places; based on the training set of competition place pairs marked in step 1), learn by maximum likelihood estimation and gradient descent the weight vector $\vec{\omega}_C$ corresponding to the abnormal feature vector of the competition relationship between places.

c) Compute the abnormality degree $\varepsilon_l$ of all places from the abnormal features and weights of the places, and compute the abnormality degree $\varepsilon_c$ of the competition relationships between all places from the abnormal features and weights of the competition relationships. Both $\varepsilon_l$ and $\varepsilon_c$ are computed as:

$$\varepsilon = \frac{1}{1 + e^{-\vec{\omega} \cdot \vec{x}}}$$

where $\vec{x}$ is the feature vector constructed from the feature set and $\vec{\omega}$ is the feature weight vector corresponding to that feature vector.

The specific method in step 6) for iteratively propagating the information value $m_{i \to j}(\sigma_j)$ based on the detection model is as follows:

$$m_{i \to j}(\sigma_j) = \frac{1}{Z_1} \sum_{\sigma_i \in M} \psi_{ij}(\sigma_i, \sigma_j)\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i) \setminus v_j} m_{k \to i}(\sigma_i)$$

where $M$ is the class set of nodes, $\psi_{ij}(\sigma_i, \sigma_j)$ is the association degree value of node $v_i$ and node $v_j$ under their respective classes $\sigma_i, \sigma_j$, $\phi_i(\sigma_i)$ is the prior probability of the node itself under class $\sigma_i$, $m_{k \to i}(\sigma_i)$ is the information value passed to the node by another neighbor node $v_k$ of $v_i$, $N(v_i)$ is the set of all neighbor nodes of $v_i$, $N(v_i) \setminus v_j$ is the set of all neighbor nodes of $v_i$ except $v_j$, and $Z_1$ is a normalization constant that ensures $\sum_{\sigma_j \in M} m_{i \to j}(\sigma_j) = 1$, i.e., the information values under all classes sum to 1.

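In step 6), the confidence $b_i(\sigma_i)$ of each node $v_i$ under class $\sigma_i$ is calculated as the probability that node $v_i$ belongs to class $\sigma_i$, as follows:

$$b_i(\sigma_i) = \frac{1}{Z_2}\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i)} m_{k \to i}(\sigma_i)$$

where $\phi_i(\sigma_i)$ is the prior probability of node $v_i$ under class $\sigma_i$, $m_{k \to i}(\sigma_i)$ is the information value passed to $v_i$ by its neighbor node $v_k$, $N(v_i)$ is the set of all neighbor nodes of $v_i$, and $Z_2$ is a normalization constant that ensures $\sum_{\sigma_i \in M} b_i(\sigma_i) = 1$, i.e., the confidences of node $v_i$ under all classes sum to 1.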

The invention has the beneficial effects that: according to the abnormal features of the comment of the place in the LBSN expressed in the scoring, time, space and text dimensions, the abnormal features of the place are extracted, the place is classified based on a logistic regression machine learning method, and the suspicious place with false comment activity is effectively detected; introducing competition relations among the sites to improve the detection effect and extracting the abnormal features of the competition among the sites; the abnormal features of the sites and the abnormal features of competition among the sites are fused to jointly act on the detection of the suspicious sites with false comment activities, and the detection performance is improved. In particular, the present invention has the following advantages:

1. abnormal features of the places are extracted by using the abnormal features of the comments of the places in the LBSN in scoring, time, space and text dimensions, the places are classified based on a logistic regression machine learning method, and suspicious places with false comment activities are effectively detected;

2. introducing competition relations among the sites to improve the detection effect, extracting abnormal features of competition among the sites, and deeply mining the sites possibly with false comment activities;

3. the abnormal features of the places and the abnormal features of competition among the places are fused to jointly act on the detection of the false comment activity places, and the detection accuracy is improved.

Drawings

Fig. 1 is a flow chart of the abnormal feature extraction of the present invention.

FIG. 2 is a flow diagram of false comment activity location detection in accordance with the present invention.

Fig. 3 is an overall system framework diagram of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, which is defined in the appended claims, as interpreted by those skilled in the art.

Referring to Fig. 1, Fig. 2 and Fig. 3, the method of the invention for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN includes the following steps:

Step 1: From the comments automatically filtered by the LBSN, select some places with a high proportion of filtered comments and manually mark the false comments among them; mark places whose proportion of false comments exceeds a certain threshold as suspicious places with false comment activity, and randomly select places with no filtered comments and mark them as credible places. Then randomly divide the data into two parts S and T in a 4:1 ratio, where S is the training set and T is the test set.

Starting from the marked suspicious places, select place pairs that have commenting users in common, and take the pairs whose distance is below a certain threshold and whose place label (category) similarity is above a certain threshold as the candidate set of place pairs that may be in competition. By manual marking, mark the candidate pairs in which malicious competition has caused false comment activity as competitive place pairs, and randomly select candidate pairs without malicious competition activity as non-competitive place pairs. Then randomly divide these data into two parts S and T in a 4:1 ratio, where S is the training set and T is the test set.
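The candidate-pair selection and the 4:1 random split of step 1 might be sketched as follows. This is a sketch under stated assumptions: the text does not say how distance or label similarity is measured, so the haversine distance, the Jaccard similarity of label sets, the thresholds `max_km` and `min_sim`, and the data layout of `places` are all hypothetical choices.

```python
import math
import random

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def jaccard(a, b):
    """Jaccard similarity of two label collections (an assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_pairs(places, suspicious, max_km=1.0, min_sim=0.5):
    """places: id -> {"loc": (lat, lon), "tags": [...], "users": set_of_user_ids}.

    Keeps pairs that involve a marked suspicious place, share at least one
    commenting user, lie closer than max_km and have label similarity above min_sim.
    """
    ids = sorted(places)
    pairs = []
    for idx, m in enumerate(ids):
        for n in ids[idx + 1:]:
            pm, pn = places[m], places[n]
            if m not in suspicious and n not in suspicious:
                continue
            if not (pm["users"] & pn["users"]):
                continue
            if (haversine_km(pm["loc"], pn["loc"]) < max_km
                    and jaccard(pm["tags"], pn["tags"]) > min_sim):
                pairs.append((m, n))
    return pairs

def split_4_to_1(items, seed=0):
    """Randomly split items into training set S and test set T in a 4:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]
```

The same `split_4_to_1` helper can serve both the labeled places and the labeled place pairs, since the text applies the same 4:1 random division to each.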

Step 2: Analyze the places with false comment activity, and extract and quantify abnormal features of any place $l$ in the data set from multiple dimensions of the LBSN, such as score, time, space and text.

1) Extract the overall score difference $OSD(l)$ of place $l$ from the score difference dimension:

$$OSD(l) = \mathrm{avg}_{i \in R_l} |d_i|, \quad d_i = r_i(t) - \mathrm{avg}_{t' < t}\, r_i(t') \qquad (1)$$

where $R_l$ is the comment set of place $l$, $t$ is the publication time of a comment $i \in R_l$, $r_i(t)$ is the score of comment $i$ at time $t$, $\mathrm{avg}_{t' < t}\, r_i(t')$ is the average score of place $l$ before time $t$, $d_i$ is the difference between the score of comment $r_i(t)$ and the average score of place $l$ before the comment time, and $OSD(l)$ is the average score difference over all comments of the place.

2) Extract the review burstiness $MRD(l)$ of place $l$ from the time dimension:

$$MRD(l) = \left| \max(n) - \mathrm{avg}(n) \right| \qquad (2)$$

where $n$ is the number of comments received by place $l$ in one day, $\mathrm{avg}(n)$ is the average number of comments per day of place $l$ over the days that have comments, $\max(n)$ is the maximum number of comments of place $l$ in one day, and $MRD(l)$ is the absolute deviation of the maximum daily number of comments of the place.

3) Extract the check-in period distribution difference $D(r \,\|\, c)$ of place $l$ from the spatio-temporal dimension:

$$D(r \,\|\, c) = \sum_{k=1}^{7} r_k \log \frac{r_k}{c_k} \qquad (3)$$

where $k \in \{1, 2, \ldots, 7\}$ indexes the day of the week, $r$ is the distribution vector of the comments on place $l$ over the week, $c$ is the distribution vector of the check-ins at place $l$ over the week, and $D(r \,\|\, c)$ is the KL divergence, which describes the difference between the check-in time distribution and the comment time distribution of the place.

4) Extract the content similarity $MCS(l)$ of place $l$ from the comment text dimension:

$$MCS(l) = \max_{r_i, r_j \in R_l,\; i \neq j} \mathrm{cosine}(r_i, r_j) \qquad (4)$$

where all comment texts of the place (the comment set $R_l$) serve as the corpus space, and $\mathrm{cosine}(r_i, r_j)$ is the TF-IDF-based text cosine similarity of any two comments $r_i, r_j$ of place $l$.

5) Construct the abnormal feature set of places from the feature values extracted for all places in the data set:

$$\Psi_L = \{ f_1^{L}, f_2^{L}, f_3^{L}, f_4^{L} \}$$

where $f_1^{L}$ is the overall score difference $OSD(l)$, $f_2^{L}$ is the review burstiness $MRD(l)$, $f_3^{L}$ is the check-in period distribution difference $D(r \,\|\, c)$, and $f_4^{L}$ is the content similarity $MCS(l)$.
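The first three place features of step 2 might be computed as in the following sketch. It reads the definitions above as: OSD is the mean absolute gap between each score and the average of the earlier scores, MRD is the absolute gap between the maximum and the mean daily comment count, and D(r || c) is the KL divergence between weekday distributions. Skipping the first comment (which has no earlier average), the smoothing constant `eps` and all function names are assumptions of this sketch.

```python
import math
from collections import Counter

def osd(reviews):
    """reviews: list of (timestamp, score).

    Mean of |score - average of all earlier scores|; the first review is
    skipped because no earlier average exists (an assumption).
    """
    reviews = sorted(reviews)
    diffs, running_sum = [], 0.0
    for idx, (_, score) in enumerate(reviews):
        if idx > 0:
            diffs.append(abs(score - running_sum / idx))
        running_sum += score
    return sum(diffs) / len(diffs) if diffs else 0.0

def mrd(review_days):
    """review_days: one day label per review, so only days with reviews count.

    Returns |max daily count - mean daily count|.
    """
    counts = Counter(review_days).values()
    return abs(max(counts) - sum(counts) / len(counts))

def kl_weekly(review_weekdays, checkin_weekdays, eps=1e-9):
    """KL divergence D(r || c) between weekday (0..6) distributions.

    eps is additive smoothing so that empty weekdays do not produce log(0).
    """
    def dist(days):
        counts = Counter(days)
        total = len(days) + 7 * eps
        return [(counts.get(k, 0) + eps) / total for k in range(7)]
    r, c = dist(review_weekdays), dist(checkin_weekdays)
    return sum(rk * math.log(rk / ck) for rk, ck in zip(r, c))
```

A place whose comments cluster on weekdays with few check-ins yields a large divergence, which is the kind of mismatch this feature is meant to expose.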

Step 3: Analyze the competition between places, and extract and quantify abnormal features of competition for any possibly competing place pair $l_m, l_n$ in the data set from multiple dimensions of the LBSN.

1) Extract the comment difference $URD(l_m, l_n)$ of the common users of two competing places $l_m, l_n$ from the score difference dimension:

$$URD(l_m, l_n) = \mathrm{avg}_{i \in U} |d_i|, \quad d_i = r_i(l_m) - r_i(l_n) \qquad (5)$$

where $U$ is the set of common users of places $l_m$ and $l_n$, $r_i(l)$ is the score given by user $i$ to place $l$, and $d_i$ is the difference between the scores given by user $i$ to the two competing places $l_m$ and $l_n$.

2) Extract the comment time cooperativity $ATI(l_m, l_n)$ of the common users of two competing places $l_m, l_n$ from the time dimension:

$$ATI(l_m, l_n) = \mathrm{avg}_{i \in U} \left| T_i(l_m) - T_i(l_n) \right| \qquad (6)$$

where $T_i(l)$ is the comment time of user $i$ for place $l$, and $|T_i(l_m) - T_i(l_n)|$ is the interval between the comment times of user $i$ for the two competing places $l_m$ and $l_n$.

3) Extract the content similarity $ACS(l_m, l_n)$ of the common users of two competing places $l_m, l_n$ from the comment text dimension:

$$ACS(l_m, l_n) = \mathrm{avg}_{r_i, r_j \in R_U,\; i \neq j}\, \mathrm{cosine}(r_i, r_j) \qquad (7)$$

where $R_U$, the set of comments published by the common user set $U$ on the competing places, serves as the corpus space, and $\mathrm{cosine}(r_i, r_j)$ is the TF-IDF-based cosine similarity of comment texts $r_i, r_j$ published by common users on the competing places.

4) Construct the abnormal feature set of competition between places from the feature values extracted for all possibly competing place pairs in the data set:

$$\Psi_C = \{ f_1^{C}, f_2^{C}, f_3^{C} \}$$

where $f_1^{C}$ is the comment difference $URD(l_m, l_n)$, $f_2^{C}$ is the time cooperativity $ATI(l_m, l_n)$, and $f_3^{C}$ is the content similarity $ACS(l_m, l_n)$.
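The three pair features of step 3 might be sketched as follows. URD and ATI follow formulas (5) and (6) directly. For the text feature, the sketch assumes an average pairwise TF-IDF cosine similarity over the comments of common users, with whitespace tokenization and a smoothed IDF; the text does not fix these details, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def urd(scores_m, scores_n):
    """Mean |r_i(l_m) - r_i(l_n)| over users who scored both places."""
    users = scores_m.keys() & scores_n.keys()
    if not users:
        return 0.0
    return sum(abs(scores_m[u] - scores_n[u]) for u in users) / len(users)

def ati(times_m, times_n):
    """Mean |T_i(l_m) - T_i(l_n)| over users who commented on both places."""
    users = times_m.keys() & times_n.keys()
    if not users:
        return 0.0
    return sum(abs(times_m[u] - times_n[u]) for u in users) / len(users)

def tfidf_vectors(texts):
    """Sparse TF-IDF vectors; smoothed IDF and whitespace tokens are assumptions."""
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(w for d in docs for w in d)  # document frequency per word
    n = len(docs)
    return [
        {w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in d.items()}
        for d in docs
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def acs(texts):
    """Average pairwise TF-IDF cosine similarity over the given comment texts."""
    vecs = tfidf_vectors(texts)
    sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims) if sims else 0.0
```

Replacing the average in `acs` with a maximum over the same pairwise similarities gives a maximum-similarity variant of the text feature.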

Step 4: Train on the feature vectors obtained in step 2 and step 3 using the logistic regression machine learning method to obtain the abnormality degree $\varepsilon_l$ of each place and the competition degree $\varepsilon_c$ between two places. The abnormality degree and the competition degree are computed in the same way; taking the calculation of $\varepsilon_l$ as an example, the main steps are as follows:

1) Construct the feature vector of the place from the abnormal feature set $\Psi_L$ of places:

$$\vec{x} = (f_1, f_2, \ldots, f_n)$$

where $f_i$ is the $i$-th feature value in the feature set $\Psi_L$.

2) Set a weight $\omega$ for each feature dimension, and construct for the feature vector $\vec{x}$ the corresponding feature weight vector

$$\vec{\omega} = (\omega_1, \omega_2, \ldots, \omega_n)$$

where the weight $\omega_i$ in the feature weight vector $\vec{\omega}$ represents the importance of the $i$-th feature to the abnormality degree $\varepsilon_l$ of the place.

3) Construct a degree function representing the abnormality degree of the place based on the binomial logistic regression model:

$$\varepsilon_l = \frac{1}{1 + e^{-\vec{\omega} \cdot \vec{x}}}$$

where $\varepsilon_l \in [0, 1]$, and the closer $\varepsilon_l$ is to 1, the higher the abnormality degree of place $l$.

4) Based on the constructed training set of places, learn the function parameters by maximum likelihood estimation and gradient descent to obtain the feature weight vector $\vec{\omega}$.

5) From the abnormal feature vector $\vec{x}$ of any place $l$ in the data set and the feature weight vector $\vec{\omega}$, compute the abnormality degree $\varepsilon_l$ of all places $l$ in the data set.
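Step 4 might be sketched as a plain logistic regression trained by batch gradient descent on the negative log-likelihood. The learning rate, the number of epochs, the zero initialization and the constant 1.0 feature used as a bias term in the usage below are assumptions of this sketch, not details given in the text.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def anomaly_degree(weights, features):
    """Abnormality degree: sigmoid of the weighted feature sum."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def fit_weights(X, y, lr=0.5, epochs=2000):
    """Learn feature weights by maximum likelihood with batch gradient descent.

    X: list of feature vectors; y: list of 0/1 labels (1 = suspicious place
    or competitive pair). lr and epochs are illustrative values.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = anomaly_degree(w, xi) - yi  # gradient of the negative log-likelihood
            for k, xk in enumerate(xi):
                grad[k] += err * xk
        w = [wk - lr * gk / len(X) for wk, gk in zip(w, grad)]
    return w

# Usage with a constant 1.0 prepended to each vector as a bias (an assumption):
X_train = [[1.0, 0.9, 0.8], [1.0, 0.8, 0.9], [1.0, 0.1, 0.2], [1.0, 0.2, 0.1]]
y_train = [1, 1, 0, 0]
weights = fit_weights(X_train, y_train)
eps_l = anomaly_degree(weights, [1.0, 0.7, 0.6])
```

The same two functions can be reused for place pairs by passing the pair feature vectors and the competitive or non-competitive labels.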

Step 5: Construct the LBSN-based Markov random field detection model in the following 3 steps:

1) Construct a network $G = (V, E)$ based on the LBSN and a Markov random field, where $V$ is the node set and $E$ is the set of place-place edges. The edges represent the competition relationships between places and are taken from the candidate set of possibly competing place pairs selected in step 1.

2) For node $v_m$, set $\phi_m(\sigma_m)$ as the prior probability distribution of node $v_m$ under the different classes $\sigma_m$, indicating the likelihood that the place belongs to each category. The abnormality degree $\varepsilon_l$ of the place obtained in step 4 is the prior value of the node under the suspicious place category, and $1 - \varepsilon_l$ is the prior value of the node under the credible place category.

3) For a place-place edge in $E$, set $\psi_{mn}(\sigma_m, \sigma_n)$ as the association degree distribution matrix of node $v_m$ and node $v_n$ under each pair of categories, representing how strongly the categories of competing places are correlated. If the category of node $v_m$ is suspicious place, the competition abnormality degree $\varepsilon_c$ between the places represents the possibility of malicious competition between them, and $1 - \varepsilon_c$ represents the possibility of no malicious competition. If the category of node $v_m$ is credible place, the malicious competition features between the places are not considered, and the association degrees of node $v_m$ with node $v_n$ being a suspicious place and being a credible place are both set to 1/2.
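The priors and association matrices of step 5 might be laid out as in the following sketch. The text does not state which category of $v_n$ receives $\varepsilon_c$ when $v_m$ is suspicious; the sketch assumes that malicious competition points to $v_n$ also being suspicious, so the column order in the first row is an assumption, as are the constant and function names.

```python
SUSPICIOUS, CREDIBLE = 0, 1  # class indices used throughout this sketch

def node_prior(eps_l):
    """Prior over [suspicious, credible] from the place abnormality degree."""
    return [eps_l, 1.0 - eps_l]

def edge_compatibility(eps_c):
    """Association matrix psi[class of v_m][class of v_n] for one competing pair.

    Row 0 (v_m suspicious): eps_c is assigned to 'v_n suspicious' and
    1 - eps_c to 'v_n credible' (assumed column order).
    Row 1 (v_m credible): competition is ignored, both entries are 1/2.
    """
    return [
        [eps_c, 1.0 - eps_c],
        [0.5, 0.5],
    ]
```

Each row sums to 1, so a row can be read as a conditional distribution over the category of the neighboring place.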

Step 6: Calculate the probability that each place is a suspicious place with false comment activity according to the detection model obtained in step 5, specifically as follows:

1) In the model obtained in step 5, set the information value (message) $m_{i \to j}(\sigma_j)$ from any node $v_i$ to node $v_j$. Information values are passed as follows:

$$m_{i \to j}(\sigma_j) = \frac{1}{Z_1} \sum_{\sigma_i \in M} \psi_{ij}(\sigma_i, \sigma_j)\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i) \setminus v_j} m_{k \to i}(\sigma_i)$$

where $M$ is the class set of nodes, $\psi_{ij}(\sigma_i, \sigma_j)$ is the association degree value of nodes $v_i$ and $v_j$ obtained in step 5, $\phi_i(\sigma_i)$ is the prior probability of the node under class $\sigma_i$ obtained in step 5, $m_{k \to i}(\sigma_i)$ is the information value passed to the node by another neighbor node $v_k$ of $v_i$, $N(v_i)$ is the set of all neighbor nodes of node $v_i$, and $Z_1$ is a normalization constant.

2) Initialize all information values to 1.

3) Select some nodes to start the iterative propagation of information values, and keep updating the information values during the process.

4) When the change in every information value between two consecutive updates is smaller than a certain threshold, the class distribution of all nodes has reached a stable state, and the propagation of information values stops.

5) Calculate the confidence $b_i(\sigma_i)$ of each node $v_i$ under class $\sigma_i$ as the probability that node $v_i$ belongs to class $\sigma_i$. The confidence of node $v_i$ is calculated as follows:

$$b_i(\sigma_i) = \frac{1}{Z_2}\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i)} m_{k \to i}(\sigma_i)$$

where $Z_2$ is a normalization constant that ensures $\sum_{\sigma_i \in M} b_i(\sigma_i) = 1$.

Step 7: From the confidence $b_i(\sigma)$ of any node $v_i$ under the suspicious place category $\sigma$ obtained in step 6, select a suitable partition threshold $\delta$ based on the detection results on the test set, and mark the places with $b_i(\sigma) > \delta$ as suspicious places with false comment activity.
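Steps 6 and 7 correspond to loopy belief propagation on a pairwise Markov random field followed by thresholding, and might be sketched as follows. The sketch updates all messages synchronously in each round, whereas the text starts propagation from a subset of nodes; the tolerance, the iteration cap, the edge-matrix orientation and the function names are further assumptions.

```python
def normalize(vec):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(vec)
    return [v / total for v in vec]

def loopy_bp(priors, edges, tol=1e-6, max_iter=100):
    """priors: node -> [p_suspicious, p_credible].
    edges: (m, n) -> 2x2 matrix psi[class_m][class_n] for the edge m-n.
    Returns node -> normalized confidence over the two classes.
    """
    classes = range(2)
    nbrs = {v: set() for v in priors}
    psi = {}
    for (m, n), mat in edges.items():
        nbrs[m].add(n)
        nbrs[n].add(m)
        psi[(m, n)] = mat
        # Reverse direction uses the transpose, indexed [class_n][class_m].
        psi[(n, m)] = [[mat[a][b] for a in classes] for b in classes]
    # All information values start at 1, as in the text.
    msgs = {(i, j): [1.0, 1.0] for i in nbrs for j in nbrs[i]}
    for _ in range(max_iter):
        new = {}
        for (i, j) in msgs:
            out = []
            for sj in classes:
                total = 0.0
                for si in classes:
                    prod = priors[i][si] * psi[(i, j)][si][sj]
                    for k in nbrs[i] - {j}:
                        prod *= msgs[(k, i)][si]
                    total += prod
                out.append(total)
            new[(i, j)] = normalize(out)
        delta = max((abs(a - b) for e in msgs for a, b in zip(msgs[e], new[e])),
                    default=0.0)
        msgs = new
        if delta < tol:  # every message changed by less than the threshold
            break
    beliefs = {}
    for i in priors:
        b = list(priors[i])
        for k in nbrs[i]:
            b = [b[s] * msgs[(k, i)][s] for s in classes]
        beliefs[i] = normalize(b)
    return beliefs

def mark_suspicious(beliefs, delta):
    """Places whose confidence in the suspicious class (index 0) exceeds delta."""
    return sorted(v for v, b in beliefs.items() if b[0] > delta)
```

On a graph without cycles the confidences are exact marginals; on graphs with cycles they are approximations, which is why the stopping rule is based on message stability rather than a fixed number of rounds.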

Claims

1. A method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN, characterized in that false comment suspicious places are detected by using the abnormal features of places in the LBSN and the competition between places, the method comprising the following steps:

1) according to the filtered comment information in the LBSN, manually identifying false comment activity, marking suspicious places with false comment activity and credible places without false comment behavior, and dividing a training set and a test set;

2) analyzing the places with false comment activity, extracting abnormal features of the overall comments of a place from the place scores, spatio-temporal attributes and comment text of the LBSN, and constructing the abnormal feature set of places;

3) analyzing the competition between places, extracting abnormal features of the malicious competition relationship between two places from multiple dimensions of the LBSN, and constructing the abnormal feature set of competition relationships between places;

4) splicing the features in the feature sets obtained in step 2) and step 3) into feature vectors respectively, constructing an abnormality degree function using the logistic regression machine learning method, and learning the feature weight parameters of the function from the positive and negative examples marked in step 1), to obtain the abnormality degree $\varepsilon_l$ of each place in the data set and the competition abnormality degree $\varepsilon_c$ between places;

5) constructing an LBSN-based Markov random field detection model comprising nodes and edges, wherein the nodes represent places and the edges represent competition relationships between places; the nodes fall into two categories, suspicious places and credible places; setting the prior probability of each node belonging to each category, obtained from the abnormality degree of the place in step 4); and setting the association degree matrix between places under the different categories, obtained from the competition abnormality degree between the two places in step 4);

6) according to the Markov random field detection model obtained in step 5), setting an information value $m_{i \to j}(\sigma_j)$ passed from node $v_i$ to node $v_j$, iteratively propagating the information values over the model, and finally generating for each node $v_i$ a confidence $b_i(\sigma_i)$, which represents the probability that node $v_i$ belongs to class $\sigma_i$ and serves as the marginal probability of node $v_i$ belonging to class $\sigma_i$;

7) finally marking whether each place is a suspicious place with false comment activity according to the node confidence obtained in step 6).
2. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 1, wherein the specific method for marking the suspicious places with false comment activity in the data set in step 1) is as follows: according to the comment information automatically filtered in the LBSN, manually marking the false comments therein, and marking suspicious places with false comment activity and credible places accordingly.
3. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 1, wherein in step 2), abnormal features of the overall comments of any place in the data set are extracted from a score difference dimension, a time dimension, a space dimension and a comment text dimension.
4. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 1, wherein in step 3), abnormal features of the competition relationship between two places in the data set are extracted from a score difference dimension, a time dimension and a comment text dimension.
5. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 3 or 4, wherein the specific method in step 4) for obtaining the abnormality degree $\varepsilon_l$ of each place in the data set and the competition abnormality degree $\varepsilon_c$ between places comprises the following 3 steps:

a) splicing the abnormal feature set of a place into a feature vector $\vec{x}_L$; based on the training set of places marked in step 1), learning by gradient descent the weight vector $\vec{\omega}_L$ corresponding to the abnormal feature vector of places;

b) splicing the abnormal feature set of the competition relationship between places into a feature vector $\vec{x}_C$; based on the training set of competition place pairs marked in step 1), learning by maximum likelihood estimation and gradient descent the weight vector $\vec{\omega}_C$ corresponding to the abnormal feature vector of the competition relationship between places;

c) computing the abnormality degree $\varepsilon_l$ of all places from the abnormal features and weights of the places, and computing the abnormality degree $\varepsilon_c$ of the competition relationships between all places from the abnormal features and weights of the competition relationships, both being computed as:

$$\varepsilon = \frac{1}{1 + e^{-\vec{\omega} \cdot \vec{x}}}$$

where $\vec{x}$ is the feature vector and $\vec{\omega}$ is the corresponding feature weight vector.
6. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 5, wherein the specific method in step 6) for iteratively propagating the information value $m_{i \to j}(\sigma_j)$ based on the Markov random field detection model is as follows:

$$m_{i \to j}(\sigma_j) = \frac{1}{Z_1} \sum_{\sigma_i \in M} \psi_{ij}(\sigma_i, \sigma_j)\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i) \setminus v_j} m_{k \to i}(\sigma_i)$$

wherein $M$ is the class set of nodes, $\psi_{ij}(\sigma_i, \sigma_j)$ is the association degree value of node $v_i$ and node $v_j$ under their respective classes $\sigma_i, \sigma_j$, $\phi_i(\sigma_i)$ is the prior probability of node $v_i$ under class $\sigma_i$, $m_{k \to i}(\sigma_i)$ is the information value passed to node $v_i$ under class $\sigma_i$ by another neighbor node $v_k$ of $v_i$, $N(v_i)$ is the set of all neighbor nodes of node $v_i$, $N(v_i) \setminus v_j$ is the set of all neighbor nodes of node $v_i$ except node $v_j$, and $Z_1$ is a normalization constant that ensures $\sum_{\sigma_j \in M} m_{i \to j}(\sigma_j) = 1$, i.e., the information values under all classes sum to 1.
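7. The method for detecting false comment suspicious places based on multi-dimensional attribute mining in an LBSN according to claim 6, wherein in step 6), the confidence $b_i(\sigma_i)$ of each node $v_i$ under class $\sigma_i$ is calculated as the probability that node $v_i$ belongs to class $\sigma_i$, as follows:

$$b_i(\sigma_i) = \frac{1}{Z_2}\, \phi_i(\sigma_i) \prod_{v_k \in N(v_i)} m_{k \to i}(\sigma_i)$$

wherein $\phi_i(\sigma_i)$ is the prior probability of node $v_i$ under class $\sigma_i$, $m_{k \to i}(\sigma_i)$ is the information value passed to node $v_i$ by its neighbor node $v_k$, $N(v_i)$ is the set of all neighbor nodes of node $v_i$, and $Z_2$ is a normalization constant that ensures $\sum_{\sigma_i \in M} b_i(\sigma_i) = 1$, i.e., the confidences of node $v_i$ under all classes sum to 1.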