CN108304479A - Fast density clustering double-layer network recommendation method based on graph structure filtering - Google Patents

Publication number
CN108304479A
Authority
CN
China
Prior art keywords
user, comment, cluster, score, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711469928.4A
Other languages
Chinese (zh)
Other versions
CN108304479B (en)
Inventor
陈晋音
吴洋洋
林翔
俞山青
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201711469928.4A
Publication of CN108304479A
Application granted
Publication of CN108304479B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

A fast density clustering double-layer network recommendation method based on graph structure filtering, comprising the following steps: 1) from historical user comments, TextGAN automatically generates simulated comments that closely resemble genuine samples; these serve as fake comments with accurate class labels; 2) taking the genuine historical comments and the labelled fake simulated comments as input, and given that the generated fake comments closely resemble genuine ones, a graph-based fake-information filter built on user access records is designed, which detects fake users and fake comments by iteratively updating the confidence of users, stores and comments; 3) to address the sparsity of recommendation data, a recommendation method using a double-layer network based on fast density clustering is designed; the method selects its parameters adaptively and obtains good clustering results, yielding a more effective personalized recommendation list for each user and improving recommendation accuracy. The invention uses a generative adversarial network to generate fake samples that closely resemble genuine comment data, and proposes an efficient and reliable fast density clustering double-layer network recommendation method based on graph structure filtering.

Description

Fast density clustering double-layer network recommendation method based on graph structure filtering
Technical field
The invention belongs to the field of information recommendation methods and relates to a fast density clustering double-layer network recommendation method based on graph structure filtering.
Background
With the rapid development of network technology, information exchange has become increasingly frequent, which makes selecting information difficult. When faced with a large amount of information, users cannot extract what is useful; this is the information overload problem, and recommender systems emerged in response. In practice, a recommender system influences users' choices, so some stores, to maximize their own interests, use fake users and fake comments to raise the probability that a target store is recommended and to lower the probability that similar stores are recommended. Effectively filtering fake comments and achieving accurate recommendation is therefore essential.
Recommendation techniques include content-based recommendation, knowledge-based recommendation and collaborative filtering. Content-based and knowledge-based recommendation rely on the content of the objects to be recommended and do not depend on users' ratings of stores. Collaborative filtering finds people with tastes similar to the user's, or stores similar to the user's favourite stores, and recommends accordingly; it works well and is widely used. However, most recommender systems suffer from a sparse user-item association matrix, i.e. users have few ratings or consumption records for items. When searching for users similar to a target user, data sparsity directly affects the accuracy of the recommendation results. Introducing clustering into recommender systems offers a way to address data sparsity: a clustering-based recommender system compresses a large amount of sparse data into a series of dense subsets. Xue et al. cluster users with the K-means algorithm and, for each user, choose the K most similar users in that user's cluster as neighbours; Guo et al. propose a clustering recommendation algorithm that iteratively clusters users according to rating information and community trust relationships. Because the clustering result strongly affects a clustering-based recommendation algorithm, and clustering algorithms commonly suffer from cluster centres that are hard to determine and parameters with poor robustness, recommendation quality is directly affected.
Recommendation methods can largely solve problems such as information overload, but they are easily affected by fake information contained in the database. To reduce the influence of fake information on a recommender system, a filter must be introduced to detect and reject it. Since Jindal et al. first raised the problem of detecting fake users and fake comments, research on fake-information detection has steadily increased. Filters based on supervised learning can detect fake information effectively, but they depend heavily on training data with labelled classes. When the training set is small, their filtering performance is poor.
Summary of the invention
To effectively filter out the influence of fake users and fake information on recommender systems, and to overcome the low efficiency and poor reliability of existing recommendation methods, the present invention provides an efficient and reliable fast density clustering double-layer network recommendation method based on graph structure filtering.
The technical solution adopted by the present invention to solve the technical problem is:
A fast density clustering double-layer network recommendation method based on graph structure filtering, the method comprising the following steps:
1) from historical user comments, a TextGAN-based generator automatically generates simulated comment data to serve as fake comments with accurate class labels; the generated comments closely resemble genuine comments;
2) since the fake comments closely resemble genuine comments, a graph-based fake-information filter is designed from users' access information; it computes the confidence of users and comments and filters out fake information;
3) a recommendation method using a double-layer network based on fast density clustering is designed to obtain each user's personalized recommendation list quickly and efficiently.
Further, in step 1), fake comments are generated with TextGAN: part of the genuine historical comments is taken as input, and TextGAN generates closely similar fake comments to serve as attack data;
The TextGAN-based automatic comment technique generates comment information similar to the input text sentences. Each generated simulated comment is given a rating according to the sentiment its text expresses. The purpose of sentiment analysis is to judge the polarity of each comment from its sentiment words and classify it as positive or negative. Each usage of each word has a positive score Ps and a negative score Ns; subtracting Ns from Ps gives the score Score of that usage of the word:
Score = Ps - Ns (1)
The resulting score of each word lies in [-1, 1]; a usage with a score greater than 0 is considered to have positive polarity, and otherwise negative polarity;
To attach rating information to a comment text, the adjectives, adverbs, verbs and nouns that express sentiment, excluding feature words, are extracted as sentiment words; the scores of all sentiment words in the sentence are summed and, together with the mean rating of the genuine comment samples used to generate the comment, give the final score of the sentence.
Further, in step 2), the user-item network consists of three main parts: user nodes, item nodes and rating information. The graph-structure filter defines a confidence rule for each of these elements and filters out fake user nodes and rating information through repeated iteration;
For any user node u, its confidence is denoted H_u:
where n_u is the number of ratings left by user node u, and the remaining term denotes the confidence of the i-th rating of user u;
To confine the user's confidence to a fixed interval, let
where T(u) ∈ (-1, 1). Since T(u) and H_u are in one-to-one correspondence, and the boundedness of T(u) is better suited to setting the user confidence threshold later in the process, T(u) is finally used as the confidence of user node u;
For any rating v, the confidence H(v) is computed as:
where φ_v is the target item of rating v, R(φ_v) is the confidence of that target item, and A(v) is the influence of user confidence on the confidence of v;
For any item t, its confidence R(t) is computed as:
where
U_t is the set of users who have accessed item t, ψ_v is the specific score value of the rating information between t and r, and α is a threshold parameter that weighs the attribute of the rating information;
First initialize all T(u) and R(t) to 1 and compute the confidence H(v) of every rating; once H(v) has been computed, compute T(u) and R(t) in turn; the next round of H(v) is then computed from the updated T(u) and R(t). After several such iterations T(u), H(v) and R(t) gradually converge, and the algorithm outputs their final values.
Since T(u) and H(v) both take values in (-1, 1), a method is proposed that quickly determines the confidence threshold from the frequency distribution of the confidence values: if the threshold falls in the valley between the two peaks of the frequency distribution, fake users and genuine users can be distinguished effectively, completing the filtering of fake user nodes.
In step 3), the double-layer network recommendation method based on fast density clustering comprises the following steps:
3.1) extract the feature information of users and items, and cluster users and items separately according to their feature information;
3.2) build a double-layer bipartite network model and make the final recommendation according to the network structure and the clustering results.
In step 3.1), an algorithm that quickly determines cluster centres is used.
Definition 1: for any sample point i, the local density ρ_i is computed as:
ρ_i = Σ_j ξ(d_ij - d_c) (7)
where d_ij is the distance between sample point i and sample point j;
Definition 2: for any sample point i, the minimum distance δ_i is the smallest distance from point i among all points whose local density is not lower than that of point i:
δ_i = min(d_ij) (ρ_j ≥ ρ_i) (9)
To address the problems of algorithms that determine cluster centres automatically, a variable γ is introduced, defined as:
γ_i = ρ_i × δ_i (10)
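Definitions 1 and 2 and formula (10) can be sketched as follows. This is a minimal illustration rather than the patented implementation. The definition of ξ is not reproduced in this text, so the sketch assumes the usual density-peak convention ξ(x) = 1 for x < 0 and 0 otherwise; it also assumes Euclidean distance and assigns the largest distance to a point that has no neighbour of equal or higher density. The function name `density_peak_scores` is hypothetical.

```python
import math


def density_peak_scores(points, d_c):
    """Return (rho, delta, gamma) lists for a list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # rho_i: number of other points closer than the cutoff d_c
    # (assumes xi(x) = 1 when x < 0, else 0).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    delta = []
    for i in range(n):
        # Distances to points whose density is at least rho_i, as in formula (9).
        candidates = [dist[i][j] for j in range(n)
                      if j != i and rho[j] >= rho[i]]
        # Assumed convention: with no such point, use the largest distance.
        delta.append(min(candidates) if candidates else max(dist[i]))

    # gamma_i = rho_i * delta_i, formula (10).
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


# Two small star-shaped groups; the hubs sit at index 0 and index 5.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
          (10, 0), (11, 0), (9, 0), (10, 1), (10, -1)]
rho, delta, gamma = density_peak_scores(points, d_c=1.2)
```

In this toy data the two hubs combine a high local density with a large distance to any equally dense point, so their γ values stand far above the rest, which is the property the centre-selection step relies on.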
From the definition of γ, the probability density distribution of γ can be obtained. Its shape resembles a normal distribution, so a confidence interval is computed from the approximate normal curve and the singular points are identified by that interval;
Assume that γ follows a normal distribution with mean μ and standard deviation σ. To determine the mean and standard deviation, first compute the sample mean and the sample variance S; μ and σ then follow from the method of moments:
Further analysis of the γ density profile of a data set shows that the γ values of all data points are non-negative. This indicates that, for an arbitrary data point i, γ does not strictly follow a normal distribution: data points are missing in the interval where γ is negative, which strongly affects the result of formula (11). To obtain accurate values of μ and σ, the data in the missing interval must be completed:
First compute the sample mean; select the sample points within the specified range and update the sample mean; then select the sample points within the updated range and update the sample mean again. Iterate in this way until the sample mean no longer changes or changes only very slightly, which gives the final sample mean. By the symmetry principle, taking the final sample mean as the axis of symmetry, mirror the data of the corresponding interval into (-∞, 0] to make up for the missing data on the negative half-axis of the γ density profile. Then compute the sample variance S from the completed data and use formula (11) to obtain μ and σ;
Once μ and σ have been found, a normal distribution curve is obtained. A confidence interval is then chosen according to the 5σ principle of the normal distribution to find the singular points, as follows:
Set the boundary value Wide = μ + 5σ and compare the γ value of every point in the data set with Wide. For a data point i, if γ_i > Wide, mark i as a cluster centre.
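The final thresholding step can be sketched as below. The sketch covers only the comparison with Wide = μ + 5σ. As a simplifying assumption it estimates μ and σ with the plain sample mean and sample standard deviation, and it omits the iterative mean refinement and the mirrored data completion described above, whose exact ranges are not reproduced in this text. The function name `select_centres` is hypothetical.

```python
import statistics


def select_centres(gamma):
    """Return (indices of cluster centres, Wide) for a list of gamma values."""
    mu = statistics.mean(gamma)      # simplified estimate of mu
    sigma = statistics.stdev(gamma)  # simplified estimate of sigma
    wide = mu + 5 * sigma            # boundary value Wide = mu + 5*sigma
    centres = [i for i, g in enumerate(gamma) if g > wide]
    return centres, wide


# 198 ordinary points with small gamma and two points with very large gamma.
gamma = [0.5, 1.5] * 99 + [100.0, 100.0]
centres, wide = select_centres(gamma)
```

Because the outliers themselves inflate the plain estimates of μ and σ, this simplified version is more conservative than the completed-data estimate in the text; it is meant only to show where the 5σ boundary enters.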
In step 3.2), personalized recommendation for users is carried out with a double-layer bipartite network framework, comprising the following steps:
3.2.1) cluster the users and the items separately to obtain a set of user clusters and a set of item clusters; with user clusters and item clusters as nodes, count the number of accesses between each user cluster and each item cluster to build the first-layer bipartite network; apply a bipartite-network-based recommendation algorithm to this network to obtain a personalized recommendation list for every user cluster;
3.2.2) for each user cluster, choose the top N item clusters in the recommendation list obtained in the previous step; with the users in the user cluster and the items in the chosen item clusters as nodes, and the rating information between users and items as edges, build the second-layer bipartite network; likewise apply a bipartite-network-based recommendation algorithm to the second-layer network to obtain the final personalized recommendation list of each user.
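The two layers described in these steps can be sketched as follows. The text does not specify which bipartite-network recommendation algorithm is applied, so this sketch substitutes a deliberately simple placeholder: item clusters are ranked by access count in layer 1, and items are ranked by the summed ratings of the user's cluster in layer 2. The function name `two_layer_recommend` and the parameters `top_n` and `k` are hypothetical.

```python
from collections import Counter, defaultdict


def two_layer_recommend(ratings, user_cluster, item_cluster, top_n=1, k=2):
    """ratings: {(user, item): score}. Returns {user: [recommended items]}."""
    # Layer 1: nodes are clusters; edge weight = number of accesses
    # between a user cluster and an item cluster.
    layer1 = defaultdict(Counter)
    for (u, i) in ratings:
        layer1[user_cluster[u]][item_cluster[i]] += 1

    recs = {}
    for u, uc in user_cluster.items():
        # Placeholder ranking: the most-accessed item clusters come first.
        chosen = {ic for ic, _ in layer1[uc].most_common(top_n)}

        # Layer 2: users of this cluster and items of the chosen clusters,
        # connected by rating edges. Placeholder score: summed ratings.
        score = Counter()
        for (v, i), s in ratings.items():
            if user_cluster[v] == uc and item_cluster[i] in chosen:
                score[i] += s

        seen = {i for (v, i) in ratings if v == u}
        recs[u] = [i for i, _ in score.most_common() if i not in seen][:k]
    return recs


ratings = {("u1", "a"): 5, ("u1", "b"): 4, ("u2", "a"): 4,
           ("u2", "c"): 5, ("u3", "x"): 3}
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"a": "A", "b": "A", "c": "A", "x": "B"}
recs = two_layer_recommend(ratings, user_cluster, item_cluster)
```

Restricting layer 2 to the users of one cluster and the items of its top N item clusters keeps the second network small and dense, which is how the two-layer design addresses sparsity.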
The technical concept of the present invention is as follows. Goodfellow et al. first proposed the generative adversarial network (GAN) model. The model has achieved considerable success in real-valued spaces but is not effective on discrete data, especially text. To enable generative adversarial networks to handle discrete text data effectively, Zhang et al. proposed the text generative adversarial network (TextGAN). The model consists of a generator and a discriminator; the generator is a recurrent neural network and the discriminator is a convolutional neural network.
In the TextGAN framework, a recurrent neural network serves as the generator and its output is smoothly approximated; a convolutional neural network serves as the discriminator, extracting the most important semantic features for discrimination. The convolutional neural network in this framework consists of one convolutional layer and one max-pooling layer. The max-pooling layer effectively filters out less informative words and extracts the most important features of the sentence.
The objective function of the TextGAN framework differs from that of the standard GAN: it adds a feature-matching optimization term. The iterative optimization alternates between the following two steps:
Minimize:
Minimize:
where Σ_s and Σ_r are the covariance matrices of the genuine feature vectors f_s and of the simulated-sentence feature vectors f_r, respectively, and μ_s and μ_r are the mean vectors of f_s and f_r. The second loss function L_G is the Jensen-Shannon divergence between the two multivariate Gaussian distributions N(μ_r, Σ_r) and N(μ_s, Σ_s).
Clustering-based recommendation: real data are often sparse, which makes the results of conventional recommendation algorithms relatively poor. Introducing clustering into the recommendation algorithm compresses a large amount of sparse data into a series of dense subsets and effectively addresses data sparsity.
Joseph et al. classify users with a topic model, which distinguishes both the type of a user (tourist or driver) and the user's interests; Rana et al. propose a dynamic recommender system that clusters users with an evolutionary algorithm; Wang et al. cluster users with the K-means algorithm, estimate the ratings in the user-store matrix and obtain the preferences of the target user; Puntheeranurak et al. propose a hybrid recommendation algorithm that clusters users with fuzzy K-means; Connor et al. cluster items with a series of partitioning algorithms and compute a predicted value for each subset.
The influence of fake information on recommender systems is increasingly prominent, and its detection has attracted growing attention. Supervised learning is one of the most important techniques for detecting fake information. Jindal et al. use a supervised learning algorithm to detect fake comments from key features of the comments and of the users, treating comments with a high degree of duplication as fake; Li et al. propose a method for detecting fake information based on co-training; Lim et al. use behavioural features to analyse and detect comments; Wang et al. propose a graph-based filtering algorithm that filters fake information according to the relationships among users, stores and access records.
Fake information can strongly affect the results of a recommender system. To improve recommendation accuracy, a filter that removes fake information needs to be added to the recommender system.
The beneficial effects of the present invention are mainly: 1. An automatic comment method based on a text generative adversarial network. The method generates simulated comments from historical comments with a generative adversarial network and produces a rating for each simulated comment through sentiment analysis, yielding a data set with accurate comment class labels. 2. A graph-based filter that quickly determines the node confidence threshold. The filter analyses the graph structure of the comment network between users and items and can effectively delete fake user nodes and fake comment information from the network, improving the accuracy of the recommendation algorithm. 3. A double-layer network recommendation algorithm based on fast density clustering. The algorithm determines cluster centres from the distribution of the local density and minimum distance of the data points, selects its parameters adaptively and produces good clustering results, thereby improving recommendation accuracy.
Description of the drawings
Fig. 1 is a flow chart of the fast density clustering double-layer network recommendation method based on graph structure filtering.
Fig. 2 is the frequency distribution of user confidence for data containing fake information.
Fig. 3 is the basic block diagram of the clustering recommendation method.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 3, a fast density clustering double-layer network recommendation method based on graph structure filtering comprises the following steps:
1) from historical user comments, TextGAN automatically generates simulated comment data to serve as fake comments with accurate class labels;
2) taking the genuine historical comments and the labelled fake simulated comments as input, a graph-based fake-information filter is designed to extract the genuine comment information;
3) a recommendation algorithm using a double-layer network based on fast density clustering is designed to obtain each user's personalized recommendation list.
In step 1), fake comments are generated with TextGAN: genuine historical comments with higher ratings are taken as input and TextGAN generates fake comments with higher ratings; likewise, genuine historical comments with lower ratings are taken as input to generate fake comments with lower ratings. The generated fake comments serve two purposes: (1) higher-rated comments increase the probability that the target store is recommended; (2) lower-rated fake comments reduce the probability that the recommender system recommends stores similar to the target store.
Following the TextGAN model, comments on restaurants in the Yelp data set are used as input data to generate fake comments similar to the input comments. In the Yelp data set, a comment is usually accompanied by the user's rating of the item. The rating is an integer from 1 to 5, and when processing the data, 3 can be taken as the boundary for judging the user's sentiment towards the item: a rating above 3 indicates a positive sentiment, and otherwise a negative one. In this way the sentiment polarity of user comments can be screened effectively, dividing them roughly into positive and negative comments and labelling them accordingly.
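The labelling rule in the preceding paragraph is small enough to state directly. In this sketch the function name `polarity` and the label strings are hypothetical; the boundary value 3 is the one given in the text.

```python
def polarity(rating):
    """Label a 1-5 rating: above 3 is positive, anything else is negative."""
    return "positive" if rating > 3 else "negative"


# Split a few (comment, rating) pairs into the two training pools.
samples = [("Very nice and clean place", 5), ("Slow service", 2), ("It was ok", 3)]
positive_pool = [text for text, r in samples if polarity(r) == "positive"]
negative_pool = [text for text, r in samples if polarity(r) == "negative"]
```

Note that under this rule a rating of exactly 3 falls into the negative pool.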
The TextGAN-based automatic comment technique generates comment information similar to the input text sentences, with the same sentiment polarity as the input. To generate positive comments, a large number of positive comments from the Yelp data set are used as input to train the model, and fake comments are produced from the model output. Genuine input samples are a large number of text sentences such as "Very nice and clean place to have breakfast or lunch". Generated fake comments include "Great food and service." and "It was amazing. I am a fan, and the service was really great.".
Each generated simulated comment is given a rating according to the sentiment its text expresses. The purpose of sentiment analysis is to judge the polarity of each comment from its sentiment words and classify it as positive or negative. There are many methods of sentiment analysis; SentiWordNet is chosen here. SentiWordNet is a large lexical resource consisting of one large text file that lists the usages and scores of every word in the dictionary. Each usage of each word has a positive score Ps and a negative score Ns; subtracting Ns from Ps gives the score Score of that usage of the word:
Score = Ps - Ns (14)
The resulting score of each word lies in [-1, 1]; a usage with a score greater than 0 is considered to have positive polarity, and otherwise negative polarity.
To attach rating information to a comment text, the adjectives, adverbs, verbs and nouns that express sentiment, excluding feature words, are extracted as sentiment words; the scores of all sentiment words in the sentence are summed and, together with the mean rating of the genuine comment samples used to generate the comment, give the final score of the sentence.
Sentiment analysis of the TextGAN-generated fake comments listed above proceeds as follows:
"Great food and service.": "food" and "service" are treated as feature words, and their influence on the sentiment of the sentence is not considered. According to the sentiment analysis above, "Great" has Score = 0.25 in the sentiment dictionary, and the mean rating of the genuine input comment samples is 4, so the rating of the fake comment is 4.25.
"It was amazing. I am a fan, and the service was really great.": "service" is treated as a feature word. According to the sentiment analysis, "amazing" has Score = 0.15, "great" has Score = 0.25 and "really" has Score = 0.375, and the mean rating of the genuine input comment samples is 4, so the rating of the fake comment is 4.75.
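The first worked example can be reproduced with a short sketch. It assumes that the final rating is the mean rating of the genuine input samples plus the sum of the sentiment-word scores, which is what the first example implies. The lexicon below is a hypothetical one-entry stand-in for SentiWordNet with one (Ps, Ns) pair per word rather than per usage, and the function name `comment_score` is also hypothetical.

```python
def comment_score(tokens, lexicon, feature_words, mean_rating):
    """Mean rating of the genuine samples plus the summed scores Ps - Ns."""
    total = 0.0
    for token in tokens:
        word = token.lower()
        # Feature words and words without a lexicon entry contribute nothing.
        if word in feature_words or word not in lexicon:
            continue
        ps, ns = lexicon[word]
        total += ps - ns  # Score = Ps - Ns
    return mean_rating + total


# Hypothetical (Ps, Ns) values chosen so that Ps - Ns = 0.25, as in the text.
lexicon = {"great": (0.25, 0.0)}
tokens = ["Great", "food", "and", "service"]
score = comment_score(tokens, lexicon, {"food", "service"}, mean_rating=4.0)
```

A real pipeline would also need part-of-speech tagging and word-sense selection to pick the right SentiWordNet entry; both are outside this sketch.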
In the step 2), Wang etc. has been put forward for the first time the filter based on graph structure in.The algorithm is to user and item Cyberrelationship between mesh is analyzed, by simply iterating to calculate the confidence level of all user nodes, final filtration confidence level Lower Virtual User node improves the anti-interference ability of network recommendation algorithm.However, this algorithm cannot effectively choose user The confidence threshold value of node, filter effect are affected by data set.User's confidence level can be quickly determined this paper presents a kind of The method of threshold value can effectively delete the Virtual User node in network, and improve the accuracy of proposed algorithm.
The network of user and project is mainly made of three parts:User node, item nodes and score information.It is tied based on figure The filter of structure is that corresponding confidence calculations rule is arranged in these elements, and virtual use is filtered out by the method for successive ignition Family node and score information.
For any one user node u, confidence level HuIt indicates:
Wherein nuIndicate the score information number that user node u leaves,Indicate the confidence of i-th scoring of user u Degree.
For the confidence level of user to be limited in certain section, enable
Wherein T (u) ∈ (- 1,1).Due to T (u) and HuBetween relationship correspond, and the boundedness of T (u) is more suitable for The setting of user's confidence threshold value in subsequent process, therefore finally use the confidence level of T (u) expression user nodes u.
For arbitrary score information v, the calculation formula of confidence level H (v) is:
Wherein φvIndicate the destination item of comment v, R (φv) indicate that the confidence level of the destination item, A (v) then indicate user Influence of the confidence level to v confidence levels.
For arbitrary project t, the calculation formula of its confidence level R (t) is:
Wherein
UtIndicate that the user for accessing project t gathers, ψvThe specific score value of score information between expression t and r, α are to weigh to comment Divide the threshold parameter of information attribute.
According to foregoing description, it is seen that T (u), H (v) and R (t) communication with one another are close.It under normal circumstances, can be first initial It is 1 to change all T (u) and R (t), calculates the confidence level H (v) of every score information;After H (v) is calculated, then by suitable Sequence calculates T (u) and R (t) successively;Then the H (v) of next round can be calculated according to updated T (u) and R (t), such iteration is more After secondary, T (u), H (v) and R (t) will gradually restrain stabilization, and algorithm will export T (u), the end value of H (v) and R (t).
Because T(u) and H(v) both take values in (-1, 1), the original graph-based filter generally uses 0 directly as the threshold on T(u) and H(v) for judging the authenticity of u and v. However, this produces markedly different filtering results on different datasets, which narrows the applicability of the algorithm. To remove this drawback, we propose a method that quickly determines the confidence threshold from the frequency distribution of the confidence values.
Take user nodes as an example. Because the comments that virtual user nodes leave on items are generally targeted and repetitive, the confidence values of virtual user nodes do not differ much from one another; that is, they concentrate in some subinterval of (-1, 1), while the confidence of real user nodes should generally be higher than that of virtual nodes. In fact, subsequent experimental analysis shows that the confidence of a large number of real user nodes is close to 1. The frequency distribution of user-node confidence therefore takes the bimodal form shown in Fig. 2. If the confidence threshold falls exactly in the valley between the two peaks, virtual and real users can be distinguished effectively, which completes the filtering of virtual user nodes.
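A minimal sketch of picking the threshold at the valley between the two peaks is given below. The text does not say how the peaks and the valley are located, so the peak-finding heuristic here (highest smoothed bin, then the bin maximizing height times squared distance from it) is an assumption, as are the function name `valley_threshold` and the parameters `bins` and `smooth`.

```python
import numpy as np


def valley_threshold(conf, bins=20, smooth=3):
    """Return a threshold in (-1, 1) at the valley of a bimodal histogram."""
    counts, edges = np.histogram(conf, bins=bins, range=(-1.0, 1.0))
    # Light moving-average smoothing so isolated empty bins do not dominate.
    sm = np.convolve(counts, np.ones(smooth) / smooth, mode="same")
    centers = (edges[:-1] + edges[1:]) / 2.0
    idx = np.arange(bins)
    p1 = int(np.argmax(sm))                     # first peak: tallest bin
    p2 = int(np.argmax(sm * (idx - p1) ** 2))   # second peak: tall and far from p1
    lo, hi = sorted((p1, p2))
    valley = lo + int(np.argmin(sm[lo:hi + 1]))  # lowest bin between the peaks
    return float(centers[valley])
```

Users whose confidence T(u) falls below the returned value would then be treated as virtual users and removed.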
In step 3), two main problems of clustering-based recommendation algorithms must be solved: the typical representatives (cluster centers) of users or items have to be determined manually, and similar users need personalized recommendations. A double-layer network recommendation method based on fast density clustering is therefore used. Its basic framework is shown in Fig. 3, and it is carried out in two steps:
3.1) extract the feature information of users and items, and cluster users and items separately according to their feature information;
3.2) build the double-layer bipartite network model, and make the final recommendation according to the network structure and the clustering results.
In step 3.1), social network data is often highly sparse, and because it also has many nodes, with nodes added and updated in real time, conventional recommendation algorithms incur high time complexity and recommend poorly on such data. A clustering-based recommendation algorithm can compress a large amount of sparse data into a series of dense subsets, which both improves the recommendation quality and reduces the time complexity of the algorithm.
Rodriguez et al. proposed an algorithm that determines cluster centers automatically. That algorithm assumes that a cluster center has a high density and is also relatively far from any point of higher density. However, this clustering algorithm has two drawbacks: the cluster centers cannot be determined fully automatically, and the density radius directly affects the clustering result. Building on this idea, we propose an algorithm that determines cluster centers quickly and effectively solves the two problems above.
Definition 1 (local density): for any sample point i, its local density $\rho_i$ is calculated as:
$\rho_i = \sum_{j} \xi(d_{ij} - d_c)$ (20)
where $d_{ij}$ denotes the distance between sample point i and sample point j.
Definition 2 (minimum distance): for any sample point i, its minimum distance $\delta_i$ is the smallest distance from point i to any point whose local density is greater than that of i.
$\delta_i = \min_{j:\ \rho_j \ge \rho_i} (d_{ij})$ (22)
Considering the problems of the algorithm that determines cluster centers automatically, we introduce the variable γ, defined as:
$\gamma_i = \rho_i \times \delta_i$ (23)
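The three quantities can be computed as in the sketch below. It makes several assumptions that the text does not spell out: ξ is taken to be the cutoff indicator (1 when its argument is negative, 0 otherwise), $d_c$ is treated as a user-supplied density radius, distances are Euclidean, points of strictly higher density are used for δ, and the densest points receive their largest distance to any other point as δ. The function name `density_peaks_gamma` is hypothetical.

```python
import numpy as np


def density_peaks_gamma(X, dc):
    """X: (n, d) array of samples; dc: assumed cutoff (density radius)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distance matrix d[i, j].
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: number of other points closer than dc (self excluded).
    rho = (d < dc).sum(axis=1) - 1
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]            # points with strictly higher density
        if higher.any():
            delta[i] = d[i, higher].min()
        else:
            delta[i] = d[i].max()        # assumed convention for the densest points
    gamma = rho * delta
    return rho, delta, gamma
```

Points that are both dense and far from any denser point obtain a large γ, which is what makes γ usable as a single score for ranking cluster-center candidates.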
From the definition of γ, the probability density distribution of γ is obtained; its shape turns out to resemble a normal distribution. A confidence interval is computed from the approximating normal curve, and outliers (singular points) are identified by means of this confidence interval.
Assume that γ follows a normal distribution with mean μ and standard deviation σ. To determine the mean and standard deviation, the sample mean and the sample variance S are computed first; the method of moments then gives:
Further analysis of the γ density distribution of a dataset shows that the γ values of all data points are non-negative. This means that, for any data point i, the distribution of γ is not strictly normal: data points are missing on the interval where γ is negative, which strongly affects the result of formula (24). To obtain accurate values of μ and σ, the data in the missing interval must be completed:
First compute the sample mean, select the sample points within the given range, and update the sample mean; then select the sample points within the updated range and update the sample mean again. Iterate in this way until the sample mean no longer changes or changes only negligibly, which gives the final sample mean. By symmetry, with the final sample mean as the axis of symmetry, the data in the corresponding interval are mirrored into (-∞, 0], which makes up for the missing data on the negative half-axis of the γ density distribution. The sample variance S is then computed from the completed data, and formula (24) gives μ and σ.
Once μ and σ have been found, a normal distribution curve is obtained, and the confidence interval is chosen according to the 5σ rule of the normal distribution in order to find the outliers. Specifically:
Set the boundary value Wide = μ + 5σ and compare the γ value of every point in the dataset with Wide. For a data point i, if $\gamma_i$ > Wide, mark i as a cluster center.
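The selection rule Wide = μ + 5σ can be sketched as follows. The exact selection range and mirrored interval are not recoverable from the text, so this sketch substitutes its own assumptions: samples are trimmed to at most twice the current mean until the mean stabilizes, and the retained samples at or above the final mean are reflected about it to build a symmetric sample from which μ and σ are estimated. The name `select_centers` and the parameters `k_sigma`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np


def select_centers(gamma, k_sigma=5.0, tol=1e-9, max_iter=100):
    """Return (indices of cluster centers, boundary value Wide)."""
    g = np.asarray(gamma, dtype=float)
    mean = g.mean()
    for _ in range(max_iter):
        kept = g[g <= 2.0 * mean]        # assumed selection range [0, 2 * mean]
        new_mean = kept.mean()
        converged = abs(new_mean - mean) <= tol
        mean = new_mean
        if converged:
            break
    # Assumed symmetric completion: reflect the upper half about the final mean.
    upper = kept[kept >= mean]
    completed = np.concatenate([upper, 2.0 * mean - upper])
    mu = completed.mean()
    sigma = np.sqrt(completed.var())     # moment estimate of the standard deviation
    wide = mu + k_sigma * sigma          # boundary value Wide = mu + 5 * sigma
    return np.flatnonzero(g > wide), wide
```

Because the trimming step discards very large γ values before μ and σ are estimated, a few extreme points do not inflate σ, and they remain above the boundary and are marked as centers.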
In step 3.2), the double-layer bipartite network framework is used to make personalized recommendations for users, in the following steps:
3.2.1) Cluster users and items separately to obtain the set of user clusters and the set of item clusters. With user clusters and item clusters as nodes, count the number of accesses between each user cluster and each item cluster to build the first-layer bipartite network. Apply a bipartite-network-based recommendation algorithm to this network to obtain a personalized recommendation list for every user cluster.
3.2.2) For each user cluster, select the top N item clusters in the personalized recommendation list obtained in the previous step. With the users in the user cluster and the items in the selected item clusters as nodes, and the rating information between users and items as edges, build the second-layer bipartite network. Likewise, apply a bipartite-network-based recommendation algorithm to the second-layer network to finally obtain the personalized recommendation list of each user.
The double-layer network structure effectively reduces the complexity of the original bipartite network and improves the running efficiency of the recommendation algorithm. In addition, the two layers build their edges differently: the number of access records between users and items is first used to find the item clusters most relevant to each user cluster, and the specific rating information is then used to make personalized recommendations to each user, which gives the recommendation algorithm higher accuracy.
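The two layers can be illustrated with the sketch below. The bipartite-network recommendation algorithm itself is not specified in the text, so both layers use deliberately simple stand-ins that are assumptions of this sketch: layer 1 ranks item clusters by access count, and layer 2 ranks unseen items by the summed scores given by other users of the same cluster. The names `two_layer_recommend`, `top_n` and `k` are hypothetical.

```python
from collections import Counter, defaultdict


def two_layer_recommend(ratings, user_cluster, item_cluster, top_n=1, k=3):
    """ratings: list of (user, item, score); *_cluster: dict mapping id -> cluster id."""
    # Layer 1: access counts between user clusters and item clusters.
    layer1 = defaultdict(Counter)
    for u, t, _ in ratings:
        layer1[user_cluster[u]][item_cluster[t]] += 1
    top_clusters = {
        uc: {ic for ic, _ in counts.most_common(top_n)}
        for uc, counts in layer1.items()
    }

    # Items each user has already accessed are not recommended again.
    seen = defaultdict(set)
    for u, t, _ in ratings:
        seen[u].add(t)

    # Layer 2: users of one cluster versus items of its selected item clusters,
    # with rating scores as edge weights (simple stand-in scorer).
    recs = {}
    for u, uc in user_cluster.items():
        scores = Counter()
        for v, t, s in ratings:
            if (v != u and user_cluster[v] == uc
                    and item_cluster[t] in top_clusters.get(uc, set())
                    and t not in seen[u]):
                scores[t] += s
        recs[u] = [t for t, _ in scores.most_common(k)]
    return recs
```

The second layer only ever touches users of one cluster and items of the selected clusters, which is where the reduction in network size described above comes from.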

Claims (7)

1. A fast density clustering double-layer network recommendation method based on graph structure filtering, characterized in that the method comprises the following steps:
1) based on historical user comment information, automatically generating simulated comment data with TextGAN to serve as accurately labeled fake comments, the generated comment data closely resembling real comments and being difficult to detect with traditional fake-comment filtering methods;
2) considering that comment data generated by machine learning methods is difficult to filter with conventional methods alone, designing a graph-based virtual information filter that filters the data contaminated with fake users according to the behavioral features of users;
3) designing a recommendation method based on a fast density clustering double-layer network to effectively obtain the personalized recommendation list of each user.
2. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 1, characterized in that, in step 1), TextGAN-based virtual comment generation takes part of the real historical comments as input and uses TextGAN to generate virtual comments that closely resemble the real samples;
The TextGAN-based automatic comment technique generates comment information similar to the input from the input text sentences; the generated simulated comments are given different ratings according to the sentiment expressed in the text; the purpose of sentiment analysis is to judge the tendency of each comment from the sentiment words it contains and to classify it as positive or negative; each usage of each word has a corresponding positive score and negative score, and subtracting the negative score Ns from the positive score Ps gives the score Score of that usage of the word:
$Score = Ps - Ns$ (1)
The final score of each word lies in [-1, 1]; a value greater than 0 is taken to mean that this usage of the word has a positive tendency, and otherwise a negative tendency;
To add rating information to the comment text, adjectives, adverbs, verbs and nouns that can express sentiment, apart from feature words, are extracted as sentiment words; the scores of all sentiment words in the sentence are summed and, taking into account the mean rating of the real comment samples from which the comment was generated, the final score of the sentence is calculated.
3. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 1 or 2, characterized in that, in step 2), the network of users and items consists mainly of three parts: user nodes, item nodes and rating information; the graph-structure-based filter assigns a confidence calculation rule to each of these elements and, through repeated iteration, filters out virtual user nodes and rating information;
For any user node u, its confidence $H_u$ is expressed as:
where $n_u$ denotes the number of ratings left by user node u; the expression also involves the confidence of the i-th rating of user u;
To restrict the user confidence to a bounded interval, let
where T(u) ∈ (-1, 1); because T(u) and $H_u$ are in one-to-one correspondence, and the boundedness of T(u) makes it better suited to setting the user confidence threshold in later steps, T(u) is ultimately used as the confidence of user node u;
For any rating v, its confidence H(v) is calculated as:
where $\phi_v$ denotes the target item of rating v, $R(\phi_v)$ denotes the confidence of that target item, and A(v) denotes the influence of user confidence on the confidence of v;
For any item t, its confidence R(t) is calculated as:
where
$U_t$ denotes the set of users who have accessed item t, $\psi_v$ denotes the specific score of the rating between t and r, and α is a threshold parameter that weighs the attributes of the rating;
All T(u) and R(t) are first initialized to 1 and the confidence H(v) of every rating is computed; once H(v) has been computed, T(u) and R(t) are computed in turn; the next round of H(v) is then computed from the updated T(u) and R(t); after several such iterations, T(u), H(v) and R(t) gradually converge, and the algorithm outputs their final values.
4. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 3, characterized in that T(u) and H(v) take values in (-1, 1), and the confidence threshold is determined quickly from the frequency distribution of the confidence values: if the chosen confidence threshold falls exactly in the valley between the two peaks of the frequency distribution, virtual users and real users can be distinguished effectively, which completes the filtering of virtual user nodes.
5. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 1 or 2, characterized in that, in step 3), a double-layer network recommendation method based on fast density clustering is used, which specifically comprises the following steps:
5.1) extracting the feature information of users and items, and clustering users and items separately according to their feature information;
5.2) building the double-layer bipartite network model, and making the final recommendation according to the network structure and the clustering results.
6. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 5, characterized in that, in step 3), an algorithm that determines cluster centers quickly is used, which specifically comprises the following steps:
Definition 1: for any sample point i, its local density $\rho_i$ is calculated as:
$\rho_i = \sum_{j} \xi(d_{ij} - d_c)$ (7)
where $d_{ij}$ denotes the distance between sample point i and sample point j;
Definition 2: for any sample point i, its minimum distance $\delta_i$ is the smallest distance from point i to any point whose local density is greater than that of i.
$\delta_i = \min_{j:\ \rho_j \ge \rho_i} (d_{ij})$ (9)
Considering the problems of the algorithm that determines cluster centers automatically, the variable γ is introduced, defined as:
$\gamma_i = \rho_i \times \delta_i$ (10)
From the definition of γ, the probability density distribution of γ is obtained; its shape resembles a normal distribution; a confidence interval is computed from the approximating normal curve, and outliers (singular points) are identified by means of this confidence interval;
Assume that γ follows a normal distribution with mean μ and standard deviation σ; to determine the mean and standard deviation, the sample mean and the sample variance S are computed first; the method of moments then gives:
Further analysis of the γ density distribution of a dataset shows that the γ values of all data points are non-negative, which means that, for any data point i, the distribution of γ is not strictly normal: data points are missing on the interval where γ is negative, which strongly affects the result of formula (11); to obtain accurate values of μ and σ, the data in the missing interval must be completed:
First compute the sample mean, select the sample points within the given range, and update the sample mean; then select the sample points within the updated range and update the sample mean again; iterate in this way until the sample mean no longer changes or changes only negligibly, which gives the final sample mean; by symmetry, with the final sample mean as the axis of symmetry, the data in the corresponding interval are mirrored into (-∞, 0], which makes up for the missing data on the negative half-axis of the γ density distribution; the sample variance S is then computed from the completed data, and formula (11) gives μ and σ;
Once μ and σ have been found, a normal distribution curve is obtained, and the confidence interval is chosen according to the 5σ rule of the normal distribution in order to find the outliers, as follows:
Set the boundary value Wide = μ + 5σ and compare the γ value of every point in the dataset with Wide; for a data point i, if $\gamma_i$ > Wide, mark i as a cluster center.
7. The fast density clustering double-layer network recommendation method based on graph structure filtering according to claim 6, characterized in that, in step 3), the double-layer bipartite network framework is used to make personalized recommendations for users, which specifically comprises the following steps:
7.1) clustering users and items separately to obtain the set of user clusters and the set of item clusters; with user clusters and item clusters as nodes, counting the number of accesses between each user cluster and each item cluster to build the first-layer bipartite network; applying a bipartite-network-based recommendation algorithm to this network to obtain a personalized recommendation list for every user cluster;
7.2) for each user cluster, selecting the top N item clusters in the personalized recommendation list obtained in the previous step; with the users in the user cluster and the items in the selected item clusters as nodes, and the rating information between users and items as edges, building the second-layer bipartite network; likewise, applying a bipartite-network-based recommendation algorithm to the second-layer network to finally obtain the personalized recommendation list of each user.
CN201711469928.4A 2017-12-29 2017-12-29 Quick density clustering double-layer network recommendation method based on graph structure filtering Active CN108304479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711469928.4A CN108304479B (en) 2017-12-29 2017-12-29 Quick density clustering double-layer network recommendation method based on graph structure filtering

Publications (2)

Publication Number Publication Date
CN108304479A true CN108304479A (en) 2018-07-20
CN108304479B CN108304479B (en) 2022-05-03

Family

ID=62868047

Country Status (1)

Country Link
CN (1) CN108304479B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016062095A1 (en) * 2014-10-24 2016-04-28 华为技术有限公司 Video classification method and apparatus
CN107506480A (en) * 2017-09-13 2017-12-22 浙江工业大学 A kind of excavated based on comment recommends method with the double-deck graph structure of Density Clustering

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUAN WANG等: "Review Graph based Online Store Review Spammer Detection", 《IEEE》 *
JINYIN CHEN等: "Double Layered Recommendation Algorithm Based on Fast Density Clustering: Case Study on Yelp Social Networks Dataset", 《IEEE》 *
YIZHE ZHANG等: "Generating Text via Adversarial Training", 《IEEE》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508740A (en) * 2018-11-09 2019-03-22 郑州轻工业学院 Object hardness identification method based on Gaussian mixed noise production confrontation network
CN109508740B (en) * 2018-11-09 2019-08-13 郑州轻工业学院 Object hardness identification method based on Gaussian mixed noise production confrontation network
CN112989179A (en) * 2019-12-13 2021-06-18 北京达佳互联信息技术有限公司 Model training and multimedia content recommendation method and device
CN112989179B (en) * 2019-12-13 2023-07-28 北京达佳互联信息技术有限公司 Model training and multimedia content recommendation method and device
CN111783980A (en) * 2020-06-28 2020-10-16 大连理工大学 Ranking learning method based on dual cooperation generation type countermeasure network
CN112950295A (en) * 2021-04-21 2021-06-11 北京大米科技有限公司 User data mining method and device, readable storage medium and electronic equipment
CN112950295B (en) * 2021-04-21 2024-03-19 北京大米科技有限公司 Method and device for mining user data, readable storage medium and electronic equipment
CN114241263A (en) * 2021-12-17 2022-03-25 电子科技大学 Radar interference semi-supervised open set identification system based on generation countermeasure network
CN114241263B (en) * 2021-12-17 2023-05-02 电子科技大学 Radar interference semi-supervised open set recognition system based on generation of countermeasure network

Also Published As

Publication number Publication date
CN108304479B (en) 2022-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant