CN103179198B - Method for mining topic-influential individuals based on multi-relational networks

Publication number
CN103179198B
Authority
CN
China
Prior art keywords
user
relation
dispatch
network
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310071162.XA
Other languages
Chinese (zh)
Other versions
CN103179198A (en
Inventor
丁兆云
贾焰
杨树强
周斌
韩伟红
李爱平
韩毅
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology

Abstract

A method for determining whether a copy relation exists between posts of network users, the method comprising: obtaining the probability distribution followed by the time interval between two posts that have an explicit retweet (forwarding) relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range within which the time interval between two posts with a copy relation should fall; computing the similarity of any two posts whose time interval lies within that range; and determining from the similarity whether a copy relation exists between the two posts.

Description

Method for mining topic-influential individuals based on multi-relational networks
Technical field
The present invention relates to the technical field of network data mining, and in particular to techniques for mining topic-influential individuals in microblogs.
Background art
Twitter-like microblogging services have recently developed rapidly as a new communication medium. According to the 29th China Internet report, by the end of December 2011 the number of microblog users in China had reached 250 million, an increase of 296.0% over the end of the previous year, and 48.7% of Internet users used microblogs. Unlike Facebook-like social networking services, social relations in a microblogging service are unidirectional: a user can follow other users without their permission. In Twitter, for example, the social network is formed by following relations. The users that a user follows are that user's friends, and the users who follow a given user are that user's followers. All tweets that a user publishes appear on the public timeline, and all of the user's messages are shown on the timelines of all of the user's followers.
As microblogging services have become widespread, large numbers of users take part in topic discussions, so microblogging services produce a large amount of information covering many topics every day. This mass of information buries the influential individuals of each topic, so mining the influential individuals of each topic from it is a challenging task.
Recently, researchers have proposed methods for mining topic-level influential individuals from Twitter data.
In reference 1 (Weng J, Lim E P, Jiang J, et al. TwitterRank: Finding topic-sensitive influential twitterers [C] // Proc of the 3rd ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2010: 261-270), Weng et al. propose the TwitterRank method to measure a user's influence on each topic in Twitter. TwitterRank outperforms PageRank and Topic-sensitive PageRank to some extent, but the transition probability of its random walk considers only the number of tweets and the topic similarity, and ignores related factors such as retweets and replies.
In reference 2 (Pal A, Counts S. Identifying topical authorities in microblogs [C] // Proc of the 4th ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2011: 45-54), Pal et al. consider multiple attributes of Twitter users in order to identify the authoritative users of each topic, but ignore the link structure of the multi-relational network, which makes it hard to characterize a user's relative influence in the whole network.
An improved technique for mining topic-influential individuals is therefore needed in this field.
Summary of the invention
One object of the present invention is to provide a method for determining whether a copy relation exists between posts of network users. A further object is to provide an improved method for mining topic-influential individuals that makes up for the incomplete measurement offered by traditional individual-influence evaluation methods.
To achieve these objects, one aspect of the invention provides a method for determining whether a copy relation exists between posts of network users, the method comprising: obtaining the probability distribution followed by the time interval between two posts that have an explicit retweet (forwarding) relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range within which the time interval between two posts with a copy relation should fall; computing the similarity of any two posts whose time interval lies within that range; and determining from the similarity whether a copy relation exists between the two posts.
Preferably, the probability distribution followed by the time interval between two posts with a copy relation is inferred to be identical to the probability distribution followed by the time interval between two posts with an explicit retweet relation.
Another aspect of the invention provides a method for mining topic-influential individuals based on multi-relational networks, comprising: extracting the retweet relations between users to construct a retweet network, and computing the transition probability with which a user in the retweet network randomly retweets another user's posts; extracting the reply relations between users to construct a reply network, and computing the transition probability with which a user in the reply network randomly replies to another user's posts; extracting, from the posts determined by the method above to have a copy relation, the copy relations between users to construct a copy network, and computing the transition probability with which a user in the copy network randomly copies another user's posts; extracting the read relations between users to construct a read network, and computing the transition probability with which a user in the read network randomly reads another user's posts; and combining these transition probabilities to compute the probability that any given user is randomly accessed by other users.
Preferably, the read relations between users are extracted based on any one or more of: the similarity of the users' posting time series patterns, the number of posts, and the interest similarity between users.
Preferably, the transition probability with which a user in the retweet network randomly retweets another user's posts is computed from the number of times that user retweets other users' posts.
Preferably, the transition probability with which a user in the reply network randomly replies to another user's posts is computed from the number of times that user replies to other users' posts.
Preferably, the transition probability with which a user in the copy network randomly copies another user's posts is computed from the time interval and the similarity between the two users' posts.
Preferably, the transition probability with which a user in the read network randomly reads another user's posts is computed based on any one or more of: the similarity of the two users' posting time series patterns, the number of posts, and the interest similarity between the two users.
Preferably, the access comprises retweeting, replying, copying and reading.
Preferably, a post is text, video, audio, or any combination thereof published by a user.
With the technical solution of the invention, it can be determined efficiently whether a copy relation exists between posts of network users. Furthermore, by using several different relational networks, the different kinds of interaction between network users can be taken into account comprehensively, so that the influential individuals in the network can be found more accurately.
Brief description of the drawings
The invention is described in detail with reference to the accompanying drawings. The drawings and the corresponding description are to be understood as illustrative and not restrictive, in which:
Fig. 1 schematically illustrates the multi-relational networks that may exist among several microblog users;
Fig. 2 shows an example distribution of the time intervals of users' retweet operations;
Fig. 3 shows the exponential distribution of the time interval between two posts that have a copy relation;
Fig. 4 schematically illustrates the daily posting time series patterns of 3 users;
Fig. 5 shows the accuracy of each algorithm on each topic; and
Fig. 6 shows the average accuracy of each algorithm over all topics.
Detailed description of embodiments
Preferred embodiments of the invention are described in detail below with reference to the drawings, taking a microblog application as the example.
To measure a user's influence at the level of a given topic comprehensively, the invention takes several types of network relation in microblogs into account. As shown in Fig. 1(a), for example, the influence of user B on user A takes 4 relation types: 1) user A retweets a post of user B, using a tag such as "RT @B" or "via @B" in A's own post; 2) user A replies to a post of user B, using a tag such as "@B" in A's own post; 3) user A copies a post of user B without using an explicit retweet tag such as "RT @B" or "via @B"; 4) user A reads a post of user B. Fig. 1(a) shows 4 directed edges (A, B) of different types between users A and B, one for each of these relation types. Fig. 1(b) shows another example of a multi-relational network, made up of 3 users and the same 4 types of directed edge. The influence network among microblog users is thus a multi-relational network. According to the four relation types above, it can be decomposed into 4 relational networks of different types: the retweet network, the reply network, the copy network and the read network.
To mine influential individuals in microblogs, the retweet, reply, copy and read networks should first be extracted; these networks can then be merged to analyze or compute individual influence. The mining process works on microblog data (for example user information, the posts users publish, and users' retweet and reply records). Those skilled in the art will understand that such data can be collected in various existing ways, which is not the focus of the invention. To avoid obscuring the invention, the process of obtaining microblog data is therefore not described further.
The above is the basic procedure of the method. In the detailed description that follows, the relevant definitions are explained first.
Related definitions
First, let $C$ be the set of all posts and $V$ the set of all users. With $k$ topics defined ($k$ a positive integer), $C^i$ and $V^i$ denote the set of all posts and the set of all users in the $i$-th ($0 < i \le k$) topic space.
The follow network in a microblog is defined as a directed unweighted graph $G_f = (V, E_f)$, where $V$ is the node set of $G_f$ (that is, the user set) and $E_f$ is the set of directed edges between the nodes of $G_f$, representing the follow relations between users. The subscript $f$ has no physical meaning; it only marks that the network or relation concerned is the follow (following) network or relation.
The multi-relational network in the $i$-th ($0 < i \le k$) topic space is defined as the multi-relational graph

$$G^i = \left(V^i,\ E_{Retweet}^i \cup E_{Reply}^i \cup E_{Copy}^i \cup E_{Read}^i\right),$$

where $V^i$ is the user set in the $i$-th topic space, and $E_{Retweet}^i$, $E_{Reply}^i$, $E_{Copy}^i$ and $E_{Read}^i$ are sets of directed edges that represent the retweet, reply, copy and read relations in the $i$-th topic space. This multi-relational graph can be decomposed into 4 graphs of different relation types: the retweet (Retweet) network graph, the reply (Reply) network graph, the copy (Copy) network graph and the read (Read) network graph.
Specifically, $G_{Retweet}^i = (V_a^i, E_{Retweet}^i, w_a)$ is the weighted directed retweet network graph, where $V_a^i$ is the set of users in the $i$-th topic space that are involved in retweet relations; $E_{Retweet}^i$ is the set of directed edges representing the retweet relations in the $i$-th topic space; and $w_a$ is the weight of a retweet edge, which can be, for example, the number of retweets between two users in the $i$-th topic space.
Likewise, $G_{Reply}^i = (V_b^i, E_{Reply}^i, w_b)$ is the weighted directed reply network graph, where $V_b^i$ is the set of users in the $i$-th topic space that are involved in reply relations; $E_{Reply}^i$ is the set of directed edges representing the reply relations in the $i$-th topic space; and $w_b$ is the weight of a reply edge, which can be, for example, the number of replies between two users in the $i$-th topic space.
$G_{Copy}^i = (V_c^i, E_{Copy}^i, w_c)$ is the weighted directed copy network graph, where $V_c^i$ is the set of users in the $i$-th topic space that are involved in copy relations; $E_{Copy}^i$ is the set of directed edges representing the copy relations between users in the $i$-th topic space; and $w_c$ is the weight of a copy edge (introduced below).
$G_{Read}^i = (V_d^i, E_{Read}^i, w_d)$ is the weighted directed read network graph, where $V_d^i$ is the set of users in the $i$-th topic space that are involved in read relations; $E_{Read}^i$ is the set of directed edges representing the read relations between users in the $i$-th topic space; and $w_d$ is the weight of a read edge (introduced below). The subscripts $a$, $b$, $c$ and $d$ only identify and distinguish the individual networks and their user sets and have no physical meaning of their own.
The transition probability computation for each network is explained below by way of preferred embodiments.
Transition probability computation for the retweet network
The random walk on the retweet network graph $G_{Retweet}^i$ is constructed as follows: in the $i$-th topic space a user is influenced by its friends and retweets a friend's posts with a certain transition probability. The random walk on the retweet network graph thus simulates users' retweet behavior in the microblog. Let $P_a^i$ be the transition probability matrix of the retweet network in the $i$-th topic space; the transition probability between users is defined as follows.
Definition 1. In the retweet network of the $i$-th topic space, the transition probability with which user $u_s^i$ randomly retweets a post of user $u_t^i$ is defined as

$$P_a^i(u_t^i \mid u_s^i) = \frac{w_a(u_s^i, u_t^i)}{\sum_{u^i \in \mathrm{out}(u_s^i)} w_a(u_s^i, u^i)},$$

where $w_a(u_s^i, u_t^i)$ is the number of times user $u_s^i$ retweets posts of user $u_t^i$ in the $i$-th topic space; $\sum_{u^i \in \mathrm{out}(u_s^i)} w_a(u_s^i, u^i)$ is the number of times user $u_s^i$ retweets posts of all its friends in the $i$-th topic space; and $\mathrm{out}(u_s^i)$ is the set of end points of all directed edges in the follow network that start at $u_s^i$, that is, the friend set of $u_s^i$.
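The normalization in Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transition_probabilities`, the dictionary layout and the sample counts are hypothetical, and the counts are assumed to cover only edges to friends within one topic space. The same row normalization applies to any of the weighted networks once its edge weights are known.

```python
from collections import defaultdict

def transition_probabilities(counts):
    """Row-normalize interaction counts into transition probabilities.

    counts maps (source_user, target_user) -> number of times the source
    forwarded (retweeted) the target within one topic space (the weight w_a).
    Returns a dict mapping source -> {target: probability}.
    """
    totals = defaultdict(float)
    for (src, _dst), w in counts.items():
        totals[src] += w

    probs = defaultdict(dict)
    for (src, dst), w in counts.items():
        if totals[src] > 0:  # users with no forwards get no outgoing row
            probs[src][dst] = w / totals[src]
    return dict(probs)

# Hypothetical counts: A forwarded B three times and C once.
retweet_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
P_a = transition_probabilities(retweet_counts)
print(P_a["A"])  # {'B': 0.75, 'C': 0.25}
```

Each row of the result sums to 1, which is what a random walk over the network requires.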
Transition probability computation for the reply network
The random walk on the reply network graph $G_{Reply}^i$ is constructed as follows: in the $i$-th topic space a user is influenced by its friends and replies to a friend's posts with a certain transition probability. The random walk on the reply network graph thus simulates users' reply behavior in the microblog. Let $P_b^i$ be the transition probability matrix of the reply network in the $i$-th topic space; the transition probability between users is defined as follows.
Definition 2. In the reply network of the $i$-th topic space, the transition probability with which user $u_s^i$ randomly replies to a post of user $u_t^i$ is defined as

$$P_b^i(u_t^i \mid u_s^i) = \frac{w_b(u_s^i, u_t^i)}{\sum_{u^i \in \mathrm{out}(u_s^i)} w_b(u_s^i, u^i)},$$

where $w_b(u_s^i, u_t^i)$ is the number of times user $u_s^i$ replies to posts of user $u_t^i$ in the $i$-th topic space; $\sum_{u^i \in \mathrm{out}(u_s^i)} w_b(u_s^i, u^i)$ is the number of times user $u_s^i$ replies to posts of all its friends in the $i$-th topic space; and $\mathrm{out}(u_s^i)$ is the set of end points of all directed edges in the follow network that start at $u_s^i$, that is, the friend set of $u_s^i$.
Transition probability computation for the copy network
Because microblogs have no explicit "copy" tag, the "copy" relations must first be inferred in order to construct the random walk on the copy network graph $G_{Copy}^i$, so as to uncover the implicit relation edges.
One embodiment of the invention takes into account both the time interval between two posts and the similarity of the posts. Two posts with a "copy" relation usually have highly similar content, so a simple way to infer "copy" relations is to compute the similarity between posts and, if the similarity exceeds some threshold, to infer the source of the post. This simple method has to compute the similarity between all posts of all of a user's friends, which is computationally expensive. To reduce the cost, one embodiment considers not only the similarity between posts but also the time interval $\Delta t$ between them.
"Copy" behavior in microblogs is to some extent retweet behavior that simply lacks an explicit retweet tag such as "RT @B" or "via @B". The probability distribution of the set $T_{retweet}$ of time intervals $\Delta t_{retweet}$ between posts with an explicit retweet relation is therefore used here to fit the probability distribution of the set $T_{copy}$ of time intervals $\Delta t_{copy}$ between posts with a "copy" relation. On this basis, the invention proposes a method for determining whether a copy relation exists between posts of network users, which can comprise the following steps: obtaining the probability distribution followed by the time interval between two posts that have an explicit retweet relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range within which the time interval between two posts with a copy relation should fall; computing the similarity of any two posts whose time interval lies within that range; and determining from the similarity whether a copy relation exists between the two posts.
Specifically, in one embodiment, a sample of 71000 (other sample sizes are also feasible) time intervals $\Delta t_{retweet}$ between posts with an explicit retweet relation is first drawn at random from the data set, giving the set $T_{retweet}$ with $|T_{retweet}| = 71000$. Fig. 2 shows the distribution of this data. As Fig. 2 shows, most time intervals are only a few hours long; only a small portion are longer, and a few even exceed 10 days. To characterize the distribution of time intervals more precisely, the long-tail points with intervals of more than 10 days can be removed. Removing these points is reasonable, because data analysis shows that users who retweet after more than 10 days are usually spam users.
After the long-tail points with intervals of more than 10 days are removed, the sample size becomes 69770, that is, $|T'_{retweet}| = 69770$. Statistics show that the time interval $\Delta t$ roughly follows an exponential distribution. Given the sample set $T'_{retweet}$, the parameter of the exponential distribution is estimated as $\lambda = 1.9768 \times 10^4$, and the probability density function of the exponential distribution is

$$f(x) = \begin{cases} \dfrac{1}{19768}\, e^{-x/19768}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

where $e$ is the base of the natural logarithm.
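The text does not say how the parameter was estimated. As a hedged sketch, one common choice is the maximum-likelihood estimate, which for a density of the form $(1/\lambda)\,e^{-x/\lambda}$ is simply the sample mean. The function name `fit_exponential_scale`, the use of seconds as the unit and the sample values below are assumptions made for illustration only.

```python
import statistics

TEN_DAYS_S = 10 * 24 * 3600  # long-tail cut-off from the text, assumed in seconds

def fit_exponential_scale(intervals_s, cutoff_s=TEN_DAYS_S):
    """Return the maximum-likelihood scale (the mean) of an exponential
    distribution fitted to the intervals that do not exceed cutoff_s."""
    kept = [x for x in intervals_s if 0 <= x <= cutoff_s]
    if not kept:
        raise ValueError("no intervals left after trimming")
    return statistics.fmean(kept)

# Hypothetical intervals in seconds; the last one exceeds 10 days
# and is dropped before fitting.
sample = [600, 3600, 7200, 40000, 2_000_000]
scale = fit_exponential_scale(sample)
print(scale)  # 12850.0
```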
The set of time intervals between posts with an explicit retweet relation thus follows an exponential distribution with parameter $\lambda = 1.9768 \times 10^4$. Because "copy" behavior is to some extent retweet behavior, it can be inferred that the time interval between two posts with a "copy" relation also roughly follows the exponential distribution with parameter $\lambda = 1.9768 \times 10^4$, whose distribution function (see Fig. 3) is

$$F(x) = \begin{cases} 1 - e^{-x/19768}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

From this exponential distribution, the range $\Delta t_{range}$ of the time interval between two posts with a "copy" relation can be given approximately. Weighing computational cost against precision, one embodiment sets, as an example, $\Delta t_{range} = (0\ \mathrm{ks},\ 1.08 \times 10^2\ \mathrm{ks}]$, where ks denotes 1000 seconds, the round bracket "(" excludes the endpoint and the square bracket "]" includes it. According to the exponential distribution, the recall $R$ of the inferred "copy" relations is

$$R = F(1.08 \times 10^5) - F(0) = 99.58\%.$$

If the time interval $\Delta t$ between two posts is not in $\Delta t_{range}$, the probability that a "copy" relation exists is very low, so the similarity of the two posts need not be computed, which reduces the computational cost.
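The recall figure can be checked numerically. The short Python sketch below evaluates the distribution function at the two ends of the window and also shows the cheap time test that decides whether a pair of posts is worth comparing at all. The function names are hypothetical; the window of 108 ks is taken as 1.08 × 10^5 seconds, as in the recall formula.

```python
import math

SCALE_S = 19768.0   # fitted parameter from the text, in seconds
WINDOW_S = 1.08e5   # upper bound of the interval window (108 ks)

def exponential_cdf(x, scale=SCALE_S):
    """F(x) = 1 - exp(-x / scale) for x >= 0, else 0."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0

def within_window(delta_t_s, upper_s=WINDOW_S):
    """True when 0 < delta_t <= upper bound, i.e. the pair is worth comparing."""
    return 0 < delta_t_s <= upper_s

recall = exponential_cdf(WINDOW_S) - exponential_cdf(0.0)
print(f"{recall:.2%}")  # 99.58%
```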
To infer whether posts published by users in a friend relation have a "copy" relation, the following two conditions must therefore hold at the same time:
1) $\mathrm{sim}(p_t, p_s) \ge \xi$;
2) $\Delta t \in \Delta t_{range} = (0\ \mathrm{ks},\ 1.08 \times 10^2\ \mathrm{ks}]$.
The first condition requires the similarity $\mathrm{sim}(p_t, p_s)$ of two posts $p_t$ and $p_s$ to be greater than or equal to a threshold $\xi$. In one embodiment the cosine of the angle between the post vectors is used to compute the similarity of two posts; other document-similarity measures, such as the Kullback-Leibler (KL) distance, are equally effective.

$$\mathrm{sim}(p_t, p_s) = \cos(v_t, v_s),$$

where $v_t$ and $v_s$ are the vectors of the two posts.
The second condition requires the time interval between the two posts to lie within a certain threshold. The value $1.08 \times 10^2$ ks used above is only an example; those skilled in the art will understand that any other suitable value can be used, depending on the trade-off between computational cost and precision.
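The two conditions can be combined into a single check, sketched below in Python. Several details are assumptions that the text leaves open: posts are turned into vectors by counting whitespace-separated tokens (Chinese text would need a word segmenter instead), the threshold value 0.8 is a placeholder for the unspecified ξ, and the names `cosine_similarity` and `is_copy` are hypothetical.

```python
import math
from collections import Counter

WINDOW_S = 1.08e5  # interval window from the text (108 ks), in seconds

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between whitespace-token count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())
                     * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def is_copy(later, earlier, threshold=0.8, window_s=WINDOW_S):
    """later and earlier are (timestamp_seconds, text) pairs.

    The cheap time test runs first, so the similarity is computed only
    for pairs whose interval falls inside the window.
    """
    delta_t = later[0] - earlier[0]
    if not (0 < delta_t <= window_s):
        return False
    return cosine_similarity(later[1], earlier[1]) >= threshold

original = (0, "big news about the launch today")
print(is_copy((1800, "big news about the launch today wow"), original))  # True
print(is_copy((200000, "big news about the launch today"), original))    # False
```

Running the time test before the similarity computation is what saves cost: pairs outside the window never reach the more expensive text comparison.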
Define two posts with a "copy" relation as the 2-tuple $\langle p_t, p_s \rangle$; all post pairs with a "copy" relation between two friends then form a set of 2-tuples $U$. From this, the copy network graph $G_{Copy}^i$ can be inferred to be a weighted directed graph, in which the weight of the "copy" relation from user $u_s^i$ to user $u_t^i$ (that is, the "copy" relation produced by user $u_s^i$ copying posts of user $u_t^i$) is defined as

$$w_c(u_s^i, u_t^i) = \sum_{\langle p_t^i, p_s^i \rangle \in U_{s,t}^i} \mathrm{sim}(p_s^i, p_t^i) \times f\!\left(\Delta t_{p_s^i, p_t^i}\right).$$

$U_{s,t}^i$ is the set of 2-tuples of all post pairs in the $i$-th topic space that have a "copy" relation between users $u_s^i$ and $u_t^i$ and in which the copying is initiated by user $u_s^i$. The $f$ in this equation is a function that is usually chosen so that its value is higher the smaller $\Delta t$ is. The weight thus combines the similarity and the time interval of the two posts: the higher the similarity between two posts and the smaller the time interval, the higher the probability that a "copy" relation exists.
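The weight can be sketched in Python as a sum over detected copy pairs. The text only requires that the function of the time interval grow as the interval shrinks, so the exponential decay used here is one hypothetical choice among many, and the names `time_decay` and `copy_weight` and the sample pairs are illustrative assumptions.

```python
import math

def time_decay(delta_t_s, scale_s=19768.0):
    """Hypothetical choice for the decreasing function of the interval:
    the value falls from 1 toward 0 as the interval grows."""
    return math.exp(-delta_t_s / scale_s)

def copy_weight(pairs):
    """pairs is an iterable of (similarity, delta_t_seconds), one entry for
    every detected copy pair between one copying user and one source user."""
    return sum(sim * time_decay(dt) for sim, dt in pairs)

# Two hypothetical copy pairs: the quick, near-identical one dominates.
w_c = copy_weight([(0.95, 600.0), (0.85, 50000.0)])
print(round(w_c, 3))
```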
The random walk on the copy network graph $G_{Copy}^i$ is constructed as follows: in the $i$-th topic space a user is influenced by its friends and copies a friend's posts with a certain transition probability. The random walk on the copy network graph thus simulates users' copying behavior in the microblog. Let $P_c^i$ be the transition probability matrix of the copy network in the $i$-th topic space; the transition probability between users is defined as follows.
Definition 3. In the copy network of the $i$-th topic space, the transition probability with which user $u_s^i$ randomly copies a post of user $u_t^i$ is defined as

$$P_c^i(u_t^i \mid u_s^i) = \frac{w_c(u_s^i, u_t^i)}{\sum_{u^i \in \mathrm{out}(u_s^i)} w_c(u_s^i, u^i)},$$

where $w_c(u_s^i, u_t^i)$ is the weight of the "copy" relation from user $u_s^i$ to user $u_t^i$ in the $i$-th topic space (that is, the "copy" relation produced by user $u_s^i$ copying posts of user $u_t^i$), and $\sum_{u^i \in \mathrm{out}(u_s^i)} w_c(u_s^i, u^i)$ is the sum of the weights of the "copy" relations from user $u_s^i$ to all of its friends in the $i$-th topic space.
Transition probability computation for the read network
The more users read a post that a user publishes, the wider the reach of that post, to some extent. To construct the random walk on the read network graph $G_{Read}^i$, the read network must first be constructed.
A simple idea for constructing the read network graph is to use the follow network $G_f = (V, E_f)$ to build the relations between users and to use the number of posts a user publishes as the edge weight. Intuitively, in a given topic space, the more posts a user publishes and the more followers the user has, the wider the reach of the user's posts.
The TwitterRank algorithm extends this simple method by adding the topic similarity between friends; intuitively, a user is more likely to read the posts of friends whose topics resemble the user's own. The TwitterRank transition probability is

$$P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\ s_i \text{ follows } s_a} |\tau_a|} \times \mathrm{sim}_t(i, j),$$

where $|\tau_j|$ is the number of posts published by user $u_j$; $\sum_{a:\ s_i \text{ follows } s_a} |\tau_a|$ is the number of posts published by all friends of user $u_i$; and $\mathrm{sim}_t(i, j)$ is the similarity between the two users in topic space $t$.
In a microblog, all posts that a user publishes are actively pushed to the timelines of the user's followers, and a follower who logs in to their personal homepage can usually read the information shown there. It can be inferred that the smaller the interval between the time a user logs in and the time a friend posts, the more likely the user is to read the friend's post. User login times, however, are hard to obtain. In one embodiment, therefore, each user's daily posting time series pattern is computed from statistics on a certain number of the user's posts, and it is assumed that the more similar the daily posting time series patterns of two users are, the more likely a "read" relation exists. Because the statistical pattern of posting times over a certain number of posts reflects the user's login pattern to some extent, a higher similarity of daily posting patterns indicates that the user's login times are closer to the friend's posting times, so the user is more likely to read the friend's posts. The assumption is therefore reasonable.
For example, Fig. 4 shows the daily posting time series patterns of 3 users. Suppose user A follows both user B and user C. The figure shows that the similarity between the posting patterns of A and B is clearly higher than that between A and C, so A is more likely to read B's posts: B is more likely to publish posts while A is online, the microblogging service then pushes them to A's personal homepage, and A is thus more likely to read them.
In one embodiment, a user's posting pattern is measured by the probability that the user posts in each hour of the 24 hours of a day, and the posting time series pattern is defined as follows.
Definition 4. For any user $u$, the 2-tuple $\langle t, p \rangle$ states that the probability that the user posts within hour $t$ is $p$. The time series set $\{t_0, t_1, \ldots, t_{23}\}$ ($t_0 < t_1 < \cdots < t_{23}$) represents 24 discrete points, each with a duration of 1 hour. The user's posting time series pattern is then defined as

$$ts = \langle ts_0 = \langle t_0, p_0 \rangle,\ ts_1 = \langle t_1, p_1 \rangle,\ \ldots,\ ts_{23} = \langle t_{23}, p_{23} \rangle \rangle.$$

The time pattern of the $N$ posts published by each user is counted, and the probability that the user posts in hour $i$ is computed as $p_i = N_i / N$, where $N_i$ is the number of the user's $N$ posts that were published in hour $i$ of any day. The more posts a user publishes in hour $i$, the higher the probability that the user is online in the microblog during that hour of the day.
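A minimal Python sketch of the hourly posting pattern follows. It assumes that the posting probability of an hour is the share of the user's posts published in that hour, which is how the surrounding definitions read; the function name `posting_time_series` and the sample timestamps are hypothetical.

```python
from datetime import datetime

def posting_time_series(timestamps):
    """Return a 24-element list whose entry i is the share of the user's
    posts published during hour i of the day (0 to 23)."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    total = len(timestamps)
    if total == 0:
        return [0.0] * 24
    return [c / total for c in counts]

# Hypothetical posting times of one user over three days.
posts = [
    datetime(2012, 3, 1, 9, 15),
    datetime(2012, 3, 1, 9, 40),
    datetime(2012, 3, 2, 21, 5),
    datetime(2012, 3, 3, 9, 59),
]
series = posting_time_series(posts)
print(series[9], series[21])  # 0.75 0.25
```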
The closer the hourly posting probabilities of several users are, the more similar their time-series patterns are. In one embodiment, the Euclidean distance can be used to measure time-series similarity.
Specifically, let Q and C denote two posting time series; let $q_i$ denote the value of Q at the i-th point and $c_i$ the value of C at the i-th point; and let i and n denote, respectively, the index of the current point and the length of the whole sequence. The similarity of the two time series is then computed as:
$$\mathrm{simSeries}(Q, C) = \frac{1}{\sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}}, \quad |Q| \neq 0 \text{ and } |C| \neq 0,$$
where |Q| and |C| denote the lengths of the two sequences (the two lengths are equal). The smaller the Euclidean distance, the higher the similarity.
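As an illustration of the Euclidean similarity described above, the sketch below takes the similarity to be the reciprocal of the Euclidean distance between two equal-length sequences. The function name `sim_series` is hypothetical, and returning infinity for identical sequences is an assumption made here, since the reciprocal is undefined when the distance is zero.

```python
import math

def sim_series(q, c):
    """Reciprocal Euclidean distance between two equal-length sequences."""
    if len(q) != len(c) or len(q) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    dist = math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
    # Assumption: identical sequences are treated as maximally similar.
    return math.inf if dist == 0 else 1.0 / dist

near = sim_series([0.5, 0.5], [0.6, 0.4])   # small distance, high similarity
far = sim_series([0.5, 0.5], [1.0, 0.0])    # large distance, low similarity
```

In practice the inputs would be the 24-element hourly probability vectors of two users.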
In one embodiment, the probability that a reading relation exists between users can be assumed to depend on the following three factors:
1) a user reads friends who publish many posts with higher probability;
2) a user reads friends with high topic similarity with higher probability;
3) a user reads friends with high posting time-series pattern similarity with higher probability.
Therefore, the probability that user $u_s$ reads the posts of friend $u_t$ can be defined as follows:
$$P_{read}(u_s, u_t) = \frac{\tau_t \times \mathrm{sim}(u_s, u_t) \times \mathrm{simSeries}(u_s, u_t)}{\sum_{u \in out(u_s)} \tau_u \times \mathrm{sim}(u_s, u) \times \mathrm{simSeries}(u_s, u)},$$
where τ denotes the number of posts published by the corresponding user in the data set (excluding posts that have a forwarding, reply, or replication relation); $\mathrm{sim}(u_s, u_t)$ denotes the topic similarity between user $u_s$ and the followed friend $u_t$; $\mathrm{simSeries}(u_s, u_t)$ denotes the posting time-series similarity between user $u_s$ and the followed friend $u_t$; and $out(u_s)$ denotes the set of friends that user $u_s$ follows. The above embodiment takes all three factors into account. It will be appreciated that in some embodiments, for example to reduce computational complexity, only one or two of these factors may be considered.
In one embodiment of the invention, all posts of a user are treated as a single document for computing the topic similarity $\mathrm{sim}(u_s, u_t)$, and the topic distribution of each user is likewise determined with an LDA model. Define t as the "user-topic" distribution vector of a user, i.e. $t = (t_1, t_2, \ldots, t_k)^T$, where $t_1, t_2, \ldots, t_k$ are the elements of the "user-topic" distribution vector and represent the user's probability distribution over the topic spaces, k is the configured number of topics, and the user's topic category set is written $\{t_1, t_2, \ldots, t_k\}$.
User topic similarity is defined through the KL distance between the users' topic category sets $\{t_1, t_2, \ldots, t_k\}$:
$$\mathrm{sim}(u_s, u_t) = \frac{1}{KL(u_s, u_t)} = \frac{1}{\sum_{0 < i \le k} t_s^i \lg \dfrac{t_s^i}{t_t^i}}, \quad |u_t| \neq 0 \text{ and } |u_s| \neq 0,$$
where $|u_s|$ and $|u_t|$ denote the numbers of posts published by the respective users; $t_s^i$ denotes the probability of user $u_s$ in the i-th topic space; and $t_t^i$ denotes the probability of user $u_t$ in the i-th topic space. The smaller the KL distance, the higher the topic similarity.
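The KL-based topic similarity can be sketched as below. The helper names are hypothetical, and several points are assumptions rather than statements of the source: the garbled logarithm in the formula is read as the base-10 "lg", both topic vectors are assumed strictly positive (as smoothed LDA output typically is), and identical distributions are mapped to infinite similarity because the reciprocal of a zero distance is undefined.

```python
import math

def kl_distance(ts, tt):
    """KL distance sum_i ts[i] * lg(ts[i] / tt[i]) between two topic vectors."""
    # Assumption: every entry of tt is strictly positive.
    return sum(p * math.log10(p / q) for p, q in zip(ts, tt) if p > 0)

def topic_similarity(ts, tt):
    """Reciprocal of the KL distance; smaller distance gives higher similarity."""
    d = kl_distance(ts, tt)
    # Assumption: a zero distance is treated as maximal similarity.
    return math.inf if d <= 0 else 1.0 / d

close = topic_similarity([0.5, 0.5], [0.6, 0.4])
distant = topic_similarity([0.5, 0.5], [0.9, 0.1])
```

Note that the KL distance is not symmetric, so `topic_similarity(a, b)` and `topic_similarity(b, a)` generally differ.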
If $|u_t| = 0$, i.e. the friend $u_t$ followed by user $u_s$ has published no posts in the data set, then the probability that user $u_s$ reads friend $u_t$ is 0.
If $|u_s| = 0$, i.e. user $u_s$ has published no posts in the data set, the probability that user $u_s$ reads friend $u_t$ is defined as:
$$P_{read}(u_s, u_t) = \frac{\tau_t}{\sum_{u \in out(u_s)} \tau_u}$$
That is, the probability of reading a friend then depends only on the number of posts the friend has published: the more posts a friend publishes, the greater the probability of reading that friend.
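The reading probability and its two special cases can be put together in one small function. This is a sketch under stated assumptions: the name `read_probability` and its argument layout are hypothetical, the topic and time-series similarities are passed in as callables, and a single post count per user (`tau`) stands in for both the post count τ and the emptiness checks |u_s| = 0 and |u_t| = 0.

```python
def read_probability(s, t, friends, tau, sim, sim_series):
    """Probability that user s reads the posts of followed friend t.

    friends: users followed by s; tau: dict user -> post count;
    sim, sim_series: callables (s, u) -> similarity value.
    """
    if tau.get(t, 0) == 0:
        # The friend published nothing, so there is nothing to read.
        return 0.0
    if tau.get(s, 0) == 0:
        # The reader published nothing: fall back to post counts only.
        return tau[t] / sum(tau.get(u, 0) for u in friends)

    def score(u):
        return tau.get(u, 0) * sim(s, u) * sim_series(s, u)

    total = sum(score(u) for u in friends)
    return score(t) / total if total > 0 else 0.0

# Hypothetical data: user "a" follows "b" and "c".
tau = {"a": 10, "b": 30, "c": 10}
topic = {("a", "b"): 2.0, ("a", "c"): 1.0}
p_ab = read_probability("a", "b", ["b", "c"], tau,
                        lambda s, u: topic[(s, u)], lambda s, u: 1.0)
```

By construction the probabilities over all followed friends of a user sum to 1 whenever at least one friend has a positive score.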
In one embodiment, for the set $U_i$ of all users in the i-th topic space, a "reading" relation is inferred between a user u in $U_i$ and each follower of u only if the following condition is satisfied:
$$t_u^i \times P_{read} \geq \eta$$
where $t_u^i$ denotes the probability of user u in the i-th topic space, and $P_{read}$ denotes the probability that a follower of user u reads that user's posts. That is, the "reading" probability of the two users in the i-th topic space must be greater than or equal to a threshold η.
The reading network graph is a weighted directed graph. In the i-th topic space, the weight of the reading relation from user $u_s^i$ to user $u_t^i$ (that is, the reading relation that arises because user $u_s^i$ reads the posts of user $u_t^i$) can be defined as follows:
$$w_d(u_s^i, u_t^i) = t_t^i \times P_{read}(u_s^i, u_t^i)$$
where $t_t^i$ denotes the probability of user $u_t^i$ in the i-th topic space.
Let $P_d^i$ be the transition probability matrix of the reading network graph in the i-th topic space. The transition probability between users is then defined as follows.
Definition 5. In the reading network in the i-th topic space, the transition probability that user $u_s^i$ randomly reads the posts of user $u_t^i$ is defined as:
$$P_d^i(u_t^i \mid u_s^i) = \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)},$$
where $w_d(u_s^i, u_t^i)$ denotes the weight, in the i-th topic space, of the "reading" relation from user $u_s^i$ to user $u_t^i$ (that is, the reading relation that arises because user $u_s^i$ reads the posts of user $u_t^i$), and $\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)$ denotes the sum, in the i-th topic space, of the weights of the "reading" relations from user $u_s^i$ to all of that user's friends.
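The construction of the reading network for one topic space can be sketched as follows: edges are kept only when the topic probability of the followed user times the reading probability reaches the threshold η, the same product serves as the edge weight, and weights are normalised per reader. The function name and the dictionary layout are hypothetical, and normalising over the retained edges only (rather than over all followed friends) is an assumption.

```python
def reading_transitions(p_read, topic_prob, eta):
    """Build reading-network transition probabilities for one topic space.

    p_read: dict (reader, friend) -> reading probability;
    topic_prob: dict user -> probability of that user in the topic space;
    eta: threshold below which no reading relation is assumed.
    """
    # Edge weight: topic probability of the friend times reading probability.
    weights = {}
    for (s, t), p in p_read.items():
        w = topic_prob[t] * p
        if w >= eta:
            weights[(s, t)] = w
    # Sum of outgoing weights per reader, used for normalisation.
    out_sum = {}
    for (s, _), w in weights.items():
        out_sum[s] = out_sum.get(s, 0.0) + w
    return {(s, t): w / out_sum[s] for (s, t), w in weights.items()}

# Hypothetical data: "a" reads "b" with probability 0.75 and "c" with 0.25.
p_read = {("a", "b"): 0.75, ("a", "c"): 0.25}
topic_prob = {"b": 0.4, "c": 0.8}
trans = reading_transitions(p_read, topic_prob, eta=0.1)
```

With a higher threshold some edges disappear, and the remaining outgoing probabilities of each reader are rescaled to sum to 1.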
The embodiments for calculating transition probabilities in the forwarding network, the reply network, the replication network, and the reading network have been described above in that order. In actual computation, however, the transition probabilities in these networks can be calculated in any order, and one or more of the calculations can be performed in parallel; the order is not limited to the one described above.
Method for merging the multi-relational networks
To measure the influence of an individual microblog user comprehensively, the transition probabilities in the above four networks can be combined for each user to calculate that user's influence rank score. The influence rank score reflects the probability that other users randomly access this user (that is, forward, reply to, copy, or read this user's posts), and therefore reflects the user's topic influence.
The influence that friends exert on a user appears as a random walk inside each of the above four influence networks, and the walk also jumps from one influence network to another with a certain probability. In one embodiment of the invention, the probabilities that a user stays in the forwarding network, the reply network, the replication network, and the reading network are defined as λ1, λ2, λ3, and λ4 respectively, with λ1 + λ2 + λ3 + λ4 = 1. The user then jumps from the forwarding network to the other networks with probability 1-λ1, from the reply network to the other networks with probability 1-λ2, from the replication network to the other networks with probability 1-λ3, and from the reading network to the other networks with probability 1-λ4. Taking the stay probability λ of each network into account, and letting B denote the transition probability matrix, the transition probabilities between users (for example, from user $u_s^i$ to user $u_t^i$) in the four networks in the i-th topic space are defined as follows:
1) Forwarding network:
$$B_a^i(u_t^i \mid u_s^i) = \lambda_1 \frac{w_a(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_a(u_s^i, u^i)}$$
2) Reply network:
$$B_b^i(u_t^i \mid u_s^i) = \lambda_2 \frac{w_b(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_b(u_s^i, u^i)}$$
3) Replication network:
$$B_c^i(u_t^i \mid u_s^i) = \lambda_3 \frac{w_c(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_c(u_s^i, u^i)}$$
4) Reading network:
$$B_d^i(u_t^i \mid u_s^i) = \lambda_4 \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)}$$
In yet another embodiment of the invention, following the PageRank algorithm, a user not only walks randomly along the network (for example, by following a friend's link to visit that friend's microblog directly) but also jumps to other nodes at random with a certain probability β (for example, by visiting, through other means such as manual input, the microblog of a user who is not a friend). This embodiment therefore considers both the jump probability β between nodes inside a network and the stay probability λ of each network. Letting B denote the transition probability matrix, the transition probabilities between users (for example, from user $u_s^i$ to user $u_t^i$) in the four networks in the i-th topic space are defined as follows:
1) Forwarding network:
$$B_a^i(u_t^i \mid u_s^i) = \lambda_1 \times (1 - \beta) \times \frac{w_a(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_a(u_s^i, u^i)} + \frac{\beta}{n}$$
2) Reply network:
$$B_b^i(u_t^i \mid u_s^i) = \lambda_2 \times (1 - \beta) \times \frac{w_b(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_b(u_s^i, u^i)} + \frac{\beta}{n}$$
3) Replication network:
$$B_c^i(u_t^i \mid u_s^i) = \lambda_3 \times (1 - \beta) \times \frac{w_c(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_c(u_s^i, u^i)} + \frac{\beta}{n}$$
4) Reading network:
$$B_d^i(u_t^i \mid u_s^i) = \lambda_4 \times (1 - \beta) \times \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)} + \frac{\beta}{n}$$
In the above formulas, n denotes the number of nodes in the corresponding network.
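The per-network transition probabilities with a stay probability λ and a random-jump probability β can be sketched as below. The function name and data layout are hypothetical. The combination is read here as λ × (1 - β) × (normalised weight) + β / n, which is an interpretation of the formulas above, and setting β = 0 recovers the variant without random jumps.

```python
def merged_transition(weights, lam, beta, n):
    """Transition probabilities for one relation network.

    weights: dict (source, target) -> edge weight in that network;
    lam: probability of staying in this network;
    beta: probability of a random jump to an arbitrary node;
    n: number of nodes in the network.
    """
    # Sum of outgoing weights per source user.
    out_sum = {}
    for (s, _), w in weights.items():
        out_sum[s] = out_sum.get(s, 0.0) + w
    return {
        (s, t): lam * (1 - beta) * w / out_sum[s] + beta / n
        for (s, t), w in weights.items()
        if out_sum[s] > 0
    }

# Hypothetical forwarding counts: "a" forwarded "b" three times and "c" once.
w_forward = {("a", "b"): 3, ("a", "c"): 1}
no_jump = merged_transition(w_forward, lam=0.4, beta=0.0, n=3)
with_jump = merged_transition(w_forward, lam=0.4, beta=0.15, n=3)
```

Only pairs that already have an edge are returned; a full implementation would also have to decide how the β / n term applies to pairs without an edge, which the text above does not spell out.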
Let $r^i(u)$ be the rank score of user u in the i-th topic space. Considering the user's random walk in the four networks, in one embodiment the rank score of user u in the i-th topic space is defined as follows:
$$r^i(u) = \sum_{(u_t^i, u) \in E^i_{Retweet}} B_a^i(u \mid u_t^i)\, r^i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Reply}} B_b^i(u \mid u_t^i)\, r^i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Copy}} B_c^i(u \mid u_t^i)\, r^i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Read}} B_d^i(u \mid u_t^i)\, r^i(u_t^i).$$
That is, the rank score of a user is determined mainly by the probability that the user's followers randomly access that user.
Rearranging the above formula, the rank score of user u in the i-th topic space can be calculated as:
$$r^i(u) = \Big( \sum_{(u_t^i, u) \in E^i_{Retweet}} B_a^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Reply}} B_b^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Copy}} B_c^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Read}} B_d^i(u \mid u_t^i) \Big) \times r^i(u_t^i).$$
Let M be the transition probability matrix of the random walk model that merges the multi-relational networks. The iterative model of the random walk over the multi-relational networks is then:
$$\pi_i^{t+1} = M \pi_i^t$$
where $\pi_i^t$ is the vector of user rank scores in the i-th topic space at the t-th iteration.
The above formula shows that the iterative random walk model over the multi-relational networks is an ergodic Markov process. Therefore, an initial vector can be given (for example, every node in the network can initially be assigned the same rank score), and after n iterations the result gradually converges. The iteration can be stopped once a given stopping condition is met.
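The iteration can be sketched as a plain power iteration over a sparse edge dictionary. The function name `multirank`, the edge layout, the per-step renormalisation, and the L1 stopping rule are all assumptions made for this illustration; the text above only requires some initial vector and some stopping condition.

```python
def multirank(edges, nodes, tol=1e-10, max_iter=1000):
    """Iterate rank scores until the change between iterations is small.

    edges: dict (source, target) -> merged transition probability from
    source to target (summed over the four relation networks);
    nodes: list of all users.
    """
    # Every node starts with the same rank score.
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(max_iter):
        new = {u: 0.0 for u in nodes}
        for (s, t), b in edges.items():
            new[t] += b * rank[s]
        total = sum(new.values())
        if total > 0:
            # Assumption: renormalise so the scores keep summing to 1.
            new = {u: v / total for u, v in new.items()}
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# Hypothetical 3-user network with already merged transition probabilities.
edges = {("a", "b"): 1.0, ("b", "a"): 0.5, ("b", "c"): 0.5, ("c", "a"): 1.0}
scores = multirank(edges, ["a", "b", "c"])
```

In this toy graph users "a" and "b" end up with equal scores and "c" with half of that, which matches the stationary distribution of the corresponding Markov chain.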
Experimental result
The effect of the method of the invention is illustrated below with experimental results.
The experiment uses 261,954 Chinese-language users of Twitter, of whom 103,836 (39.6% of the Chinese-language users) published tweets during the three months from 2011-04-15 to 2011-07-15; 2,660,281 tweets were collected in total.
To verify the effectiveness of the method of the invention, the accuracy of influential-individual mining was evaluated. Because the truly influential individuals in Twitter are difficult to determine manually, the accuracy of each algorithm is determined by cross-validation among several algorithms. The experiment considers six algorithms for mining influential individuals:
1) a social network formed from forwarding relations, in which the PageRank algorithm is used to find the influential individuals in the forwarding network; this algorithm is referred to as RepostRank;
2) a social network formed from reply relations, in which the PageRank algorithm is used to find the influential individuals in the reply network; this algorithm is referred to as ReplyRank;
3) the TwitterRank algorithm proposed by Weng et al.;
4) measuring individual influence by the number of followers; this algorithm is referred to as FollowerNum;
5) measuring individual influence by the number of posts; this algorithm is referred to as TweetNum;
6) the random walk model over the multi-relational networks proposed by the invention; this algorithm is referred to as MultiRank. (Figs. 5 and 6 show three different parameter settings of the MultiRank algorithm, named MultiRank1, MultiRank2, and MultiRank3 respectively, in order to further measure the effect of different parameter settings on the MultiRank algorithm.)
The experiment uses a cross-validation method, in which results agreed on by multiple (N) algorithms are taken as the reference correct results. For example, given four algorithms A, B, C, and D whose Top-K sets of highly influential individuals are $I_A$, $I_B$, $I_C$, and $I_D$ respectively, and supposing that results agreed on by two algorithms are taken as the reference correct results, the reference standard set of influential individuals is defined as:
$$I_2 = (I_A \cap I_B) \cup (I_A \cap I_C) \cup (I_A \cap I_D) \cup (I_B \cap I_C) \cup (I_B \cap I_D) \cup (I_C \cap I_D)$$
The accuracy P (precision) reflects how many of the influential individuals found in Twitter are genuine. The accuracy of algorithm A in finding influential individuals is defined as follows:
$$P_A = \frac{|I_A \cap I_2|}{|I_A|}$$
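The cross-validation procedure can be sketched with plain set operations: the reference set is the union of the intersections of every group of N Top-K result sets, and the precision of one algorithm is the share of its Top-K set that falls inside the reference set. The function names and the sample Top-K sets are hypothetical.

```python
from itertools import combinations

def reference_set(top_sets, n_agree):
    """Users that appear in the Top-K sets of at least n_agree algorithms."""
    ref = set()
    for combo in combinations(top_sets.values(), n_agree):
        ref |= set.intersection(*combo)
    return ref

def precision(name, top_sets, n_agree):
    """Share of one algorithm's Top-K set that lies in the reference set."""
    own = top_sets[name]
    if not own:
        return 0.0
    return len(own & reference_set(top_sets, n_agree)) / len(own)

# Hypothetical Top-K results of four algorithms (user ids as integers).
top_sets = {"A": {1, 2, 3, 4}, "B": {2, 3, 5}, "C": {3, 4, 6}, "D": {7}}
ref2 = reference_set(top_sets, 2)
p_a = precision("A", top_sets, 2)
```

Raising `n_agree` shrinks the reference set, which is consistent with the observation in the experiments that accuracy falls as N grows.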
Using each of the above algorithms, the experiment obtains the Top-10, 20, 50, 100, 200, and 500 influential individuals in each topic.
For N = 2, 3, 4, 5, 6, and 7, the distribution over the topic categories of the average accuracy of the influential-individual mining algorithms for Top-10, 20, 50, 100, 200, and 500 is shown in Fig. 5, in which the ordinate represents accuracy and the abscissa represents the topics.
The experimental results in Fig. 5 show that, for N = 2, 3, 4, 5, 6, and 7, the MultiRank algorithm proposed by the invention achieves higher accuracy in all 10 topic categories when the second group of parameters is selected (MultiRank2 in Fig. 5). The parameters of MultiRank2 are chosen to balance the effect of network scale on the influence rank score $r^i(u)$, whereas the parameters of MultiRank1 and MultiRank3 are chosen so that, respectively, the rank score of the reading network dominates $r^i(u)$, and the rank scores of the forwarding, reply, and replication networks dominate $r^i(u)$. Because MultiRank1 and MultiRank3 are each affected by network scale in the multi-relational networks, their accuracy is relatively lower.
The experimental results also show that, in every topic, accuracy tends to fall as the number N of reference standards in cross-validation increases. A larger N makes the intersection set $I_N$ of the reference standards smaller, so the intersection of any particular algorithm's set $I_A$ with $I_N$ has fewer elements, which lowers accuracy. The results further show that the algorithms are best distinguished by accuracy, and the experiment works best, when N = 3 or 4. If N is set too low (N = 2), the intersection set $I_N$ of the reference standards has too many elements, so the intersections of MultiRank1, MultiRank2, and MultiRank3 with the reference set $I_N$ are essentially the same, and the accuracies of these three algorithms are hard to tell apart. If N is set too high (N = 5, 6, 7), the intersection set $I_N$ has too few elements, so the intersections of the algorithms with the reference set $I_N$ are again essentially the same, and the accuracies are again hard to tell apart.
Next, for N = 2, 3, 4, 5, 6, and 7, the distribution of the average accuracy over the 10 topic categories for Top-10, 20, 50, 100, 200, and 500 is shown in Fig. 6.
The experimental results in Fig. 6 likewise show that the MultiRank algorithm proposed by the invention achieves higher accuracy in all six groups of Top-K influential-individual mining, and that the algorithms are best distinguished by accuracy, and the experiment works best, when N = 3 or 4. The results also show that accuracy rises as K increases: as the number of influential individuals grows, the number of elements shared among the algorithms in cross-validation grows, which raises accuracy in Top-K influential-individual mining.
The preferred embodiments of the invention have been described above. It should be noted that although the preferred embodiments are described in detail for a microblogging application, those skilled in the art will appreciate that the method described herein can be applied to applications other than microblogging, and that users may publish other kinds of posts, such as text, video, audio, and any combination thereof, not only microblog posts. In addition, the specific algorithms, formulas, parameter settings, and so on mentioned in the detailed description of this application are given only as examples and do not limit the invention. Those skilled in the art who understand the design concept and spirit of the invention can make suitable modifications and substitutions to the above algorithms, formulas, parameters, and so on, and such modifications and substitutions still fall within the scope of protection of this application.

Claims (10)

1. A method for determining whether a replication relation exists between posts of network users, the method comprising:
obtaining the time probability distribution obeyed by the time interval between two posts that have an explicit forwarding relation;
inferring, based on that time probability distribution, the time probability distribution obeyed by the time interval between two posts that have a replication relation;
setting, based on the inferred time probability distribution, the range that the time interval between two posts having a replication relation should satisfy;
calculating the similarity of any two posts whose time interval falls within that range; and
determining, based on the similarity, whether a replication relation exists between the two posts.
2. The method according to claim 1, wherein inferring, based on the above time probability distribution, the time probability distribution obeyed by the time interval between two posts that have a replication relation comprises: inferring that the time probability distribution obeyed by the time interval between two posts that have a replication relation is identical to the time probability distribution obeyed by the time interval between two posts that have an explicit forwarding relation.
3. A method for mining topic-influential individuals based on multi-relational networks, comprising:
extracting the forwarding relations between users to construct a forwarding relation network, and calculating the transition probability that one user randomly forwards the posts of another user in the forwarding relation network;
extracting the reply relations between users to construct a reply relation network, and calculating the transition probability that one user randomly replies to the posts of another user in the reply relation network;
extracting the replication relations between users, according to the posts determined by the method of claim 1 or 2 to have a replication relation, to construct a replication relation network, and calculating the transition probability that one user randomly copies the posts of another user in the replication relation network;
extracting the reading relations between users to construct a reading relation network, and calculating the transition probability that one user randomly reads the posts of another user in the reading relation network; and
combining the above transition probabilities in the above four networks to calculate the probability that any user is randomly accessed by other users.
4. The method according to claim 3, wherein the reading relations between users are extracted based on any one or more of: the similarity of the posting time-series patterns of the users, the number of posts, and the interest similarity between the users.
5. The method according to claim 3, wherein the transition probability that the one user randomly forwards the posts of another user in the forwarding relation network is calculated based on the number of times the one user forwards the posts of other users.
6. The method according to claim 3, wherein the transition probability that the one user randomly replies to the posts of another user in the reply relation network is calculated based on the number of times the one user replies to the posts of other users.
7. The method according to claim 3, wherein the transition probability that one user randomly copies the posts of another user in the replication relation network is calculated based on the time interval and the similarity between the posts of the two users.
8. The method according to claim 3, wherein the transition probability that one user randomly reads the posts of another user in the reading relation network is calculated based on any one or more of: the similarity of the posting time-series patterns of the two users, the number of posts, and the interest similarity between the two users.
9. The method according to claim 3, wherein the access comprises forwarding, replying, copying, and reading.
10. The method according to claim 3, wherein the post is text, video, audio, or any combination thereof published by a user.
CN201310071162.XA 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks Active CN103179198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310071162.XA CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201210432184.X 2012-11-02
CN201210432184 2012-11-02
CN201310071162.XA CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Publications (2)

Publication Number Publication Date
CN103179198A CN103179198A (en) 2013-06-26
CN103179198B true CN103179198B (en) 2016-01-20

Family

ID=48638816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310071162.XA Active CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Country Status (1)

Country Link
CN (1) CN103179198B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761292B (en) * 2014-01-16 2017-01-18 北京理工大学 User forward behavior based microblog reading probability calculation method
CN104134159B (en) * 2014-08-04 2017-10-24 中国科学院软件研究所 A kind of method that spread scope is maximized based on stochastic model information of forecasting
CN104346443B (en) * 2014-10-20 2018-08-03 北京国双科技有限公司 Network text processing method and processing device
CN104376083B (en) * 2014-11-18 2017-06-27 电子科技大学 It is a kind of that method is recommended based on concern relation and the figure of multi-user's behavior
CN105808664A (en) * 2016-02-29 2016-07-27 四川长虹电器股份有限公司 Forum user ranking method
CN108009933B (en) * 2016-10-27 2021-06-11 中国科学技术大学先进技术研究院 Graph centrality calculation method and device
CN109446171B (en) * 2017-08-30 2022-03-15 腾讯科技(深圳)有限公司 Data processing method and device
CN109271584B (en) * 2018-08-29 2022-02-15 杭州电子科技大学 Recommendation method based on improved PageRank and comprehensive influence
CN109800351A (en) * 2018-12-29 2019-05-24 常熟理工学院 High-impact usage mining method in microblogging specific topics
CN110851659B (en) * 2019-10-23 2021-06-29 清华大学 Student academic influence calculation method and system based on student thesis relationship network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2081126A2 (en) * 2008-01-21 2009-07-22 NEC Corporation Information processing system, information processing apparatus, information processing program and recording medium
CN102254025A (en) * 2011-07-28 2011-11-23 清华大学 Information memory retrieving method

Also Published As

Publication number Publication date
CN103179198A (en) 2013-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant