CN103179198B

CN103179198B - Method for mining topic-level influential individuals based on multi-relational networks

Info

Publication number: CN103179198B
Application number: CN201310071162.XA
Authority: CN
Inventors: 丁兆云; 贾焰; 杨树强; 周斌; 韩伟红; 李爱平; 韩毅; 李莎莎
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2012-11-02
Filing date: 2013-03-06
Publication date: 2016-01-20
Anticipated expiration: 2033-03-06
Also published as: CN103179198A

Abstract

A method for determining whether a copy relation exists between posts of network users, the method comprising: obtaining the probability distribution followed by the time interval between two posts that have an explicit forwarding relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range that the time interval between two posts having a copy relation should satisfy; computing the similarity of any two posts whose time interval falls within that range; and determining, based on the similarity, whether a copy relation exists between the two posts.

Description

Method for mining topic-level influential individuals based on multi-relational networks

Technical field

The present invention relates to the technical field of network data mining, and in particular to techniques for mining topic-level influential individuals in microblogs.

Background art

Twitter-like microblogging services have recently developed rapidly as a new communication medium. According to the 29th China Internet statistical report, by the end of December 2011 the number of microblog users in China had reached 250 million, an increase of 296.0% over the end of the previous year, and 48.7% of Internet users used microblogs. Unlike Facebook-like social networking services, social relations in a microblogging service are unidirectional: a user can follow other users without their permission. In Twitter, for example, the social network is formed by follow (following) relations. The users whom a user follows are that user's friends, and those who follow a user are that user's followers. All tweets published by a user appear on the public timeline, and all of that user's messages are displayed on the timelines of all of that user's followers.

As microblogging services become widespread, large numbers of users take part in topic discussions, so that microblogging services produce a large volume of information covering many topics every day. This volume of information buries the influential individuals of each topic; mining the influential individuals of each topic from it is therefore a challenging task.

Recently, researchers have proposed topic-level methods for mining influential individuals from Twitter data.

In Reference 1 (Weng J, Lim E P, Jiang J, et al. TwitterRank: Finding topic-sensitive influential twitterers [C] // Proc of the 3rd ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2010: 261-270), Weng et al. propose the TwitterRank method to measure the influence of Twitter users on each topic. TwitterRank outperforms PageRank and Topic-sensitive PageRank to some extent, but the transition probability of its random walk considers only the number of tweets and topic similarity, ignoring related factors such as forwarding and replying.

In Reference 2 (Pal A, Counts S. Identifying topical authorities in microblogs [C] // Proc of the 4th ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2011: 45-54), Pal et al. consider multiple attributes of Twitter users in order to identify the authoritative users of each topic, but ignore the link structure of the multi-relational network, which makes it difficult to characterize a user's relative influence in the network as a whole.

Therefore, an improved technique for mining topic-level influential individuals is needed in the art.

Summary of the invention

One object of the present invention is to provide a method for determining whether a copy relation exists between posts of network users. A further object of the present invention is to provide an improved method for mining topic-level influential individuals, to remedy the incomplete measurement of conventional methods for evaluating individual influence.

To achieve these objects, one aspect of the present invention provides a method for determining whether a copy relation exists between posts of network users, the method comprising: obtaining the probability distribution followed by the time interval between two posts that have an explicit forwarding relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range that the time interval between two posts having a copy relation should satisfy; computing the similarity of any two posts whose time interval falls within that range; and determining, based on the similarity, whether a copy relation exists between the two posts.

Preferably, the probability distribution followed by the time interval between two posts having a copy relation is inferred to be identical to the probability distribution followed by the time interval between two posts having an explicit forwarding relation.

Another aspect of the present invention provides a method for mining topic-level influential individuals based on multi-relational networks, comprising: extracting the forwarding relations between users to construct a forwarding relation network, and computing the transition probability that a user in the forwarding relation network randomly forwards a post of another user; extracting the reply relations between users to construct a reply relation network, and computing the transition probability that a user in the reply relation network randomly replies to a post of another user; extracting, from the posts determined by the above method to have a copy relation, the copy relations between users to construct a copy relation network, and computing the transition probability that a user in the copy relation network randomly copies a post of another user; extracting the reading relations between users to construct a reading relation network, and computing the transition probability that a user in the reading relation network randomly reads a post of another user; and combining the above transition probabilities to compute the probability that any given user is randomly accessed by other users.

Preferably, the reading relations between users are extracted based on any one or more of: the similarity of the users' posting time-series patterns, the number of posts, and the interest similarity between users.

Preferably, the transition probability that a user in the forwarding relation network randomly forwards a post of another user is computed based on the number of times that user forwards posts of other users.

Preferably, the transition probability that a user in the reply relation network randomly replies to a post of another user is computed based on the number of times that user replies to posts of other users.

Preferably, the transition probability that a user in the copy relation network randomly copies a post of another user is computed based on the time interval and the similarity between the posts of the two users.

Preferably, the transition probability that a user in the reading relation network randomly reads a post of another user is computed based on any one or more of: the similarity of the two users' posting time-series patterns, the number of posts, and the interest similarity between the two users.

Preferably, the access comprises forwarding, replying, copying and reading.

Preferably, a post is text, video, audio, or any combination thereof published by a user.

With the technical solution of the present invention, whether a copy relation exists between posts of network users can be determined efficiently. Furthermore, by using several different relation networks, the various interaction characteristics between network users can be considered comprehensively, so that influential individuals in the network can be found more accurately.

Brief description of the drawings

The present invention is described in detail with reference to the accompanying drawings. It should be understood that the drawings and the corresponding description are illustrative and not restrictive, in which:

Fig. 1 schematically shows multi-relational networks that may exist among several microblog users;

Fig. 2 shows an example distribution of the time intervals of users' forwarding operations;

Fig. 3 shows the exponential distribution of the time interval between two posts having a copy relation;

Fig. 4 schematically shows the daily posting time-series patterns of 3 users;

Fig. 5 shows the accuracy of each algorithm on each topic; and

Fig. 6 shows the average accuracy of each algorithm over all topics.

Detailed description of the embodiments

Preferred embodiments of the present invention are described in detail below with reference to the drawings, taking a microblog application as an example.

To measure a user's influence at the topic level comprehensively, the present invention considers several types of network relation in the microblog. As shown in Fig. 1(a), for example, the influence of user B on user A takes 4 relation types: 1) user A uses a label such as "RT @B" or "via @B" in his or her own post, forwarding a post of user B; 2) user A uses a label such as "@B" in his or her own post, replying to a post of user B; 3) user A copies a post of user B without using an explicit forwarding label such as "RT @B" or "via @B"; 4) user A reads a post of user B. Fig. 1(a) shows 4 directed edges (A, B) of different types between users A and B, representing these four relation types respectively. Fig. 1(b) shows another example of a multi-relational network, made up of 3 users and directed edges of the above 4 types. The influence network between microblog users is thus a multi-relational network. According to the four relation types above, this multi-relational network can be decomposed into 4 relation networks of different types: the forwarding network, the reply network, the copy network and the reading network.
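The two explicit relation types can be read directly off the post text. The following is a minimal sketch of that step, assuming the labels take the common `RT @B`, `via @B` and `@B` forms; the regular expressions and the function name `explicit_relation` are illustrative assumptions, not part of the described method, and any other mention is simply treated as a reply.

```python
import re

# Assumed label formats: "RT @user" or "via @user" marks a forward,
# any other "@user" mention is treated as a reply (a simplification).
RETWEET_RE = re.compile(r"\b(?:RT|via)\s*@(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def explicit_relation(text):
    """Return ("retweet", user) or ("reply", user), or None if no label is found."""
    match = RETWEET_RE.search(text)
    if match:
        return ("retweet", match.group(1))
    match = MENTION_RE.search(text)
    if match:
        return ("reply", match.group(1))
    return None  # copy and reading relations carry no label and must be inferred
```

Posts for which the function returns `None` are the candidates for the implicit copy and reading relations, which have to be inferred by other means.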

To mine influential individuals in a microblog, the forwarding, reply, copy and reading networks described above are first extracted; these networks can then be merged to analyze or compute individual influence. This mining process is based on microblog-related data (for example, user information, posts published by users, and users' forwarding and reply records). Those skilled in the art will understand that such data can be collected in various existing ways, which are not the focus of the present invention. To avoid obscuring the present invention, the process of obtaining microblog-related data is therefore not described further.

The above is the basic procedure of the method of the present invention. In the detailed description that follows, the relevant definitions are explained first.

Related definitions

Let C be the set of all posts and V the set of all users. Given k topics (k a positive integer), C^{i} and V^{i} denote the set of all posts and the set of users in the i-th (0 < i ≤ k) topic space, respectively.

The follow network in the microblog is defined as a directed unweighted graph G_{f} = (V, E_{f}), where V is the node set of the graph (i.e., the user set) and E_{f} is the set of directed edges between nodes in G_{f}, representing the follow relations between users. The subscript f has no physical meaning; it only marks the network or relation as the follow (following) network or relation.

The multi-relational network in the i-th (0 < i ≤ k) topic space is defined as the multi-relational graph

G^{i} = (V^{i}, E_{Retweet}^{i} \cup E_{Reply}^{i} \cup E_{Copy}^{i} \cup E_{Read}^{i}),

where V^{i} is the set of users in the i-th topic space, and E_{Retweet}^{i}, E_{Reply}^{i}, E_{Copy}^{i} and E_{Read}^{i} are all sets of directed edges, representing respectively the forwarding, reply, copy and reading relations in the i-th topic space. This multi-relational graph can be decomposed into graphs of 4 relation types: the forwarding (Retweet) network graph, the reply (Reply) network graph, the copy (Copy) network graph and the reading (Read) network graph.

Specifically, G_{a}^{i} = (V_{a}^{i}, E_{Retweet}^{i}, w_{a}) denotes the weighted directed forwarding network graph, where V_{a}^{i} is the set of users involved in forwarding relations in the i-th topic space; E_{Retweet}^{i} is the set of directed edges representing the forwarding relations in the i-th topic space; and w_{a} is the weight of a forwarding edge, which can be, for example, the number of forwards between two users in the i-th topic space.

Likewise, G_{b}^{i} = (V_{b}^{i}, E_{Reply}^{i}, w_{b}) denotes the weighted directed reply network graph, where V_{b}^{i} is the set of users involved in reply relations in the i-th topic space; E_{Reply}^{i} is the set of directed edges representing the reply relations in the i-th topic space; and w_{b} is the weight of a reply edge, which can be, for example, the number of replies between two users in the i-th topic space.

G_{c}^{i} = (V_{c}^{i}, E_{Copy}^{i}, w_{c}) denotes the weighted directed copy network graph, where V_{c}^{i} is the set of users involved in copy relations in the i-th topic space; E_{Copy}^{i} is the set of directed edges representing the copy relations between users in the i-th topic space; and w_{c} is the weight of a copy edge (introduced below).

G_{d}^{i} = (V_{d}^{i}, E_{Read}^{i}, w_{d}) denotes the weighted directed reading network graph, where V_{d}^{i} is the set of users involved in reading relations in the i-th topic space; E_{Read}^{i} is the set of directed edges representing the reading relations between users in the i-th topic space; and w_{d} is the weight of a reading edge (introduced below).

The subscripts a, b, c and d used here only identify or distinguish the networks and their user sets, and have no physical meaning of their own.

The transition probability computation for each network is illustrated below by way of preferred embodiments.

Transition probability computation for the forwarding network

The random walk on the forwarding network graph is constructed as follows: in the i-th topic space, user u_{s}^{i} is influenced by a friend u_{t}^{i} and forwards that friend's post with a certain transition probability. The random walk on the forwarding network graph thus simulates users' forwarding behavior in the microblog. Let P_{a}^{i} be the transition probability matrix of the forwarding network in the i-th topic space; the transition probability between users is then defined as follows.

Definition 1. In the forwarding network in the i-th topic space, the transition probability that user u_{s}^{i} randomly forwards a post of user u_{t}^{i} is defined as:

P_{a}^{i} (u_{t}^{i} | u_{s}^{i}) = \frac{w_{a} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{a} (u_{s}^{i}, u^{i})},

where w_{a}(u_{s}^{i}, u_{t}^{i}) is the number of times user u_{s}^{i} forwards posts of user u_{t}^{i} in the i-th topic space; the denominator is the number of times user u_{s}^{i} forwards posts of all of its friends in the i-th topic space; and out(u_{s}^{i}) is the set of end points of all directed edges in the follow network that start at u_{s}^{i}, i.e., the friend set of u_{s}^{i}.
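Definition 1 amounts to normalizing each user's outgoing edge weights so that they sum to one. The sketch below illustrates this under stated assumptions: the dictionary layout, the function name `transition_probabilities` and the sample counts are hypothetical, and every forwarded user is assumed to be a friend of the forwarding user, so that summing over the recorded edges equals summing over out(u_s).

```python
from collections import defaultdict

def transition_probabilities(weights):
    """Row-normalize edge weights w(u_s, u_t) into P(u_t | u_s).

    `weights` maps (source_user, target_user) to a non-negative weight,
    e.g. how often source_user forwarded target_user within one topic.
    """
    totals = defaultdict(float)
    for (src, _dst), weight in weights.items():
        totals[src] += weight

    probs = {}
    for (src, dst), weight in weights.items():
        if totals[src] > 0:  # users without outgoing weight get no entries
            probs[(src, dst)] = weight / totals[src]
    return probs

# Hypothetical counts: A forwarded B three times and C once; B forwarded C twice.
retweet_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
P_a = transition_probabilities(retweet_counts)
```

The same normalization applies unchanged to the reply weights w_b and the copy weights w_c, since the three definitions differ only in how the edge weight is obtained.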

Transition probability computation for the reply network

The random walk on the reply network graph is constructed as follows: in the i-th topic space, user u_{s}^{i} is influenced by a friend u_{t}^{i} and replies to that friend's post with a certain transition probability. The random walk on the reply network graph thus simulates users' reply behavior in the microblog. Let P_{b}^{i} be the transition probability matrix of the reply network in the i-th topic space; the transition probability between users is then defined as follows.

Definition 2. In the reply network in the i-th topic space, the transition probability that user u_{s}^{i} randomly replies to a post of user u_{t}^{i} is defined as:

P_{b}^{i} (u_{t}^{i} | u_{s}^{i}) = \frac{w_{b} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{b} (u_{s}^{i}, u^{i})},

where w_{b}(u_{s}^{i}, u_{t}^{i}) is the number of times user u_{s}^{i} replies to posts of user u_{t}^{i} in the i-th topic space; the denominator is the number of times user u_{s}^{i} replies to posts of all of its friends in the i-th topic space; and out(u_{s}^{i}) is the set of end points of all directed edges in the follow network that start at u_{s}^{i}, i.e., the friend set of u_{s}^{i}.

Transition probability computation for the copy network

Since microblogs have no explicit "copy" relation label, the "copy" relation must first be inferred, so as to uncover the implicit relation edges, before the random walk on the copy network graph can be constructed.

One embodiment of the present invention considers both the time interval between two posts and their similarity. Two posts with a "copy" relation usually have highly similar content, so a simple way to infer the "copy" relation is to compute the similarity between posts and, if it exceeds a certain threshold, infer the source of the post. This simple method must compute the similarity between all posts of all of a user's friends, which is computationally expensive. To reduce this cost, one embodiment of the present invention considers not only the similarity between posts but also the time interval Δt between them.

"Copy" behavior in a microblog is, to some extent, a form of forwarding behavior in which no explicit forwarding label such as "RT @B" or "via @B" is used. Therefore, the probability distribution of the set T_{retweet} of time intervals Δt_{retweet} between posts with an explicit forwarding relation is used here to fit the probability distribution of the set T_{copy} of time intervals Δt_{copy} between posts with a "copy" relation. On this basis, the present invention proposes a method for determining whether a copy relation exists between posts of network users, which can comprise the following steps: obtaining the probability distribution followed by the time interval between two posts that have an explicit forwarding relation; inferring from that distribution the probability distribution followed by the time interval between two posts that have a copy relation; setting, based on the inferred distribution, the range that the time interval between two posts having a copy relation should satisfy; computing the similarity of any two posts whose time interval falls within that range; and determining, based on the similarity, whether a copy relation exists between the two posts.

Specifically, in one embodiment, a random sample of size 71000 (other sizes are also feasible) is first drawn from the data set, consisting of time intervals Δt_{retweet} between posts with an explicit forwarding relation and forming the set T_{retweet}, i.e., |T_{retweet}| = 71000. Fig. 2 shows the distribution of this data. As Fig. 2 shows, most time intervals are within a few hours, only a small portion have a larger span, and a few even exceed 10 days. To characterize the distribution of time intervals more finely, the long-tail points whose span exceeds 10 days can be removed. Removing these points is reasonable, because data analysis shows that users who forward after more than 10 days are usually spam users.

After the long-tail points whose span exceeds 10 days are removed, the sample size becomes 69770, i.e., |T'_{retweet}| = 69770. Statistics show that the time interval Δt roughly follows an exponential distribution. Given the sample set T'_{retweet}, the parameter of the exponential distribution is estimated as λ = 1.9768 × 10^{4} (λ enters the density below as the mean of the distribution), and the probability density function of the exponential distribution is:

f (x) = \begin{cases} \frac{1}{19768} e^{- x / 19768}, & x \geq 0, \\ 0, & x < 0 . \end{cases}

where e is the base of the natural logarithm.
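The text does not say how the parameter was estimated. One plausible route, sketched below, is the maximum-likelihood estimate, which for a density of the form (1/θ)e^{-x/θ} is simply the sample mean after the long tail is trimmed. The function names, the assumption that intervals are measured in seconds, and the sample values are illustrative only.

```python
import math
import statistics

TEN_DAYS_S = 10 * 24 * 3600  # long-tail cutoff from the text, assuming seconds

def fit_exponential_scale(intervals_s, cutoff_s=TEN_DAYS_S):
    """Trim intervals above the cutoff and return the sample mean.

    For the density (1/scale) * exp(-x/scale), the sample mean is the
    maximum-likelihood estimate of `scale` (an assumed estimation method).
    """
    kept = [dt for dt in intervals_s if 0 <= dt <= cutoff_s]
    if not kept:
        raise ValueError("no intervals left after trimming")
    return statistics.fmean(kept)

def exponential_pdf(x, scale):
    """Density f(x) = (1/scale) * exp(-x/scale) for x >= 0, and 0 otherwise."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0

# Hypothetical intervals in seconds; the last one exceeds 10 days and is dropped.
scale = fit_exponential_scale([100, 200, 300, 10**7])
```

With the 69770 trimmed intervals of the embodiment, this kind of estimate would play the role of the reported value 19768.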

Thus the set of time intervals between posts with an explicit forwarding relation follows an exponential distribution with parameter λ = 1.9768 × 10^{4}. Since "copy" behavior is, to some extent, a form of forwarding behavior, it can be inferred that the time interval between two posts with a "copy" relation also roughly follows an exponential distribution with parameter λ = 1.9768 × 10^{4}, whose distribution function (see Fig. 3) is:

F (x) = \begin{cases} 1 - e^{- x / 19768}, & x \geq 0, \\ 0, & x < 0 . \end{cases}

Based on the exponential distribution, an approximate range Δt_{range} can be given for the time interval between two posts with a "copy" relation. Weighing computational cost against precision, one embodiment sets this range, by way of example, to Δt_{range} = (0 ks, 1.08 × 10^{2} ks], where ks denotes 1000 seconds (so the upper bound is 108000 seconds, or 30 hours), the parenthesis "(" means that the end point is excluded and the bracket "]" means that it is included. According to the exponential distribution, the recall R of the inferred "copy" relation is:

R = F(1.08 × 10^{5}) - F(0) = 99.58%.
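The recall figure can be checked numerically from the distribution function. The sketch below assumes that the fitted parameter and the window bound are both expressed in seconds, which is consistent with the ks notation but not stated explicitly; the constant and function names are illustrative.

```python
import math

SCALE_S = 19768.0        # fitted parameter from the text, assumed to be in seconds
UPPER_S = 1.08e2 * 1000  # 1.08 x 10^2 ks expressed in seconds

def exponential_cdf(x, scale=SCALE_S):
    """Distribution function F(x) = 1 - exp(-x/scale) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0

# Probability mass of the interval (0, UPPER_S] under the fitted distribution.
recall = exponential_cdf(UPPER_S) - exponential_cdf(0.0)
print(f"{recall:.2%}")  # prints 99.58%
```

A wider window would raise the recall only marginally while adding more candidate pairs whose similarity must be computed, which is the trade-off between cost and precision mentioned above.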

If the time interval Δt between two posts does not fall within Δt_{range}, the probability that a "copy" relation exists is very low, so the similarity of the two posts need not be computed, which reduces the computational cost.

Thus, to infer that a "copy" relation exists between posts published by users who are friends, the following two conditions must both be satisfied:

1) sim(p_{t}, p_{s}) ≥ ξ;

2) Δt ∈ (0 ks, 1.08 × 10^{2} ks].

The first condition is that the similarity sim(p_{t}, p_{s}) of the two posts p_{t} and p_{s} must be greater than or equal to a certain threshold ξ. In one embodiment, the cosine of the angle between the post vectors is used to compute the similarity of two posts; other document similarity measures, such as the Kullback-Leibler (KL) distance, are equally applicable.

sim(p_{t}, p_{s}) = \cos(v_{t}, v_{s}),

where v_{t} and v_{s} are the vectors of the two posts, respectively.

The second condition is that the time interval between the two posts must lie within a certain threshold. The value 1.08 × 10^{2} ks used above is only an example; those skilled in the art will understand that any other suitable value can be used, weighing computational cost against precision.
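The two conditions can be combined into a single check that tests the cheap time-window condition first and computes the similarity only when it passes. This is a minimal sketch: the whitespace tokenization, the bag-of-words vectors, the threshold value 0.8 and all names are assumptions chosen for illustration, since the text leaves the vectorization and the threshold ξ open.

```python
import math
from collections import Counter

MAX_GAP_S = 108_000  # upper bound of the window from the text (1.08 x 10^2 ks)

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between bag-of-words count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_copy(source_post, candidate_post, threshold=0.8):
    """Check both conditions; posts are (timestamp_seconds, text) tuples."""
    gap = candidate_post[0] - source_post[0]
    if not (0 < gap <= MAX_GAP_S):
        return False  # outside the window: the similarity is never computed
    return cosine_similarity(source_post[1], candidate_post[1]) >= threshold
```

Checking the time window first is what saves work: most pairs of posts between friends fall outside it and are rejected without any text comparison.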

Two posts with a "copy" relation are defined as a 2-tuple <p_{t}, p_{s}>; all pairs of posts with a "copy" relation between two friends then form a set U of 2-tuples. From this the copy network graph can be inferred. It is a weighted directed graph in which the weight of the "copy" relation from user u_{s}^{i} to user u_{t}^{i} (that is, the "copy" relation produced when user u_{s}^{i} copies posts of user u_{t}^{i}) is defined as follows:

w_{c} (u_{s}^{i}, u_{t}^{i}) = \sum_{< p_{t}^{i}, p_{s}^{i} > \in U_{s, t}^{i}} sim (p_{s}^{i}, p_{t}^{i}) \times f (\Delta t_{p_{s}^{i}, p_{t}^{i}}) .

Here U_{s,t}^{i} is the set of 2-tuples of all post pairs, in the i-th topic space, that have a "copy" relation between users u_{s}^{i} and u_{t}^{i} and in which the copying is initiated by user u_{s}^{i}. The symbol f in the equation denotes a function that is usually chosen so that its value is higher the smaller Δt is. The weight therefore takes into account both the similarity and the time interval of the two posts: the higher the similarity between two posts and the smaller their time interval, the higher the probability that a "copy" relation exists.
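The weight can be sketched as a sum over the inferred post pairs. The text only requires f to decrease as the time interval grows; the exponential decay used here, which reuses the fitted parameter 19768 as its scale, is one assumed choice among many, and the function names are illustrative.

```python
import math

SCALE_S = 19768.0  # assumed decay scale, borrowed from the fitted distribution

def time_decay(gap_s, scale=SCALE_S):
    """One possible f: equals 1 at a zero gap and shrinks as the gap grows."""
    return math.exp(-gap_s / scale)

def copy_weight(pairs):
    """Sum sim * f(gap) over the copied post pairs between two users.

    `pairs` is a list of (similarity, gap_seconds) tuples, one per post pair
    in which the first user copied the second user.
    """
    return sum(similarity * time_decay(gap) for similarity, gap in pairs)
```

Because each pair contributes a positive term, a user who copies another user often, quickly and almost verbatim accumulates the largest edge weight.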

The random walk on the copy network graph is constructed as follows: in the i-th topic space, user u_{s}^{i} is influenced by a friend u_{t}^{i} and copies that friend's post with a certain transition probability. The random walk on the copy network graph thus simulates users' copying behavior in the microblog. Let P_{c}^{i} be the transition probability matrix of the copy network in the i-th topic space; the transition probability between users is then defined as follows.

Definition 3. In the copy network in the i-th topic space, the transition probability that user u_{s}^{i} randomly copies a post of user u_{t}^{i} is defined as:

P_{c}^{i} (u_{t}^{i} | u_{s}^{i}) = \frac{w_{c} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{c} (u_{s}^{i}, u^{i})},

where w_{c}(u_{s}^{i}, u_{t}^{i}) is the weight, in the i-th topic space, of the "copy" relation from user u_{s}^{i} to user u_{t}^{i} (that is, the "copy" relation produced when user u_{s}^{i} copies posts of user u_{t}^{i}); the denominator is the sum, in the i-th topic space, of the weights of the "copy" relations from user u_{s}^{i} to all of its friends.

Transition probability computation for the reading network

The more users read a post that a user publishes, the wider the reach of that post, to some extent. To construct the random walk on the reading network graph, the reading network must first be constructed.

A simple way to construct the reading network graph is to use the follow network G_{f} = (V, E_{f}) to build the relations between users and to use the number of posts a user publishes as the edge weight. Intuitively, in a given topic space, the more posts a user publishes and the more followers the user has, the wider the reach of the user's posts.

The TwitterRank algorithm extends this simple method by adding the topic similarity between friends; intuitively, a user is more likely to read posts of friends whose topics are similar to the user's own. The transition probability of the TwitterRank algorithm is:

P_{t} (i, j) = \frac{| \tau_{j} |}{\sum_{a : s_{i} \text{ follows } s_{a}} | \tau_{a} |} \times {sim}_{t} (i, j) .

Here |\tau_{j}| is the number of posts published by user u_{j} (written s_{j} in the summation condition); the denominator is the number of posts published by all friends of user u_{i}; and sim_{t}(i, j) is the similarity between the two users in topic space t.

In a microblog, all posts published by a user are actively pushed to the timelines of the user's followers, and a follower who logs in to his or her home page can usually read the information shown there. It can therefore be inferred that the smaller the interval between the time a user logs in and the time a friend posts, the more likely the user is to read that friend's posts. However, users' login times are difficult to obtain. In one embodiment of the present invention, therefore, each user's daily posting time-series pattern is computed from the posting statistics of a certain number of that user's posts, and it is assumed that the more similar the daily posting time-series patterns of two users, the higher the probability that a "reading" relation exists between them. Since posting-time statistics over a certain number of posts reflect, to some extent, when a user is logged in, a higher similarity between daily posting time-series patterns indicates, to some extent, a smaller interval between the user's login time and the friend's posting time, and hence a higher likelihood that the user reads the friend's posts. The assumption is therefore reasonable.

For example, Fig. 4 shows the daily posting time-series patterns of 3 users. Suppose user A follows both user B and user C. As the figure shows, the posting time-series patterns of users A and B are clearly more similar than those of users A and C, so user A is more likely to read posts of B: user B is more likely to publish posts while user A is online, and the microblogging service actively pushes those posts to A's home page, which makes user A more likely to read them.

In one embodiment, a user's posting behavior is measured by the probability that the user posts in each hour of a 24-hour day, and the posting time-series pattern is defined as follows.

Definition 4. For any user u, the 2-tuple <t, p> indicates that the probability that the user posts within hour t is p. The time series set {t_{0}, t_{1}, ..., t_{23}} (t_{0} < t_{1} < ... < t_{23}) denotes 24 discrete points, each with a duration of 1 hour. The user's posting time-series pattern is then defined as:

ts = <ts_{0} = <t_{0}, p_{0}>, ts_{1} = <t_{1}, p_{1}>, ..., ts_{23} = <t_{23}, p_{23}>>

The temporal regularity of N posts published by each user is counted, and the probability that the user posts within hour i is computed as p_{i} = N_{i} / N, where N_{i} is the number of posts, among all N of the user's posts, that were published within hour i of any day. That is, the more posts the user publishes within hour i, the higher the probability that the user is online in the microblog during that hour of the day.
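The posting pattern is a 24-bin histogram of posting hours, normalized by the number of posts. The sketch below assumes the ratio p_i = N_i / N, which matches the verbal description above; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def posting_pattern(timestamps):
    """Return [p_0, ..., p_23], where p_i is the share of posts made in hour i."""
    counts = [0] * 24
    for moment in timestamps:
        counts[moment.hour] += 1  # posts from different days share one bin
    total = len(timestamps)
    if total == 0:
        return [0.0] * 24
    return [count / total for count in counts]

# Hypothetical posts: three in the 09:00 hour on different days, one at 21:15.
posts = [datetime(2012, 11, 1, 9, 5), datetime(2012, 11, 1, 9, 40),
         datetime(2012, 11, 2, 21, 15), datetime(2012, 11, 3, 9, 0)]
pattern = posting_pattern(posts)
```

For a user with at least one post, the 24 values sum to one, so patterns of users with very different posting volumes remain comparable.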

The closer the posting probabilities of several users in each hour, the more similar their time-series patterns. In one embodiment, the Euclidean distance is used to measure time-series similarity.

Specifically, let Q and C denote two posting time series; q_{i} is the value of sequence Q at the i-th point; c_{i} is the value of sequence C at the i-th point; and i and n are the index of the current point in the sequence and the length of the sequence, respectively. The similarity of the two time series is then computed as:

simSeries (Q, C) = \frac{1}{\sqrt{\sum_{i = 1}^{n} {(q_{i} - c_{i})}^{2}}}, \quad | Q | \neq 0 \text{ and } | C | \neq 0,

where |Q| and |C| are the lengths of the two sequences (the two lengths are equal). The smaller the Euclidean distance, the higher the similarity.
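The similarity is the reciprocal of the Euclidean distance between the two posting patterns. The formula is undefined when the two patterns are identical, a case the text does not address; returning infinity there is an assumption of this sketch, as is the function name.

```python
import math

def sim_series(q, c):
    """Reciprocal Euclidean distance between two equal-length, non-empty series."""
    if not q or len(q) != len(c):
        raise ValueError("series must be non-empty and of equal length")
    distance = math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
    # Identical series have distance 0; infinity is an assumed convention here.
    return math.inf if distance == 0 else 1.0 / distance
```

In practice a small constant could be added to the distance instead, so that identical patterns get a large but finite similarity that can still be normalized.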

In one embodiment, the probability that a reading relation exists between users can be assumed to depend on the following 3 factors:

1) a user is more likely to read friends who publish many posts;

2) a user is more likely to read friends with high topic similarity;

3) a user is more likely to read friends with high posting time-series pattern similarity.

The probability that user u_{s} reads posts of friend u_{t} can therefore be defined as:

P_{read} (u_{s}, u_{t}) = \frac{τ_{t} \times sim (u_{s}, u_{t}) \times simSeries (u_{s}, u_{t})}{\sum_{u \in out (u_{s})} τ_{u} \times sim (u_{s}, u) \times simSeries (u_{s}, u)},

where τ is the number of blog posts published by the respective user in the data set (excluding posts that are forwards, replies, or copies); sim(u_s, u_t) is the topic similarity between user u_s and the friend u_t that u_s follows; simSeries(u_s, u_t) is the posting time-series similarity between u_s and u_t; and out(u_s) is the set of friends that u_s follows. The above embodiment takes all 3 factors into account; it will be appreciated that in some embodiments, for example to reduce computational complexity, only one or two of the factors may be considered.
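A minimal sketch of this normalization is given below. It assumes the three factors for each followed friend have already been computed; the function name `read_probabilities`, the input layout, and the sample numbers are hypothetical.

```python
def read_probabilities(friends):
    """Compute P_read(u_s, u_t) for every friend u_t followed by u_s.

    friends maps a friend id to a tuple (tau, sim, sim_series):
    post count, topic similarity, and posting time-series similarity.
    """
    scores = {f: tau * sim * series for f, (tau, sim, series) in friends.items()}
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no usable signal, every read probability is 0.
        return {f: 0.0 for f in friends}
    return {f: score / total for f, score in scores.items()}


# Hypothetical friends of one user u_s.
example = read_probabilities({"a": (10, 0.5, 2.0), "b": (30, 1.0, 1.0)})
```

In this example the unnormalized scores are 10 and 30, so the read probabilities are 0.25 and 0.75 and sum to 1 over out(u_s).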

In one embodiment of the invention, all blog posts of a user are treated as a single document for computing the above topic similarity sim(u_s, u_t), and the LDA model is likewise used to determine each user's topic distribution. Let t be the "user-topic" distribution vector of a user, i.e. t = (t^1, t^2, ..., t^k)^T, where the elements t^1, t^2, ..., t^k give the user's probability distribution over the topic spaces, k is the configured number of topics, and the user's topic category set is written {t^1, t^2, ..., t^k}.

User topic similarity is defined as the reciprocal of the KL distance between the users' topic category sets {t^1, t^2, ..., t^k}:

sim (u_{s}, u_{t}) = \frac{1}{KL (u_{s}, u_{t})}

= \frac{1}{\sum_{0 < i \leq k} t_{s}^{i} \lg \frac{t_{s}^{i}}{t_{t}^{i}}}, | u_{t} | \neq 0 and | u_{s} | \neq 0,

where |u_s| and |u_t| are the numbers of blog posts published by the respective users; t_s^i is the probability of user u_s in the i-th topic space and t_t^i is the probability of user u_t in the i-th topic space. The smaller the KL distance, the higher the topic similarity.
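The KL-based topic similarity can be sketched as follows, taking lg as the base-10 logarithm. The small smoothing constant `eps` (to avoid division by zero when a topic probability is 0) and the handling of a zero KL distance are assumptions of this sketch, not part of the original text; the function names are hypothetical.

```python
import math


def kl_distance(p, q, eps=1e-12):
    """KL distance sum_i p_i * lg(p_i / q_i), with assumed smoothing eps."""
    return sum(
        pi * math.log10((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def topic_similarity(p, q):
    """Reciprocal of the KL distance between two user-topic distributions."""
    kl = kl_distance(p, q)
    # Assumption: identical distributions are treated as infinitely similar.
    return math.inf if kl <= 0 else 1.0 / kl
```

Note that the KL distance is not symmetric, so sim(u_s, u_t) and sim(u_t, u_s) generally differ, which is consistent with the directed reading relation.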

If |u_t| = 0, i.e. the friend u_t followed by user u_s has published no blog posts in the data set, then the probability that u_s reads u_t is 0.

If |u_s| = 0, i.e. user u_s has published no blog posts in the data set, the probability that u_s reads friend u_t is defined as:

P_{read} (u_{s}, u_{t}) = \frac{τ_{t}}{\sum_{u \in out (u_{s})} τ_{u}}

That is, the probability of reading a friend depends only on the number of blog posts the friend has published: the more posts a friend publishes, the higher the probability of reading that friend.
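The fallback for a user who has published nothing can be sketched in a few lines. The function name `read_probabilities_no_posts` and the sample counts are hypothetical; a friend with zero posts automatically receives probability 0, which agrees with the |u_t| = 0 rule.

```python
def read_probabilities_no_posts(friend_post_counts):
    """P_read when the reader has no posts: proportional to each friend's post count."""
    total = sum(friend_post_counts.values())
    if total == 0:
        # Assumption: if no friend has posted, every read probability is 0.
        return {friend: 0.0 for friend in friend_post_counts}
    return {friend: count / total for friend, count in friend_post_counts.items()}


fallback = read_probabilities_no_posts({"a": 30, "b": 10, "c": 0})
```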

In one embodiment, for the set U^i of all users in the i-th topic space, a "reading" relation is inferred between a user u in U^i and a follower of u only if the following condition is satisfied:

t_{u}^{i} \times P_{read} \geq η

where t_u^i is the probability of user u in the i-th topic space and P_read is the probability that the follower of u reads u's blog posts. That is, the "reading" probability of two users in the i-th topic space must be greater than or equal to a threshold η.

The reading network graph is a weighted directed graph. In the i-th topic space, the weight of the reading relation from user u_s^i to user u_t^i (that is, the reading relation produced because u_s^i reads the blog posts of u_t^i) can be defined as follows:

w_{d} (u_{s}^{i}, u_{t}^{i}) = t_{t}^{i} \times P_{read} (u_{s}^{i}, u_{t}^{i})

where t_t^i is the probability of user u_t^i in the i-th topic space.

Let P_d^i be the transition probability matrix of the reading network graph in the i-th topic space; the transition probability between users is then defined as follows.

Definition 5. In the reading network of the i-th topic space, the transition probability that user u_s^i randomly reads the blog posts of user u_t^i is defined as:

P_{d}^{i} (u_{t}^{i} | u_{s}^{i}) = \frac{w_{d} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{d} (u_{s}^{i}, u^{i})},

where w_d(u_s^i, u_t^i) is the weight of the "reading" relation from user u_s^i to user u_t^i in the i-th topic space (that is, the reading relation produced because u_s^i reads the blog posts of u_t^i), and \sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i) is the sum of the weights of the "reading" relations from user u_s^i to all of its friends in the i-th topic space.
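The construction of the reading network for one topic, from threshold test to normalized transition probabilities, can be sketched as follows. The sketch assumes that the threshold η is applied to the same product t_t^i × P_read that is used as the edge weight; the function name `reading_transitions` and the data layout are hypothetical.

```python
def reading_transitions(topic_prob, p_read, eta):
    """Build P_d^i for one topic.

    topic_prob maps a user to t_u^i; p_read maps (reader, friend) to P_read.
    Returns reader -> {friend: transition probability}.
    """
    weights = {}
    for (reader, friend), prob in p_read.items():
        weight = topic_prob[friend] * prob  # w_d = t_t^i * P_read
        if weight >= eta:  # keep the edge only if it passes the threshold
            weights.setdefault(reader, {})[friend] = weight

    transitions = {}
    for reader, out_edges in weights.items():
        total = sum(out_edges.values())
        if total > 0:
            transitions[reader] = {f: w / total for f, w in out_edges.items()}
    return transitions


# Hypothetical topic probabilities and read probabilities for reader "a".
network = reading_transitions(
    {"b": 0.5, "c": 0.25, "d": 0.1},
    {("a", "b"): 0.4, ("a", "c"): 0.4, ("a", "d"): 0.2},
    eta=0.05,
)
```

Here the edge from "a" to "d" has weight 0.02 and is dropped, while the remaining weights 0.2 and 0.1 normalize to roughly 2/3 and 1/3.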

The embodiments for computing transition probabilities in the forwarding network, reply network, replication network and reading network have been described above in that order. In practice, however, the transition probabilities in these networks can be computed in any order, and one or more of the computations can be performed in parallel; the order described above is not limiting.

Method for merging multi-relational networks

To measure the influence of individual microblog users comprehensively, the transition probabilities in the above four networks can be considered together for each user to compute the user's influence rank score. This rank score reflects the probability that other users randomly access the user (that is, forward, reply to, copy, or read the user's blog posts), and thus reflects the user's topic influence.

The influence a user receives from friends manifests as a random walk within each of the above 4 networks, while the walk also jumps to another network with a certain probability. In one embodiment of the invention, the probabilities that a user stays in the forwarding network, reply network, replication network and reading network are defined as λ_1, λ_2, λ_3, λ_4 respectively, with λ_1 + λ_2 + λ_3 + λ_4 = 1. The user then jumps from the forwarding network to the other networks with probability 1 - λ_1, from the reply network with probability 1 - λ_2, from the replication network with probability 1 - λ_3, and from the reading network with probability 1 - λ_4. Taking the stay probability λ of each network into account and letting the transition probability matrix be B, the transition probabilities between users (for example, from user u_s^i to user u_t^i) in the 4 networks in the i-th topic space are defined respectively as follows:

1) forwarding network:

B_{a}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{1} \frac{w_{a} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{a} (u_{s}^{i}, u^{i})}

2) reply network:

B_{b}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{2} \frac{w_{b} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{b} (u_{s}^{i}, u^{i})}

3) replication network:

B_{c}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{3} \frac{w_{c} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{c} (u_{s}^{i}, u^{i})}

4) reading network:

B_{d}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{4} \frac{w_{d} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{d} (u_{s}^{i}, u^{i})}
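The four λ-scaled transition probabilities share one pattern, which can be sketched as below. The λ values, edge weights and user ids are hypothetical illustration data, and `scaled_transitions` is not a name from the original text.

```python
def scaled_transitions(weights, lam):
    """Return source -> {target: lam * w / sum of the source's outgoing weights}."""
    result = {}
    for source, out_edges in weights.items():
        total = sum(out_edges.values())
        if total > 0:
            result[source] = {t: lam * w / total for t, w in out_edges.items()}
    return result


# Hypothetical stay probabilities; they must sum to 1.
lambdas = {"forward": 0.4, "reply": 0.3, "copy": 0.1, "read": 0.2}

# Hypothetical edge weights out of user "s" in each of the four networks.
networks = {
    "forward": {"s": {"t": 3, "v": 1}},
    "reply": {"s": {"t": 1}},
    "copy": {"s": {"v": 2}},
    "read": {"s": {"t": 0.2, "v": 0.2}},
}

B = {name: scaled_transitions(w, lambdas[name]) for name, w in networks.items()}
```

Because each network's outgoing probabilities sum to its λ, the total probability mass leaving user "s" across all four networks is λ_1 + λ_2 + λ_3 + λ_4 = 1 whenever "s" has outgoing edges in every network.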

In another embodiment of the invention, following the PageRank algorithm, a user not only walks randomly along the network (for example, visits a friend's microblog directly via the friend link) but also jumps to other nodes at random with a certain probability β (for example, visits the microblog of a user who is not a friend by other means, such as manual input). This embodiment therefore considers both the jump probability β between nodes within a network and the stay probability λ of each network. Letting the transition probability matrix be B, the transition probabilities between users (for example, from user u_s^i to user u_t^i) in the 4 networks in the i-th topic space are defined respectively as follows:

1) forwarding network:

B_{a}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{1} \times (1 - β) \times \frac{w_{a} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{a} (u_{s}^{i}, u^{i})} + \frac{β}{n}

2) reply network:

B_{b}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{2} \times (1 - β) \times \frac{w_{b} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{b} (u_{s}^{i}, u^{i})} + \frac{β}{n}

3) replication network:

B_{c}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{3} \times (1 - β) \times \frac{w_{c} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{c} (u_{s}^{i}, u^{i})} + \frac{β}{n}

4) reading network:

B_{d}^{i} (u_{t}^{i} | u_{s}^{i}) = λ_{4} \times (1 - β) \times \frac{w_{d} (u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out (u_{s}^{i})} w_{d} (u_{s}^{i}, u^{i})} + \frac{β}{n}

In the above formulas, n is the number of nodes in the corresponding network.
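A single entry of the β-adjusted transition matrix can be computed as in the sketch below. The function name `teleport_transition` and the numeric values are hypothetical; treating a source with no outgoing weight as having a zero walk term is an assumption of this sketch.

```python
def teleport_transition(w_st, w_out_sum, lam, beta, n):
    """B = lam * (1 - beta) * w_st / w_out_sum + beta / n for one user pair."""
    if n <= 0:
        raise ValueError("n must be the positive number of nodes in the network")
    # Assumption: a source without outgoing weight contributes no walk term.
    walk = w_st / w_out_sum if w_out_sum > 0 else 0.0
    return lam * (1.0 - beta) * walk + beta / n


# Hypothetical values: weight 3 out of a total of 4, lambda 0.4, beta 0.15, 100 nodes.
entry = teleport_transition(3, 4, lam=0.4, beta=0.15, n=100)
```

With these values the walk term is 0.4 × 0.85 × 0.75 = 0.255 and the jump term is 0.15 / 100 = 0.0015. Setting β = 0 recovers the λ-only form of the preceding embodiment.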

Let r^i(u) be the rank score of user u in the i-th topic space. Considering the user's random walk across the 4 networks, the rank score of user u in the i-th topic space is defined in one embodiment as follows:

r^{i} (u) = \sum_{(u_{t}^{i}, u) \in E_{Retweet}^{i}} B_{a}^{i} (u | u_{t}^{i}) r^{i} (u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Reply}^{i}} B_{b}^{i} (u | u_{t}^{i}) r^{i} (u_{t}^{i})

+ \sum_{(u_{t}^{i}, u) \in E_{Copy}^{i}} B_{c}^{i} (u | u_{t}^{i}) r^{i} (u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Read}^{i}} B_{d}^{i} (u | u_{t}^{i}) r^{i} (u_{t}^{i}) .

That is, a user's rank score is determined mainly by the probability that followers randomly access the user.

Rearranging the above formula, the rank score of user u in the i-th topic space can be computed as:

r^{i} (u) = (\sum_{(u_{t}^{i}, u) \in E_{Retweet}^{i}} B_{a}^{i} (u | u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Reply}^{i}} B_{b}^{i} (u | u_{t}^{i})

+ \sum_{(u_{t}^{i}, u) \in E_{Copy}^{i}} B_{c}^{i} (u | u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Read}^{i}} B_{d}^{i} (u | u_{t}^{i})) \times r^{i} (u_{t}^{i}) .

Let M be the transition probability matrix of the random walk model that merges the multi-relational networks; the multi-relational network random walk iteration model is then:

π_{i}^{t + 1} = M π_{i}^{t}

where π_i^t is the vector of user rank scores in the i-th topic space at the t-th iteration.

As the above formula shows, the multi-relational network random walk iteration model is an ergodic Markov process. Therefore, given an initial vector (for example, every node in the network can initially be assigned the same rank score), the result converges gradually over n iterations. The iteration can be stopped once a given stop condition is met.
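The iteration can be sketched as a plain power iteration. The sketch assumes that M is column-stochastic (each column sums to 1, so entry M[r][c] is the probability of moving from node c to node r) and uses an L1 change below a tolerance as the stop condition; neither the tolerance nor the function name `random_walk_ranks` comes from the original text.

```python
def random_walk_ranks(M, tol=1e-10, max_iter=1000):
    """Iterate pi <- M * pi from a uniform start until the L1 change is below tol."""
    n = len(M)
    pi = [1.0 / n] * n  # every node starts with the same rank score
    for _ in range(max_iter):
        nxt = [sum(M[r][c] * pi[c] for c in range(n)) for r in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical 2-node chain; its stationary distribution is (5/6, 1/6).
ranks = random_walk_ranks([[0.9, 0.5], [0.1, 0.5]])
```

For large networks a sparse matrix representation would normally replace the nested lists used here.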

Experimental results

The effect of the method of the invention is illustrated below with reference to experimental results.

The experiment uses 261954 Chinese users of Twitter, of whom 103836 (39.6% of the Chinese users) published tweets in the 3 months from 2011-04-15 to 2011-07-15; 2660281 tweets were collected.

To verify the effectiveness of the method of the invention, the accuracy of influential-individual mining was evaluated. Because the truly influential individuals in Twitter are difficult to determine manually, the accuracy of each algorithm is determined by cross-validation across multiple algorithms. The experiment considers 6 influential-individual mining algorithms:

1) the social network formed by forwarding relations, in which the PageRank algorithm is used to find the influential individuals in the forwarding network; this algorithm is denoted RepostRank;

2) the social network formed by reply relations, in which the PageRank algorithm is used to find the influential individuals in the reply network; this algorithm is denoted ReplyRank;

3) the TwitterRank algorithm proposed by Weng et al.;

4) individual influence measured by the number of followers; this algorithm is denoted FollowerNum;

5) individual influence measured by the number of posts; this algorithm is denoted TweetNum;

6) the multi-relational network random walk model proposed by the present invention; this algorithm is denoted MultiRank (Figs. 5 and 6 show three different parameter settings of the MultiRank algorithm, named MultiRank1, MultiRank2 and MultiRank3 respectively, to further assess how different parameter settings affect the MultiRank algorithm).

The experiment uses a cross-validation method: results agreed on by multiple (N) algorithms are taken as the reference correct results. For example, given 4 algorithms A, B, C, D whose Top-K sets of high-influence individuals are I_A, I_B, I_C and I_D respectively, and assuming that results agreed on by 2 algorithms are taken as the reference correct results, the reference standard set of influential individuals is defined as:

I_2 = (I_A ∩ I_B) ∪ (I_A ∩ I_C) ∪ (I_A ∩ I_D) ∪ (I_B ∩ I_C) ∪ (I_B ∩ I_D) ∪ (I_C ∩ I_D)

The accuracy P (precision) reflects how genuine the influential individuals found in Twitter are; the accuracy of influential-individual discovery for algorithm A is defined as follows:

P_{A} = \frac{| I_{A} \cap I_{2} |}{| I_{A} |}
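The reference set and the accuracy of one algorithm can be sketched with set operations. The function names `reference_set` and `precision` and the sample Top-K sets are hypothetical illustration data, not experimental results.

```python
from itertools import combinations


def reference_set(top_k_sets, n_agree):
    """Union of the intersections of every group of n_agree algorithms' Top-K sets."""
    reference = set()
    for group in combinations(top_k_sets.values(), n_agree):
        reference |= set.intersection(*group)
    return reference


def precision(own_set, reference):
    """P_A = |I_A ∩ I_N| / |I_A|; defined as 0 for an empty I_A (assumption)."""
    return len(own_set & reference) / len(own_set) if own_set else 0.0


# Hypothetical Top-3 results of four algorithms.
top_k = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5, 6}, "D": {7, 8, 9}}
i_2 = reference_set(top_k, 2)
```

With these sets, only users 2 and 3 are returned by at least two algorithms, so I_2 = {2, 3}, algorithm A has accuracy 2/3, and algorithm D has accuracy 0.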

For each of the above algorithms, the experiment obtains the Top-10, 20, 50, 100, 200 and 500 influential individuals in each topic.

For N = 2, 3, 4, 5, 6, 7, Fig. 5 shows, for each topic category, the average accuracy of each influential-individual mining algorithm over Top-10, 20, 50, 100, 200 and 500; the ordinate represents accuracy and the abscissa represents the topics.

As the experimental results in Fig. 5 show, for N = 2, 3, 4, 5, 6, 7, the MultiRank algorithm proposed by the invention with the 2nd parameter setting (MultiRank2 in Fig. 5) achieves higher accuracy in all 10 topic categories. The parameters of MultiRank2 are chosen to balance the effect of network size on the influence rank score r^i(u), whereas the parameters of MultiRank1 and MultiRank3 are chosen so that, respectively, the reading-network rank score dominates r^i(u), and the forwarding-, reply- and replication-network rank scores dominate r^i(u). In the multi-relational network, MultiRank1 and MultiRank3 are each affected by network size, which lowers their accuracy relative to MultiRank2.

The experimental results also show that in each topic, accuracy tends to decline as the number N of reference standards in the cross-validation increases: a larger N leaves fewer elements in the intersection set I_N of the reference standards, so the intersection of a given algorithm's set I_A with I_N has fewer elements and accuracy drops. The results further show that when N = 3 or 4, the accuracy of the algorithms is most clearly differentiated and the experiment works best. If N is set too low (N = 2), I_N contains too many elements, so the intersections of the three algorithms MultiRank1, MultiRank2 and MultiRank3 with the reference set I_N are essentially the same and their accuracies are hard to tell apart. If N is set too high (N = 5, 6, 7), I_N contains too few elements, which likewise makes each algorithm's intersection with I_N essentially the same and leaves little differentiation in accuracy.

Next, for N = 2, 3, 4, 5, 6, 7, Fig. 6 shows the average accuracy over the 10 topic categories at Top-10, 20, 50, 100, 200 and 500.

As the experimental results in Fig. 6 show, the MultiRank algorithm proposed by the invention again achieves higher accuracy in all 6 groups of Top-K influential-individual mining, and again, when N = 3 or 4, the accuracy of the algorithms is most clearly differentiated and the experiment works best. The results also show that accuracy rises as K increases: with more influential individuals, the number of elements on which multiple algorithms agree in cross-validation increases, so accuracy in Top-K influential-individual mining rises.

The preferred embodiments of the invention have been described above. It should be noted that although the preferred embodiments are described in detail for microblog applications, those skilled in the art will appreciate that the method described herein can be applied to applications other than microblogs, and that users may publish other kinds of posts, such as text, video, audio and any combination thereof, not only blog posts. In addition, the specific algorithms, formulas, parameter settings and the like mentioned in the detailed description of this application are given by way of example only and do not limit the invention. Those skilled in the art, understanding the design concept and spirit of the invention, may make suitable variations and substitutions to the above algorithms, formulas, parameters and the like, which still fall within the scope of protection of this application.

Claims

1. A method for determining whether a replication relation exists between posts of network users, the method comprising:

obtaining the time probability distribution obeyed by the time interval between two posts that have an explicit forwarding relation;

inferring, based on said time probability distribution, the time probability distribution obeyed by the time interval between two posts that have a replication relation;

setting, based on the inferred time probability distribution, the range that the time interval between two posts having a replication relation should satisfy;

for any two posts whose time interval falls within said range, calculating their similarity; and

determining, based on said similarity, whether a replication relation exists between the two posts.

2. The method according to claim 1, wherein said inferring, based on said time probability distribution, the time probability distribution obeyed by the time interval between two posts that have a replication relation comprises: inferring that the time probability distribution obeyed by the time interval between two posts having a replication relation is identical to the time probability distribution obeyed by the time interval between two posts having an explicit forwarding relation.

3. A topic-influence individual mining method based on multi-relational networks, comprising:

extracting forwarding relations between users to construct a forwarding relation network, and calculating the transition probability that one user randomly forwards the posts of another user in said forwarding relation network;

extracting reply relations between users to construct a reply relation network, and calculating the transition probability that one user randomly replies to the posts of another user in said reply relation network;

extracting replication relations between users, according to the posts determined to have a replication relation by the method of claim 1 or 2, to construct a replication relation network, and calculating the transition probability that one user randomly copies the posts of another user in said replication relation network;

extracting reading relations between users to construct a reading relation network, and calculating the transition probability that one user randomly reads the posts of another user in said reading relation network;

considering the above transition probabilities in the above four networks together to calculate the probability that any user is randomly accessed by other users.

4. The method according to claim 3, wherein the reading relations between users are extracted based on any one or more of: the similarity of posting time-series patterns between users, the number of posts, and the interest similarity between users.

5. The method according to claim 3, wherein the transition probability that said one user randomly forwards the posts of another user in said forwarding relation network is calculated based on the number of times said one user forwards the posts of other users.

6. The method according to claim 3, wherein the transition probability that said one user randomly replies to the posts of another user in said reply relation network is calculated based on the number of times said one user replies to the posts of other users.

7. The method according to claim 3, wherein the transition probability that one user randomly copies the posts of another user in said replication relation network is calculated based on the time interval and the similarity between the posts of the two users.

8. The method according to claim 3, wherein the transition probability that one user randomly reads the posts of another user in said reading relation network is calculated based on any one or more of: the similarity of posting time-series patterns between the two users, the number of posts, and the interest similarity between the two users.

9. The method according to claim 3, wherein said access comprises forwarding, replying, copying and reading.

10. The method according to claim 3, wherein said posts are text, video, audio, or any combination thereof published by users.