CN103530402A

CN103530402A - Method for identifying key microblog users based on improved PageRank

Info

Publication number: CN103530402A
Application number: CN201310502924.7A
Authority: CN
Inventors: 程工; 刘春阳; 张旭; 庞琳; 吴俊杰; 韩洋; 刘洪甫; 韩小汀
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2013-10-23
Filing date: 2013-10-23
Publication date: 2014-01-22

Abstract

The invention discloses a method for identifying key microblog users based on an improved PageRank. The method comprises the following steps: microblog information data comprising n microblogs are input; word segmentation is performed on the texts of the n microblogs; an inverted index structure is built from the segmentation result so that microblogs can be retrieved by specified keywords; the forwarding hierarchy information of the retrieved relevant microblogs is extracted and a weighted directed graph, the forwarding network G, is built; the forwarding network G is divided into several maximal connected subgraphs G_i; the PageRank algorithm is applied to each sub-network G_i using parallel computing; the results computed for the sub-networks are merged to produce the ranking of the whole network G; and the top m users in the ranking are selected and output as the key users. By adopting parallel computing, the method ranks the dynamic forwarding network of a microblog platform in a big data environment, thereby identifying the key users in the information propagation process, and it can be applied to fields such as network public opinion analysis.

Description

A method for identifying key microblog users based on improved PageRank

Technical field

The present invention relates to methods for identifying key microblog users, and in particular to a method for identifying key microblog users based on an improved PageRank. It belongs to the fields of complex networks and data mining and is intended specifically for the analysis of massive microblog data.

Background

Key users on a microblog platform are users who play an important role in the propagation and diffusion of information. In the formation of mass media effects, key users act as important intermediaries or filters: they diffuse information to the audience and give rise to cascades of information propagation. Identifying key users therefore plays an important part in information discovery and propagation analysis, and offers significant guidance for work such as Internet public opinion analysis. However, content-based identification of key users is often not accurate enough because microblogs are short texts. Structure-based identification assumes that key users are the important nodes in the network formed by users and considers only the static follow/follower relationship network; yet because of factors such as topical relevance and user activity, many important nodes do not have the characteristics of key users.

In view of this, the present invention combines features of both content and network structure: it uses the forwarding information on the microblog platform to construct a topic-related dynamic network and converts key user identification into a PageRank-based ranking problem on a directed graph. In addition, because a microblog platform holds massive data, the dynamically generated network is often huge and results are needed promptly, so the time and space consumed by the traditional power iteration algorithm make it difficult to meet these demands. Parallel computing decomposes a large-scale computation across a number of scalable computing nodes, which reduces the time and space complexity of the algorithm and the computing power required of a single node, and is particularly suitable for deployment and application on cloud computing platforms. The present invention therefore uses MapReduce to compute PageRank in parallel and mines microblog short-text information in a big data environment, thereby identifying the key users in the information propagation process.

Summary of the invention

The present invention proposes a method for identifying key users on a microblog platform based on an improved PageRank. With PageRank at its core, the method extracts forwarding information from the structure of microblog texts, builds the network formed by forwarding relations, and then uses PageRank to obtain key user identification results of high robustness and high quality. Using forwarding relations yields a higher-quality dynamic forwarding network relevant to the query, which to some extent overcomes the suboptimal results caused by short texts, relevance, and static network structure; computing PageRank separately on several dynamic forwarding networks and then combining the results improves computational efficiency and the robustness of the algorithm. The method for identifying key microblog users based on an improved PageRank comprises the following steps:

Step 1: Use web crawlers and the microblog API to collect microblog data from the Internet, mainly comprising the author information, text, original-author information, and time information of each microblog.

Step 2: Perform word segmentation on the text data in the microblog data.

Step 3: Remove stop words and extract word stems.

Step 4: Build an inverted index structure so that microblogs can be retrieved quickly by keyword.

Step 5: Construct the dynamic forwarding network. For each forwarded microblog, extract its forwarding hierarchy information, as shown in Fig. 2, covering the original author, the posting author, and the users along the propagation path, and build a directed network from it. Nodes are users and edges are forwarding relations. Note that the direction of an edge is opposite to the direction of information flow: an edge points from an author to the author whose microblog he or she forwarded, and the weight of the edge is the number of times that forwarding occurred. The resulting data structure is a weighted directed graph.

Step 6: Decompose the forwarding network. Divide the forwarding network G into several maximal connected subgraphs, and assign each sub-network $G_i$ a weight $W_i = V(G_i)/V(G)$, where $V(G)$ denotes the number of nodes in graph G.

Step 7: Parallel iteration. Solve for the PageRank (pr) value of each node in each sub-network.

(1) Initial stage: assign each node v of each sub-network $G_i$ the initial PageRank value $pr_i^{0}(v) = 1/V(G_i)$.

(2) Iteration stage: use MapReduce parallel computing to calculate the PageRank value of each node in the sub-network. The formula is $pr_i^{k} = \left((1-d)E/n + dA_i^{T}\right) pr_i^{k-1}$, where k is the current iteration; d is the damping factor, generally set to 0.85; n is the number of nodes in the network; E is the all-ones vector; and $A_i$ is the adjacency matrix of sub-network $G_i$. The iteration terminates when k reaches a specified threshold or when the difference between the results of two successive iterations is less than a specified threshold, that is, $|pr_i^{k} - pr_i^{k-1}| < \varepsilon$.

Step 8: Fuse the ranking results. Multiply the ranking result $pr_i$ of each sub-network $G_i$ by its weight $W_i$, then merge the results of all sub-networks to obtain the final ranking result pr of G.

Step 9: Sort and output the results. Sort pr by value and output the list of users ranked by importance; the top m users are the key users.

The present invention builds the user network from dynamic forwarding relations and fuses the ranking results of several dynamic forwarding networks. This greatly improves the efficiency and quality of key user identification on microblog platforms in a big data environment, and the results are highly robust. The method overcomes the shortcomings of short texts and static network structure and can effectively improve the efficiency of Internet public opinion analysis.

Brief description of the drawings

Fig. 1 is the overall framework diagram of the present invention.

Fig. 2 is an example of forwarding information.

Fig. 3 is a diagram of the dynamic network.

Fig. 4 is a schematic diagram of MapReduce.

Embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

The present invention proposes a method for identifying key users on a microblog platform based on an improved PageRank. With PageRank at its core, the method uses MapReduce parallel computing to overcome the low computational efficiency encountered with microblog big data, extracts forwarding information from the structure of microblog texts, builds the network formed by forwarding relations, and then uses PageRank to obtain key user identification results of high robustness and high quality. Using forwarding relations yields a higher-quality dynamic forwarding network relevant to the query, which to some extent overcomes the suboptimal results caused by short texts, relevance, and static network structure; computing PageRank separately on several dynamic forwarding networks and then combining the results improves the robustness of the results.

The method is divided into three modules: a data collection and preprocessing module, a dynamic forwarding network construction module, and a PageRank computation and fusion module. The data collection and preprocessing module covers data collection, word segmentation, stop-word removal, stemming, and generation of the inverted index. The dynamic forwarding network construction module uses microblog information to extract forwarding relations and build the forwarding network. The PageRank computation and fusion module covers division of the forwarding network, PageRank computation, fusion of the computed results, and the final sorting and output. The specific implementation process is as follows:

1. Use web crawlers and the microblog API to collect microblog data from the Internet, mainly comprising the author information, text, original-author information, and time information of each microblog. Taking Sina Weibo as an example, the fields that may be needed when obtaining data through the API include:

Part of the data structure (in JSON format) of a forwarded microblog is as follows:

Note also that the retweeted_status item here is not necessarily the microblog that the user forwarded directly; it is the microblog at the source of the chain, the starting point of the information propagation, that is, user E in the forwarding information of Fig. 2.

2. Perform word segmentation on the text data in the microblog data.

3. Remove stop words and extract word stems.

4. Build an inverted index structure so that microblogs can be retrieved quickly by keyword.
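Steps 2 to 4 can be illustrated with a minimal sketch. It assumes a whitespace tokenizer in place of a real word segmenter and a tiny hypothetical stop-word list; the names `tokenize`, `build_inverted_index`, and `search` are introduced only for this illustration and are not part of the described method.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real deployment would load a full list.
STOP_WORDS = {"the", "a", "of", "is", "to"}

def tokenize(text):
    """Whitespace tokenizer standing in for a real word segmenter (assumption)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_inverted_index(microblogs):
    """Map each token to the set of microblog ids whose text contains it."""
    index = defaultdict(set)
    for blog_id, text in microblogs.items():
        for token in tokenize(text):
            index[token].add(blog_id)
    return index

def search(index, keyword):
    """Return the sorted ids of microblogs containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))

# Made-up sample data for illustration only.
microblogs = {
    1: "the flood warning is issued",
    2: "flood relief donations",
    3: "a sunny day",
}
index = build_inverted_index(microblogs)
print(search(index, "flood"))
```

Keeping a set of ids per token makes keyword lookup a single dictionary access, which is the property the inverted index is built for.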

5. Construct the dynamic forwarding network. For each forwarded microblog, extract its forwarding hierarchy information, covering the original author, the posting author, and the users along the propagation path, and build a directed network from it. Nodes are users and edges are forwarding relations. Note that the direction of an edge is opposite to the direction of information flow: an edge points from an author to the author whose microblog he or she forwarded, and the weight of the edge is the number of times that forwarding occurred. The resulting data structure is a weighted directed graph. The construction of the forwarding network is illustrated below using Fig. 3 as an example.

The figure contains 2 forwarded messages involving 5 users in total: A, B, C, D, and E. Each edge in the figure represents a forwarding relation; for example, A -> B indicates that A forwarded B's message, and the weight of this edge is the number of times A forwarded B's messages, which is 2 in this figure.
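The construction can be sketched as follows. The sketch assumes that each earlier forwarder appears in the text behind a `//@name:` marker, which is an assumption about the text format and is not stated in the source; the two sample records are invented and are not the actual content of Fig. 3, although they were chosen so that the edge A -> B also ends up with weight 2.

```python
import re
from collections import Counter

# Assumed marker format: "//@name:" precedes each earlier forwarder's comment.
FORWARD_MARK = re.compile(r"//@([^:：\s]+)[:：]")

def forwarding_chain(poster, text, original_author):
    """Return users ordered from the latest forwarder back to the original author."""
    return [poster] + FORWARD_MARK.findall(text) + [original_author]

def build_forwarding_graph(records):
    """Count edges (forwarder -> forwarded author); the count is the edge weight."""
    edges = Counter()
    for poster, text, original_author in records:
        chain = forwarding_chain(poster, text, original_author)
        for src, dst in zip(chain, chain[1:]):
            if src != dst:  # skip self-loops from repeated names
                edges[(src, dst)] += 1
    return edges

# Hypothetical records: (posting author, text, original author).
records = [
    ("A", "agree //@B: look at this //@C: wow", "E"),
    ("A", "again //@B: second post", "D"),
]
graph = build_forwarding_graph(records)
for (src, dst), weight in sorted(graph.items()):
    print(src, "->", dst, weight)
```

Edges point against the information flow, so a user whose posts are forwarded often accumulates incoming edges, which is what PageRank rewards.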

6. Decompose the forwarding network. Divide the forwarding network G into several maximal connected subgraphs, and assign each sub-network $G_i$ a weight $W_i = V(G_i)/V(G)$, where $V(G)$ denotes the number of vertices in graph G.
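A minimal sketch of the decomposition, assuming that connectivity is judged with edge direction ignored (the source does not say whether weak or strong connectivity is meant). The union-find helper and the sample edges are illustrative only.

```python
from collections import defaultdict

def connected_subgraphs(edges):
    """Group nodes into maximal connected subgraphs, ignoring edge direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in edges:
        parent[find(src)] = find(dst)  # union the two endpoints

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

def subgraph_weights(components):
    """W_i = V(G_i) / V(G): share of all nodes that fall in sub-network i."""
    total = sum(len(c) for c in components)
    return [len(c) / total for c in components]

# Hypothetical weighted edges: (forwarder, forwarded author) -> count.
edges = {("A", "B"): 2, ("B", "C"): 1, ("X", "Y"): 1}
components = connected_subgraphs(edges)
weights = subgraph_weights(components)
print(components, weights)
```

Because the sub-networks share no edges, PageRank can be run on each one independently, which is what makes the later parallel step possible.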

7. Parallel computation on the sub-networks.

(1) Initial stage: assign each node v of each sub-network $G_i$ the initial value $pr_i^{0}(v) = 1/V(G_i)$.

(2) Iteration stage: the basic PageRank formula is $pr_i^{k} = \left((1-d)E/n + dA_i^{T}\right) pr_i^{k-1}$, where k is the current iteration; d is the damping factor, generally set to 0.85; n is the number of nodes in the network; E is the all-ones vector; and $A_i$ is the adjacency matrix of sub-network $G_i$. The iteration terminates when k reaches a specified threshold or when the difference between the results of two successive iterations is less than a specified threshold, that is, $|pr_i^{k} - pr_i^{k-1}| < \varepsilon$.
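A single-machine sketch of this iteration on one sub-network is given below. It makes three assumptions that the source does not spell out: each node's rank is split over its out-edges in proportion to edge weight, n is the node count of the sub-network, and nodes without out-edges simply pass on nothing. The difference between iterations is measured with the L1 norm.

```python
def pagerank(nodes, edges, d=0.85, eps=1e-8, max_iter=100):
    """Power iteration on one sub-network; edges maps (src, dst) -> weight."""
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}          # initial value 1 / V(G_i)
    out_weight = {v: 0.0 for v in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w
    for _ in range(max_iter):                 # stop when k reaches its limit
        new_pr = {v: (1.0 - d) / n for v in nodes}
        for (src, dst), w in edges.items():
            # Assumption: rank is shared in proportion to edge weight.
            new_pr[dst] += d * pr[src] * w / out_weight[src]
        diff = sum(abs(new_pr[v] - pr[v]) for v in nodes)
        pr = new_pr
        if diff < eps:                        # |pr^k - pr^(k-1)| < epsilon
            break
    return pr

# Hypothetical sub-network: A and C both forwarded B.
nodes = ["A", "B", "C"]
edges = {("A", "B"): 2, ("C", "B"): 1}
pr = pagerank(nodes, edges)
print(pr)
```

In this toy network B receives every forward and ends with the highest value, while A and C settle at the baseline (1 - d)/n.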

The ordinary PageRank algorithm converges very slowly, and when the data scale is very large this greatly reduces the efficiency of the algorithm. Using the MapReduce computation model, the present invention converts the problem to be solved into several smaller problems that are computed in a distributed, parallel manner on different computing nodes, and finally merges and aggregates the results. The principle of computing PageRank in parallel with MapReduce is introduced below with reference to Fig. 3.

The core data structure of MapReduce is the key-value pair, written here as <Key, Value>, where Key denotes the key and Value denotes the value.

In the present invention, each node in the network is taken as a key, and the pr value of the node together with its out-neighbors (the nodes it links to) is taken as the value, that is, $Value_i = (pr_i, outlinks(v_i, G))$, where $outlinks(v, G)$ denotes the out-neighbors of node v in graph G, as shown in the input stage of Fig. 4. The whole dynamic forwarding network structure is thus converted into a set of rows, each row being one key-value pair, that is:

$$RowData = \begin{matrix} \langle Key_1, Value_1 \rangle \\ \vdots \\ \langle Key_i, Value_i \rangle \\ \vdots \\ \langle Key_n, Value_n \rangle \end{matrix}$$

The core computation of MapReduce can be understood as the following two steps:

(a) Map: one key-value pair is mapped, through the forwarding relations, to one or several other key-value pairs, as follows:

$$\langle Key_0, Value_0 \rangle \overset{M}{\Longrightarrow} \{ \langle Key_1, Value_1 \rangle, \langle Key_2, Value_2 \rangle, \langle Key_3, Value_3 \rangle, \ldots \}$$

where M is the mapping function.

(b) Reduce: the values of several key-value pairs that share the same key are merged, as follows:

$$\begin{matrix} \langle Key_1, Value_1 \rangle \\ \langle Key_1, Value_2 \rangle \\ \langle Key_1, Value_3 \rangle \end{matrix} \Longrightarrow \langle Key_1, R(Value_1, Value_2, Value_3) \rangle$$

where R is the reduce function.

Note that the result produced by the reduce step is itself a set of key-value pairs and can be used again as input to the map step, which makes it possible to implement iterative algorithms.

In this application, the map step takes each out-neighbor of a node in the network as a new key and the node's PageRank value divided by its total number of out-neighbors as the new value, producing new key-value pairs:

$$\langle Key_i, Value_i \rangle \Longrightarrow \langle Key_j, \; pr_i / \mathrm{count}(outlinks(v_i, G)) \rangle, \quad \forall v_j \in outlinks(v_i, G).$$

In the reduce step, the value for each key is computed from the sum of the PageRank shares carried by all links pointing to the node corresponding to that key:

$$\begin{matrix} \langle Key_i, Value_1 \rangle \\ \langle Key_i, Value_2 \rangle \\ \langle Key_i, Value_3 \rangle \end{matrix} \Longrightarrow \langle Key_i, \; (1 - d)/N + d \cdot \sum_{j} Value_j \rangle$$

where d is the damping factor of the PageRank algorithm, generally set to 0.85, and N is the total number of nodes in the network.
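Read together, the map and reduce steps amount to the following per-node update. This is a sketch in which $pr^{k}(v_i)$, the value of node $v_i$ after iteration k, is a notation assumed for this illustration:

$$pr^{k}(v_i) = \frac{1-d}{N} + d \sum_{v_j \,:\, v_i \in outlinks(v_j, G)} \frac{pr^{k-1}(v_j)}{\mathrm{count}(outlinks(v_j, G))}$$

Each term of the sum is one value emitted by the map step for the key $v_i$, and the reduce step adds the terms, scales them by d, and adds the constant $(1-d)/N$.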

The result of the last reduce step can in turn serve as the input to the next map step, so the iteration can continue until the termination condition is reached. Note that the process described above does not pass a node's out-neighbor information through the map and reduce steps; in practice, the map step must not only generate the new pr values of the nodes but also preserve the original structural relations.
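One map-reduce round can be simulated in memory as a sketch. It is not a Hadoop job; the in-memory shuffle, the `"links"`/`"share"` tags, and the function names are assumptions made for illustration. The map step emits each node's out-neighbor list alongside its rank shares so that the graph structure survives into the next round.

```python
from collections import defaultdict

def map_phase(node, value):
    """Emit the node's own structure plus one pr share per out-neighbor."""
    pr, outlinks = value
    yield node, ("links", outlinks)              # preserve graph structure
    for target in outlinks:
        yield target, ("share", pr / len(outlinks))

def reduce_phase(node, values, d, n):
    """Sum incoming shares and rebuild the (pr, outlinks) value for the node."""
    outlinks, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            total += payload
    return node, ((1.0 - d) / n + d * total, outlinks)

def mapreduce_iteration(rows, d=0.85):
    """Run one map, shuffle, and reduce round over rows {node: (pr, outlinks)}."""
    n = len(rows)
    shuffled = defaultdict(list)                 # in-memory stand-in for the shuffle
    for node, value in rows.items():
        for key, emitted in map_phase(node, value):
            shuffled[key].append(emitted)
    return dict(reduce_phase(k, vs, d, n) for k, vs in shuffled.items())

# Hypothetical input rows: A and C each link to B.
rows = {"A": (1 / 3, ["B"]), "B": (1 / 3, []), "C": (1 / 3, ["B"])}
rows = mapreduce_iteration(rows)
print(rows)
```

Since the output rows have the same shape as the input rows, `mapreduce_iteration` can be called repeatedly until the termination condition holds.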

(3) Fusion stage: multiply the ranking result $pr_i$ of each sub-network $G_i$ by its weight $W_i$, then merge the results of all sub-networks to obtain the final ranking result pr of G.

8. Sort and output the results. Sort pr by value to obtain the ranked list of nodes, that is, of key users; the larger the pr value, the more important the user.
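The fusion and output steps can be sketched as follows, assuming each sub-network result is a dictionary from user to pr value and that the sub-networks share no users (which holds for disjoint connected subgraphs). The sample values are invented.

```python
def fuse_rankings(sub_results, weights):
    """Scale each sub-network's pr values by its weight W_i and merge them."""
    fused = {}
    for pr, w in zip(sub_results, weights):
        for user, value in pr.items():
            fused[user] = value * w
    return fused

def top_key_users(fused, m):
    """Return the m users with the largest fused pr values, best first."""
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [user for user, _ in ranked[:m]]

# Hypothetical per-sub-network results and weights W_i = V(G_i) / V(G).
sub_results = [{"A": 0.2, "B": 0.5, "C": 0.3}, {"X": 0.4, "Y": 0.6}]
weights = [0.6, 0.4]
fused = fuse_rankings(sub_results, weights)
key_users = top_key_users(fused, 2)
print(key_users)
```

Weighting by sub-network size keeps a user who dominates a very small sub-network from outranking the leading users of a much larger one.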

The present invention proposes a method for identifying key users on a microblog platform based on an improved PageRank. The method combines content with the structural information of the dynamic network to rank users, and uses MapReduce parallel computing to speed up execution. It overcomes, to some extent, the suboptimal results caused by short texts, relevance, and static network structure; computing PageRank separately on several dynamic forwarding networks and then combining the results improves the robustness of the results.

Although the embodiments of the present invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments; it can be applied to any field suited to it, and those skilled in the art can readily implement further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, improvements and modifications may also be made, and these should likewise be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying key microblog users based on an improved PageRank, characterized in that it comprises the following steps:

Step 1: input microblog information data, comprising the users, texts, time information, etc. of n microblogs;

Step 2: perform word segmentation on the texts of the n microblogs according to a preset dictionary containing m words;

Step 3: build an inverted index structure from the segmentation result so that microblogs can be retrieved by specified keywords;

Step 4: extract the forwarding hierarchy information of the retrieved relevant microblogs and build a weighted directed forwarding network G;

Step 5: divide the forwarding network G into several maximal connected subgraphs $G_i$;

Step 6: apply the PageRank algorithm in each sub-network using parallel computing, obtain the ranking result of each sub-network, and then merge the results;

Step 7: output the ranking results.

2. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that the microblog information in step 1 is microblog information data within a preset time period captured from the Internet or through the microblog platform API.

3. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that word segmentation of the text of each microblog comprises the following steps:

1) remove the stop words in the text;

2) split the text into words according to the vocabulary in the preset dictionary, in the order in which the words appear in the text.

4. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 3 an inverted index from the word segmentation results to the microblogs is built, so that specified microblogs can be retrieved by keyword.

5. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 4 the forwarding hierarchy information is extracted from the microblog text information and a weighted directed graph is built.

6. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 5 the forwarding network is decomposed into several maximal connected subgraphs and each sub-network is assigned a weight.

7. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 6 the PageRank algorithm is applied in each sub-network.

8. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 6 the PageRank algorithm is implemented in each sub-network using parallel computing based on the MapReduce computation model.

9. The method for identifying key microblog users based on an improved PageRank according to claim 1, characterized in that in step 6 the results computed for the sub-networks are merged to form the ranking result of the whole forwarding network.