CN105045822A

CN105045822A - Method for monitoring similar users of specific user in microblog

Info

Publication number: CN105045822A
Application number: CN201510363990.XA
Authority: CN
Inventors: 仲兆满; 管燕
Original assignee: Huaihai Institute of Technology
Current assignee: Huaihai Institute of Technology
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2015-11-11

Abstract

The invention discloses a method for monitoring similar users of a specific user in a microblog. The method comprises the following steps: (1) obtaining the user set of an EEN (Extended Ego Network) according to an input specific user su, denoted EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the following set of su, FansCS(su) is the fan set of su, and VisitorCS(su) is the visitor set of su; and (2) finding a user set SimUser(su) ⊆ EEN(su) similar to su on the basis of the similarity of the users' followees, fans, and dynamic microblogs and the dynamic interaction correlation among the users. The method introduces visitor-type users, which increases the comprehensiveness and diversity of the similar users found, and introduces a dynamic division of time, so that the dynamics of the microblog are better reflected and the similar users found are more accurate.

Description

Similar user monitoring method for specific user in microblog

Technical Field

The invention relates to an information mining technology, in particular to a similar user monitoring method for a specific user in a microblog.

Background

Today, social media is considered one of the most valuable information resources on the Web. The microblog is one of many social media; owing to its strong propagation and convenient operation, large numbers of users form interaction circles on microblogs that resemble those of real society. In traditional media, users and topics form a two-mode network; the microblog introduces following and fan relationships, so users and topics form a multi-mode network. Because of the strong propagation of microblog information and the complexity of this network structure, microblogs have attracted great attention from academia and industry in recent years.

Similar users in a microblog are a group of users who share several common attributes on the microblog medium; the attributes mainly comprise the users' background, followees, fans, microblogs, interactions, and the like. User information on social media is generally divided into two categories: one is the user's background (e.g., location, education, occupation, interests) and published microblogs; the other is the social network built from following and fan relationships. Based on these two types of information, existing user similarity calculation methods can be roughly classified into three types: (1) methods based on the user's background and microblog information, abbreviated SUDByText; (2) social network methods based on followees and fans, abbreviated SUDBySN; (3) hybrid methods, which fuse SUDByText and SUDBySN, abbreviated SUDByTSN. Currently, SUDByTSN is the mainstream research approach.

A conference paper published in the United States in 2011 in the Proceedings of the 2011 Visual Information Communication - International Symposium, titled "SFViz: interest-based friends exploration and recommendation in social networks", by Gou L, You F, Guo J, Wu L, and Zhang XL, proposes calculating user similarity from users' social tags and the network topology, including the users' followees and fans, but does not make use of visitor-type users.

A journal article published in Germany in 2013 in User Modeling and User-Adapted Interaction, titled "Exploring social tagging for personalized community recommendations", by Kim HN and Saddik AE, starts from a user and finds communities of interest to that user based on social tags. A community's social tags are extracted from the tags of its members, including the members' interests, emotions, geographic locations, time, and the like.

A journal article published in China in 2014 in the Journal of Chinese Information Processing, titled "Microblog user recommendation based on learning to rank", uses four factors of a user when recommending microblog users (microblogs, personal information, interaction information, and social topology information) and concludes that the user's interaction information has the greatest influence on the performance of similar-user recommendation.

A journal article published in China in 2014 in the Chinese Journal of Computers, titled "Similarity measurement of microblog users and its application", by Xu Ximing, Li Dong, Liu Jiang, Li Sheng, Wang Gang, and Yuan Shulun, considers the user's background information, microblogs, social relations, and interaction information when measuring user similarity. With 50 users as seed nodes, one layer of related fans and followees was crawled, and social information was found to be the most valuable when calculating user similarity.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the problems and shortcomings of the prior art, a method for monitoring similar users on microblog media that increases the comprehensiveness and diversity of the similar users found and improves the accuracy of similar-user discovery.

The technical problem to be solved by the invention is achieved by the following technical scheme. The invention relates to a method for monitoring similar users of a specific user in a microblog, which is characterized by comprising the following steps of:

A. obtaining the user set of an extended ego network EEN (Extended Ego Network) according to an input specific user su, denoted EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the following set of su, FansCS(su) is the fan set of su, and VisitorCS(su) is the visitor set of su;

the method comprises the following specific steps:

A1, acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded, and commented microblogs;

A2, acquiring the following set FollowerCS(su) and the fan set FansCS(su) of su;

A3, extracting the visitor set VisitorCS(su) from the microblog set MB-su of su, and denoting the three types of users as EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su);

B. finding, from EEN(su), a user set SimUser(su) similar to su, based on the similarity of the users' followees, fans, and dynamic microblogs and on the dynamic interaction among users, with the following specific steps:

B1, for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within TimeSpan, the following set FollowerCS(user_i), and the fan set FansCS(user_i);

B2, calculating, for su and each user_i ∈ EEN(su), the dynamic microblog similarity, denoted MBSim(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, λ is an exponential decay parameter, and the microblogs of user_i in time slice T_j are vectorized as:

KW\text{-}user_{i}^{T_{j}} = \{ \langle kw_{1}\text{-}user_{i}^{T_{j}}, w_{1}\text{-}user_{i}^{T_{j}} \rangle, \langle kw_{2}\text{-}user_{i}^{T_{j}}, w_{2}\text{-}user_{i}^{T_{j}} \rangle, \ldots, \langle kw_{y}\text{-}user_{i}^{T_{j}}, w_{y}\text{-}user_{i}^{T_{j}} \rangle \},

wherein w_k-user_i^{T_j} is the weight of the feature term kw_k-user_i^{T_j}, calculated by the TF-IDF method, and the microblog similarity of the two users su and user_i within time slice T_j is calculated as the cosine of the angle between their vectors;

B3, calculating, for su and each user_i ∈ EEN(su), the dynamic interaction correlation, denoted RC(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, and λ is an exponential decay parameter; within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions, denoted RC(su^{T_j}, user_i^{T_j}); the maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference:

RC(su^{T_{j}}, user_{i}^{T_{j}}) = \frac{RC(su^{T_{j}}, user_{i}^{T_{j}})}{RC_{\max}};

B4, calculating, for su and each user_i ∈ EEN(su), the following similarity, denoted FollowerSim(su, user_i),

FollowerSim(su, user_{i}) = \frac{| FollowerCS(su) \cap FollowerCS(user_{i}) |}{| FollowerCS(su) \cup FollowerCS(user_{i}) |};

B5, calculating, for su and each user_i ∈ EEN(su), the fan similarity, denoted FansSim(su, user_i),

FansSim(su, user_{i}) = \frac{| FansCS(su) \cap FansCS(user_{i}) |}{| FansCS(su) \cup FansCS(user_{i}) |};

B6, finally obtaining the similarity of su and each user_i ∈ EEN(su), denoted Sim(su, user_i),

Sim(su, user_{i}) = \log_{2}(2 + RC(su, user_{i})) \cdot (\lambda_{1} \cdot FollowerSim(su, user_{i}) + \lambda_{2} \cdot FansSim(su, user_{i}) + \lambda_{3} \cdot MBSim(su, user_{i})) .

Compared with the prior art, the method of the invention has the following effects: the method introduces visitor type users, and increases the comprehensiveness and diversity of similar users; dynamic division of time is introduced, so that the dynamics of the microblog can be better reflected, and found similar users are more accurate.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flowchart of obtaining the user set EEN(su) of the extended ego network in step 101 of FIG. 1;

FIG. 3 is a flowchart of finding the user set SimUser(su) similar to su from EEN(su) in step 102 of FIG. 1.

Detailed Description

The following describes the implementation of the present invention in further detail with reference to the accompanying drawings and the detailed description.

Embodiment 1, a method for monitoring similar users of a specific user in a microblog, comprising the following steps:

the method comprises the following specific steps:

A2, acquiring the following set FollowerCS(su) and the fan set FansCS(su) of su;

B1, for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within the time span TimeSpan, the following set FollowerCS(user_i), and the fan set FansCS(user_i);

B2, calculating, for su and each user_i ∈ EEN(su), the dynamic microblog similarity, denoted MBSim(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, λ is an exponential decay parameter, and the microblogs of user_i in time slice T_j are vectorized as:

KW\text{-}user_{i}^{T_{j}} = \{ \langle kw_{1}\text{-}user_{i}^{T_{j}}, w_{1}\text{-}user_{i}^{T_{j}} \rangle, \langle kw_{2}\text{-}user_{i}^{T_{j}}, w_{2}\text{-}user_{i}^{T_{j}} \rangle, \ldots, \langle kw_{y}\text{-}user_{i}^{T_{j}}, w_{y}\text{-}user_{i}^{T_{j}} \rangle \},

wherein w_k-user_i^{T_j} is the weight of the feature term kw_k-user_i^{T_j}, calculated by the TF-IDF method, and the microblog similarity of the two users su and user_i within time slice T_j is calculated as the cosine of the angle between their vectors;

B3, calculating, for su and each user_i ∈ EEN(su), the dynamic interaction correlation, denoted RC(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, and λ is an exponential decay parameter; within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions, denoted RC(su^{T_j}, user_i^{T_j}); the maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference:

RC(su^{T_{j}}, user_{i}^{T_{j}}) = \frac{RC(su^{T_{j}}, user_{i}^{T_{j}})}{RC_{\max}};

B4, the following similarity of su and user_i is

FollowerSim(su, user_{i}) = \frac{| FollowerCS(su) \cap FollowerCS(user_{i}) |}{| FollowerCS(su) \cup FollowerCS(user_{i}) |};

B5, the fan similarity of su and user_i is

FansSim(su, user_{i}) = \frac{| FansCS(su) \cap FansCS(user_{i}) |}{| FansCS(su) \cup FansCS(user_{i}) |};

B6, the final similarity of su and user_i is

Sim(su, user_{i}) = \log_{2}(2 + RC(su, user_{i})) \cdot (\lambda_{1} \cdot FollowerSim(su, user_{i}) + \lambda_{2} \cdot FansSim(su, user_{i}) + \lambda_{3} \cdot MBSim(su, user_{i})) .

Embodiment 2, referring to fig. 1, a method for discovering similar users of a specific user in a microblog includes the following steps:

Step 101, obtaining the user set EEN(su) of the extended ego network; referring to FIG. 2, the specific steps are as follows:

Step 201, acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded, and commented microblogs; the HtmlUnit package is used to simulate a browser and acquire all microblogs of user su, and if a single search returns too many microblogs, they are acquired in successive batches by constraining the time range;

Step 202, acquiring the following set FollowerCS(su) and the fan set FansCS(su) of su; likewise, the HtmlUnit package is used to simulate a browser and obtain the followees and fans of user su;

Step 203, extracting the visitor set VisitorCS(su) from the microblog set MB-su of user su: if a user user_p has forwarded or commented on a microblog of user su, and user_p is neither a followee nor a fan of su, user_p is added to the visitor set VisitorCS(su). Finally, EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su) is obtained.
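The visitor rule of step 203 can be illustrated with a minimal Python sketch. The function name `build_een` and the `(actor, action)` input format are hypothetical and chosen only for illustration; the patent itself collects the data with HtmlUnit and does not prescribe a data structure.

```python
from typing import Iterable, Set, Tuple


def build_een(su: str,
              followees: Set[str],
              fans: Set[str],
              interactions: Iterable[Tuple[str, str]]) -> Tuple[Set[str], Set[str]]:
    """Return (VisitorCS(su), EEN(su)).

    `interactions` is an assumed list of (actor, action) pairs extracted
    from MB-su, where action is "forward" or "comment".
    """
    visitors: Set[str] = set()
    for actor, action in interactions:
        if action not in ("forward", "comment"):
            continue  # only forwarding/commenting makes someone a visitor
        # A visitor interacted with su but is neither a followee nor a fan.
        if actor != su and actor not in followees and actor not in fans:
            visitors.add(actor)
    # EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su)
    een = followees | fans | visitors
    return visitors, een
```

A user who is already a followee or a fan is never added to the visitor set, so the three sets overlap only between followees and fans.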

Step 102, finding the user set SimUser(su) similar to su from EEN(su); referring to FIG. 3, the specific steps are as follows:

Step 301, for each user_i ∈ EEN(su), obtaining the microblog set MB-user_i within TimeSpan, the following set FollowerCS(user_i), and the fan set FansCS(user_i); these are likewise obtained with the HtmlUnit package by simulating a browser.

Step 302, calculating, for su and each user_i ∈ EEN(su), the dynamic microblog similarity, denoted MBSim(su, user_i).

Many users' microblogs are very short, consisting only of words such as "praise", "good", "like", or "expect". Such common colloquial words in microblogs were collected, and the colloquial-word lexicon compiled so far contains 173 words. Microblog content is filtered against this lexicon; microblogs that are filtered out no longer take part in the later feature extraction and microblog similarity calculation, but they still count as interaction behaviors when calculating the interaction correlation among users.

Because the features of microblog samples are sparse, several of the most representative features are selected from the microblogs by a mutual information method and used in the later microblog similarity calculation.

For all the microblogs published by a user user_i in time slice T_j, microblog feature words are extracted based on mutual information as follows:

● After word segmentation and common-word filtering of these microblogs, the resulting feature word set is

WS_{i}^{T_{j}} = \{ w_{i1}^{T_{j}}, w_{i2}^{T_{j}}, \ldots, w_{ix}^{T_{j}} \}

(assuming x feature words);

● The mutual information of two words is calculated as follows:

MI(w_{iu}^{T_{j}}, w_{iv}^{T_{j}}) = \frac{f(w_{iu}^{T_{j}}, w_{iv}^{T_{j}})}{f(w_{iu}^{T_{j}}) + f(w_{iv}^{T_{j}}) - f(w_{iu}^{T_{j}}, w_{iv}^{T_{j}})},

where f(w_{iu}^{T_j}, w_{iv}^{T_j}) is the frequency with which the words w_{iu}^{T_j} and w_{iv}^{T_j} co-occur within a window; because microblogs are short, the window is defined as a single microblog. For the x feature words, the mutual information matrix MIM obtained by pairwise calculation (a symmetric matrix; the mutual information of a feature word with itself is not calculated and is set to 0) is:

[\begin{matrix} & w_{i1}^{T_{j}} & w_{i2}^{T_{j}} & \cdots & w_{ix}^{T_{j}} \\ w_{i1}^{T_{j}} & 0 & MI(w_{i1}^{T_{j}}, w_{i2}^{T_{j}}) & \cdots & MI(w_{i1}^{T_{j}}, w_{ix}^{T_{j}}) \\ w_{i2}^{T_{j}} & \cdots & 0 & \cdots & MI(w_{i2}^{T_{j}}, w_{ix}^{T_{j}}) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{ix}^{T_{j}} & \cdots & \cdots & \cdots & 0 \end{matrix}];

● The y words with the largest mutual information are selected from MIM as the final features of the microblogs of user_i in time slice T_j.
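The mutual-information feature selection can be sketched in Python as follows. Two points are assumptions, since the text does not fix them: f(w) is taken as the number of microblogs containing w, and a word's overall "mutual information degree" is taken as the sum of its row in MIM. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def mutual_information_matrix(microblogs: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """MI(u, v) = f(u, v) / (f(u) + f(v) - f(u, v)); window = one microblog.

    Each microblog is a list of already segmented, filtered words.
    Pairs that never co-occur are simply absent (their MI is 0).
    """
    word_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for words in microblogs:
        unique = sorted(set(words))
        word_freq.update(unique)                    # microblogs containing w
        pair_freq.update(combinations(unique, 2))   # microblogs containing both
    mim: Dict[Tuple[str, str], float] = {}
    for (u, v), f_uv in pair_freq.items():
        mi = f_uv / (word_freq[u] + word_freq[v] - f_uv)
        mim[(u, v)] = mi
        mim[(v, u)] = mi                            # MIM is symmetric
    return mim


def select_features(microblogs: List[List[str]], y: int) -> List[str]:
    """Pick the y words with the largest row sum in MIM (assumed criterion)."""
    row_sum: Counter = Counter()
    for (u, _), mi in mutual_information_matrix(microblogs).items():
        row_sum[u] += mi
    ranked = sorted(row_sum.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:y]]
```

Ties in the ranking are broken alphabetically here purely to make the output deterministic.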

The microblogs of user_i in time slice T_j can then be vectorized as:

KW\text{-}user_{i}^{T_{j}} = \{ \langle kw_{1}\text{-}user_{i}^{T_{j}}, w_{1}\text{-}user_{i}^{T_{j}} \rangle, \langle kw_{2}\text{-}user_{i}^{T_{j}}, w_{2}\text{-}user_{i}^{T_{j}} \rangle, \ldots, \langle kw_{y}\text{-}user_{i}^{T_{j}}, w_{y}\text{-}user_{i}^{T_{j}} \rangle \},

where w_k-user_i^{T_j} is the weight of the feature term kw_k-user_i^{T_j}, calculated by the TF × IDF method.

Within time slice T_j, the microblog similarity of the two users su and user_i is calculated with the classical cosine similarity between their vectors.
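As a minimal sketch of the per-slice cosine similarity, each user's keyword vector can be held as a mapping from feature term to weight. The dictionary representation and the function name are assumptions made for illustration; the weights are assumed to have been computed beforehand (for example by TF × IDF).

```python
import math
from typing import Dict


def cosine_similarity(vec_a: Dict[str, float], vec_b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse keyword-weight vectors."""
    shared = set(vec_a) & set(vec_b)          # only shared terms contribute
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                            # empty vector: treat as dissimilar
    return dot / (norm_a * norm_b)
```

Returning 0 for an empty vector is a design choice of this sketch, covering users who published nothing usable in a given time slice.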

Step 303, calculating, for su and each user_i ∈ EEN(su), the dynamic interaction correlation, denoted RC(su, user_i).

Within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions, denoted RC(su^{T_j}, user_i^{T_j}). The maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference:

RC(su^{T_{j}}, user_{i}^{T_{j}}) = \frac{RC(su^{T_{j}}, user_{i}^{T_{j}})}{RC_{\max}} .

Drawing on the short-term smoothness of microblog user circles, exponential decay is introduced when calculating the users' dynamic interaction correlation RC(su, user_i) across the time slices, where T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter.
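The aggregation over time slices can be sketched as below. The exact aggregation formula is not reproduced in this text, so the sketch assumes a decay-weighted sum in which slice T_j contributes exp(-λ(T_j - T_1)) times its normalized interaction count; both the summation and the function name are assumptions, not the patent's stated formula.

```python
import math
from typing import List


def dynamic_interaction_correlation(counts: List[int], rc_max: int, decay: float) -> float:
    """Assumed form: sum_j exp(-decay * (T_j - T_1)) * RC_j / RC_max.

    counts[j] is the number of interactions between su and user_i in the
    (j+1)-th time slice, so the index j equals T_j - T_1 in slices.
    """
    if rc_max <= 0:
        return 0.0                               # no interactions at all
    total = 0.0
    for j, count in enumerate(counts):
        normalized = count / rc_max              # per-slice normalization by RC_max
        total += math.exp(-decay * j) * normalized
    return total
```

With the decay parameter 0.3 reported later in the experiments, a slice three steps away from T_1 is weighted by exp(-0.9).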

Step 304, calculating, for su and each user_i ∈ EEN(su), the following similarity, denoted FollowerSim(su, user_i), using the Jaccard method,

FollowerSim(su, user_{i}) = \frac{| FollowerCS(su) \cap FollowerCS(user_{i}) |}{| FollowerCS(su) \cup FollowerCS(user_{i}) |} .

Step 305, calculating, for su and each user_i ∈ EEN(su), the fan similarity, denoted FansSim(su, user_i), likewise using the Jaccard method,

FansSim(su, user_{i}) = \frac{| FansCS(su) \cap FansCS(user_{i}) |}{| FansCS(su) \cup FansCS(user_{i}) |} .
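Both Jaccard similarities share one computation, sketched here in Python. The handling of two empty sets (returning 0) is an assumption, since the text does not define that case.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|a ∩ b| / |a ∪ b|, used for both FollowerSim and FansSim."""
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)
```

For example, FollowerSim(su, user_i) would be `jaccard(follower_cs_su, follower_cs_user_i)`, where the two arguments are hypothetical variables holding the following sets.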

Step 306, finally obtaining the similarity of su and each user_i ∈ EEN(su), denoted Sim(su, user_i),

Sim(su, user_{i}) = \log_{2}(2 + RC(su, user_{i})) \cdot (\lambda_{1} \cdot FollowerSim(su, user_{i}) + \lambda_{2} \cdot FansSim(su, user_{i}) + \lambda_{3} \cdot MBSim(su, user_{i})) .
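The final combination can be sketched as a one-line function. The default weights are the values reported for the experiments in this document (0.5, 0.2, 0.3); the function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def overall_similarity(rc: float,
                       follower_sim: float,
                       fans_sim: float,
                       mb_sim: float,
                       weights: Tuple[float, float, float] = (0.5, 0.2, 0.3)) -> float:
    """Sim = log2(2 + RC) * (l1*FollowerSim + l2*FansSim + l3*MBSim)."""
    l1, l2, l3 = weights
    linear = l1 * follower_sim + l2 * fans_sim + l3 * mb_sim
    # With RC = 0 the factor is log2(2) = 1, so users who never interacted
    # keep their plain weighted similarity; interaction only amplifies it.
    return math.log2(2.0 + rc) * linear
```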

The accuracy and distribution of similar-user discovery are compared across four different similar-user discovery methods. The four methods are as follows:

(1) Method 1, SUDByText, calculates user similarity based on the users' background and microblogs. According to the characteristics of the Sina microblog, the selected background information comprises the profile, tags, education, and occupation. The similarity of background information is calculated by the Jaccard method. The microblog similarity calculation does not consider the dynamics of microblogs under time-slice division. In the linear combination, the weights of background similarity and microblog similarity are 0.3 and 0.7, respectively.

(2) Method 2, SUDBySN, calculates user similarity based on the social network of followees and fans. The following similarity and fan similarity are calculated by the Jaccard method. The final similarity linearly combines the two, with a weight of 0.6 for following similarity and 0.4 for fan similarity. The interaction correlation calculation between users does not take the dynamics of interaction into account.

(3) Method 3, SUDByTSN, the existing hybrid method, calculates user similarity based on the users' text information and social network. The text information comprises microblogs and background information such as the profile, tags, education, and occupation; the social network uses only followees and fans and does not consider visitors. The similarity of background information and microblogs is calculated as in Method 1, SUDByText; the similarity of followees and fans is calculated as in Method 2, SUDBySN.

(4) Method 4, SUDByTSN-Zhong, the hybrid method provided by the invention, uses only the users' microblog information as text and uses three types of users (followees, fans, and visitors) when constructing the social network. To reduce the amount of statistical analysis, for the three types of users in a user's extended ego network, the basic information of followees and fans and all microblogs published from January to May 2015 are collected in the layer-1 expansion, with visitors extracted from the microblogs, and only the user names of followees and fans are collected in the layer-2 expansion. The exponential decay parameter of the time slices is λ = 0.3; when calculating user similarity, the weights of following, fan, and microblog similarity are λ_1 = 0.5, λ_2 = 0.2, and λ_3 = 0.3; time slices are divided by week.

Taking the Sina microblog as an example, 50 seed users in 5 fields of academic research, enterprise management, education, culture and military are selected for experimental data acquisition and analysis.

Field keywords are entered in the Sina microblog search box, the "Find people" button is clicked, the two user types "personally verified" and "ordinary user" are selected, and the results are collected with HtmlUnit. In some fields users have too many followees or fans, exceeding tens of thousands or even millions; for ease of analysis, the obtained users are screened so that the number of followees or fans is within 5000. Ten seed users are then randomly selected from each field for experimental analysis, and the collection period of microblogs is limited to January 1, 2015 through May 28, 2015, about 5 months in total. The numbers of verified and ordinary users obtained in the 5 fields are shown in Table 1.

TABLE 1 The 5 fields selected for the experiment

Serial number	Field	Keyword	Number of verified and ordinary users
1	Academic research	Information retrieval	490
2	Enterprise management	Internet executives	45
3	Education	Preschool education	6049
4	Culture	Spy war	876
5	Military	J-20	728

At present, to prevent others from harvesting a user's followees and fans for malicious following or advertising harassment, the Sina microblog limits how many of another user's followees and fans can be accessed: only the first 5 pages, about 100 followees and 100 fans, can be obtained. From a statistical perspective, a sample of 100 followees and 100 fans is still representative.

The numbers of followees, fans, visitors, and microblogs of the 50 users in the 5 fields are shown in Table 2.

TABLE 2 Numbers of followees, fans, visitors, and microblogs of the 50 users

To calculate the similarity between a specific user and each of its followees, fans, and visitors, the collection must be extended by one further layer to gather the followees, fans, and microblogs of those followee, fan, and visitor users. Likewise, 100 followees and 100 fans are collected for each user, and the collection period of microblogs is limited to January 1, 2015 through May 28, 2015.

A user's microblog content consists of original microblogs on the one hand and forwarded/commented microblogs on the other. Forwarded/commented microblogs are also treated as the user's microblog content, but when the same microblog is forwarded or commented on multiple times, it is counted only once.

The total numbers of followee-type users, fan-type users, and microblogs finally obtained for experimental analysis are 2157843, 2086613, and 932531, respectively.

● Accuracy comparison of similar-user discovery

Because microblog users are so numerous, the usual evaluation index for similar-user discovery is Pn: the top n ranked similar users are taken and the proportion of truly similar users among them is judged. For microblog users, the information associated with each user is complex, including followees, fans, microblogs, interactions, and other factors, so manual judgment is very difficult. Pn is therefore modified into the evaluation index Sn, which scores the top n similar users obtained by each method.

Suppose there are m methods under evaluation, and the set of top n similar users obtained by Method_i (1 ≤ i ≤ m) is Method_i = {user_i1, user_i2, …, user_in}. The total number of times user_i1 appears in the similar user sets obtained by all the methods is denoted Count(user_i1), and the Sn of Method_i is computed from these counts. This index needs no manual intervention, is easy to implement, and is relatively objective.

Note that, because SUDByTSN-Zhong extends to visitor-type users, some of its similar users cannot appear in methods 1, 2, and 3. Therefore, when calculating the Sn index of SUDByTSN-Zhong, visitors are handled as follows: for each of the other three methods Method_j ∈ {SUDByText, SUDBySN, SUDByTSN}, the similarity between the visitor and the specified user is calculated, and if the similarity value of a visitor visitor_i ranks within the top n, visitor_i is considered present in the similar user set of Method_j.
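A minimal sketch of the Sn index follows. The exact Sn formula is not reproduced in this text; the sketch assumes that Sn of a method is the sum of Count(user) over that method's top-n users, which is consistent with average scores between n and m·n for n = 10 and m = 4. The function name and input layout are hypothetical, and the special visitor handling described above is assumed to have been applied when the top-n lists were built.

```python
from collections import Counter
from typing import Dict, List


def sn_scores(top_n: Dict[str, List[str]]) -> Dict[str, int]:
    """Assumed Sn: for each method, sum over its top-n users of the number
    of methods whose top-n list contains that user."""
    count: Counter = Counter()
    for users in top_n.values():
        count.update(set(users))     # Count(user): one vote per method
    return {method: sum(count[u] for u in users)
            for method, users in top_n.items()}
```

Under this assumed definition, a method scores high when the other methods agree with its top-ranked users, which is why no manual relevance judgment is needed.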

The average Sn obtained by the four methods for 50 microblog users is shown in table 3.

Table 3 Sn of 50 microblog users obtained by four methods

As can be seen from Table 3, for the average Sn over the 50 users, SUDByTSN-Zhong scores highest at 34.8 and SUDByText scores lowest at 29.4. SUDByTSN-Zhong has a higher Sn score than SUDByTSN because it introduces a dynamic constraint on time, which makes the discovered users more accurate. At the same time, both SUDByTSN and SUDByTSN-Zhong score high, which further verifies the advantage of hybrid social network analysis. SUDByText uses only the user's background and microblog information, and SUDBySN uses only the microblog's social network information (followees and fans), so each has certain shortcomings. Between the two, SUDBySN outperforms SUDByText, which further supports the view that a user's social information is more valuable than the user's other information.

Across the 5 fields, academic research and military score high. The main reason is that the keywords used to retrieve users in these fields, "information retrieval" and "J-20", are narrowly scoped: the seed users' circles of friends are narrow, their published microblogs are specialized, and the similar-user scores of each user are stable. For users in the other three fields (enterprise management, education, and culture), the circle of friends is often very large, fans can number in the hundreds of thousands, and daily microblogs are scattered in topic, which strongly interferes with similar-user calculation. This indicates that the narrower a user's field and the higher its degree of specialization, the better similar-user discovery works.

In addition, the activity of the 500 similar users found for the 50 users (the top 10 ranked similar users of each) was counted: over 95% of the 500 users forwarded, commented on, or posted microblogs more than 100 times within the 5-month period, and only 5% were not very active. The less active users ranked high because of their high following and fan similarity scores. The method provided by the invention is therefore well suited to discovering active users in the microblog.

● Distribution comparison of similar-user discovery

Similar users found by the existing methods are distributed over only two categories, followees and fans, whereas similar users found by the SUDByTSN-Zhong method provided by the invention are distributed over three categories: followees, fans, and visitors.

The distribution of similar users is evaluated with three indexes:

(1) the proportion of followees, p_follower; (2) the proportion of fans, p_fans; (3) the proportion of visitors, p_visitor.

For users in the 5 fields, the p_follower, p_fans, and p_visitor obtained by the four methods are shown in Table 4. In Table 4, a similar user may belong to several user types at once (followee and fan) and is then counted repeatedly when calculating the indexes. For example, if a similar user is both a followee and a fan, it is counted once in the followee proportion and once in the fan proportion.
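The three distribution indexes can be sketched as follows. The index formulas are not reproduced in this text; the sketch assumes each proportion is the number of found similar users of that type divided by the total number of found similar users, so that a user who is both followee and fan is counted in both proportions and the three values may sum to more than 1. The function name is hypothetical.

```python
from typing import Dict, List, Set


def distribution(similar_users: List[str],
                 followees: Set[str],
                 fans: Set[str],
                 visitors: Set[str]) -> Dict[str, float]:
    """Assumed p_type = |similar users of that type| / |similar users|."""
    n = len(similar_users)
    if n == 0:
        return {"p_follower": 0.0, "p_fans": 0.0, "p_visitor": 0.0}
    return {
        "p_follower": sum(u in followees for u in similar_users) / n,
        "p_fans": sum(u in fans for u in similar_users) / n,
        "p_visitor": sum(u in visitors for u in similar_users) / n,
    }
```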

TABLE 4 distribution of similar users obtained by four methods

As can be seen from Table 4, SUDByTSN-Zhong introduces visitor-type users by extending the conventional ego network, which increases the diversity of the similar users obtained. Moreover, since the obtained users are ranked by similarity, introducing visitors yields more similar users. For all four methods p_follower is generally the largest: the average p_follower is 74% for SUDByText, 75% for SUDBySN, 78% for SUDByTSN, and 56% for SUDByTSN-Zhong, which indicates that followee-type users make up the largest share of a microblog user's similar users. For SUDByTSN-Zhong, the proportion of visitors (32%) is slightly greater than the proportion of fans (30%). In the experiments, visitor-type users ranked high in similarity mainly because their microblogs are highly similar to the user's: many users forward or comment on the microblogs of a user user_i without being followees or fans of user_i. This further illustrates the advantage of visitor-type users in similar-user discovery. In addition, some microblog platforms (such as Sina) have begun to limit how many of another user's followees and fans can be obtained, so the idea of finding similar users through visitors is worth considering.

Regarding the similar fan-type and visitor-type users found in the 5 fields: the user fields selected for academic research and military are narrower, and their proportions of fan-type users are 34% and 36%, respectively, and of visitor-type users 36% and 38%, respectively. That fans and visitors forward or comment on a user within a narrow field also shows that they are highly similar to that user in circle of friends or microblog topics.

The method of the present invention is not limited to the examples described in the specific embodiments, and other embodiments derived from the technical solutions of the present invention by those skilled in the art also belong to the technical innovation scope of the present invention.

Claims

1. A method for monitoring similar users of a specific user in a microblog is characterized by comprising the following steps:

the method comprises the following specific steps:

A2, acquiring the following set FollowerCS(su) and the fan set FansCS(su) of su;

B1, for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within TimeSpan, the following set FollowerCS(user_i), and the fan set FansCS(user_i);

B2, calculating the dynamic microblog similarity of su and each user_i ∈ EEN(su), denoted MBSim(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, λ is an exponential decay parameter, the microblogs of user_i in time slice T_j are vectorized as feature terms with weights, the feature-term weights are calculated by the TF-IDF method, and the microblog similarity of the two users su and user_i within time slice T_j is calculated as the cosine of the angle between their vectors;

B3, calculating the dynamic interaction correlation of su and each user_i ∈ EEN(su), denoted RC(su, user_i), wherein T_j is a time slice, T_j - T_1 is the number of time slices between T_j and T_1, λ is the exponential decay parameter, the interaction correlation of the two users su and user_i within time slice T_j is their number of interactions, the maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference;

B4, calculating the following similarity of su and each user_i ∈ EEN(su), denoted FollowerSim(su, user_i);

B5, calculating the fan similarity of su and each user_i ∈ EEN(su), denoted FansSim(su, user_i);

B6, finally obtaining the similarity of su and each user_i ∈ EEN(su), denoted Sim(su, user_i).