CN105045822A - Method for monitoring similar users of a specific user in a microblog
- Publication number: CN105045822A
- Application number: CN201510363990.XA
- Authority: CN (China)
- Prior art keywords: user, users, microblog, similarity, EEN
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for monitoring similar users of a specific user in a microblog. The method comprises the following steps: (1) obtaining the user set of an EEN (Extended Ego Network) according to an input specific user su, denoted EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the followee set of su (the accounts su follows), FansCS(su) is the fan set of su and VisitorCS(su) is the visitor set of su; and (2) finding, on the basis of the similarity of the users' followees, fans and dynamic microblogs and of the dynamic interaction correlation among the users, a set of similar users SimUser(su) ⊆ EEN(su). By introducing visitor-type users, the method increases the comprehensiveness and diversity of the similar users found; by introducing a dynamic division of time, it better reflects the dynamics of the microblog, so that the similar users found are more accurate.
Description
Technical Field
The invention relates to an information mining technology, in particular to a similar user monitoring method for a specific user in a microblog.
Background
Today, social media is considered one of the most valuable information resources on the Web. The microblog is one of many social media; thanks to its strong propagation and ease of use, many users form interaction circles on microblogs that resemble those of real society. Traditional media form a two-mode network between users and topics, whereas the microblog introduces followees and fans, so that a multi-mode network exists between users and topics. Because of the strong propagation of microblog information and its complex network structure, microblogs have attracted close attention from academia and industry in recent years.
Similar users in a microblog are a group of users who share several common attributes on the microblog medium; the attributes mainly comprise the users' background, followees, fans, microblogs, interactions and the like. User information on social media is generally divided into two categories: one is the user's background (e.g., location, education, occupation, interests) and published microblogs; the other is the social network built from followees and fans. Based on these two types of information, existing user similarity calculation methods fall roughly into three types: (1) methods based on the user's background and microblog information, abbreviated SUDByText; (2) social-network methods based on followees and fans, abbreviated SUDBySN; (3) hybrid methods that fuse SUDByText and SUDBySN, abbreviated SUDByTSN. Currently, SUDByTSN is the mainstream research approach.
A conference paper published in the United States in 2011 in the Proceedings of the 2011 Visual Information Communication - International Symposium, entitled "SFViz: interest-based friends exploration and recommendation in social networks" (authors: Gou L, You F, Guo J, Wu L, Zhang XL), proposes calculating user similarity from users' social tags and the topology of the network, including users' followees and fans, but does not make use of visitor-class users.
A journal article published in Germany in 2013 in User Modeling and User-Adapted Interaction, entitled "Exploring social tagging for personalized community recommendations" (authors: Kim HN, Saddik AE), starts from a user and finds the communities that user is interested in on the basis of social tags. The social tags of a community are extracted from the tags of its members, covering the members' interests, emotions, geographic locations, time and so on.
A journal article published in China in 2014 in the Journal of Chinese Information Processing, entitled "Microblog user recommendation based on learning to rank", uses four kinds of user information when recommending microblog users (microblogs, personal information, interaction information and social topology information) and finds that users' interaction information has the greatest influence on the performance of similar-user recommendation.
A journal article published in China in 2014 in the Chinese Journal of Computers, entitled "Similarity measurement of microblog users and its application" (authors: Xu Ximing, Li Dong, Liu Jiang, Li Sheng, Wang Gang and Yuan Shulun), considers users' background information, microblogs, social relations and interaction information when measuring user similarity. With 50 users as seed nodes, one layer of related fans and followees was crawled, and social information was found to be the most valuable when calculating user similarity.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the problems and shortcomings of the prior art, a method for monitoring similar users on microblog media that increases the comprehensiveness and diversity of the similar users found and improves the accuracy of similar-user discovery.
The technical problem is solved by the following technical scheme. The invention relates to a method for monitoring similar users of a specific user in a microblog, characterized by comprising the following steps:
A. obtaining the user set of an extended ego network EEN (Extended Ego Network) according to an input specific user su, denoted EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the followee set of su, FansCS(su) is the fan set of su, and VisitorCS(su) is the visitor set of su;
the specific steps are as follows:
A1. acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded and commented microblogs;
A2. acquiring the followee set FollowerCS(su) and the fan set FansCS(su) of su;
A3. extracting the visitor set VisitorCS(su) from the microblog set MB-su of su, and recording the three types of users as EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su);
B. finding, from EEN(su), a user set SimUser(su) similar to su, based on the similarity of users' followees, fans and dynamic microblogs and on the dynamic interaction correlation among users, with the following specific steps:
B1. for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within TimeSpan, the followee set FollowerCS(user_i) and the fan set FansCS(user_i);
B2. calculating the dynamic microblog similarity between user su and each user_i ∈ EEN(su), denoted MBSim(su, user_i), which combines the per-time-slice similarities with an exponential decay, wherein T_j is a time slice, T_j − T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter. The microblogs of user_i in time slice T_j are represented as a vector KW-user_i^{T_j} of feature terms, wherein the weight of each feature term is calculated with the TF-IDF method. Within time slice T_j, the microblog similarity of the two users su and user_i is calculated as the cosine of the angle between their vectors:

$$\mathrm{MBSim}\left(su^{T_j},\, user_i^{T_j}\right) = \frac{KW\text{-}su^{T_j} \cdot KW\text{-}user_i^{T_j}}{\left\lVert KW\text{-}su^{T_j} \right\rVert \cdot \left\lVert KW\text{-}user_i^{T_j} \right\rVert};$$
B3. calculating the dynamic interaction correlation between user su and each user_i ∈ EEN(su), denoted RC(su, user_i), wherein T_j is a time slice, T_j − T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter; within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions, the maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference;
B4. calculating the followee similarity between user su and each user_i ∈ EEN(su), denoted FollowerSim(su, user_i);
B5. calculating the fan similarity between user su and each user_i ∈ EEN(su), denoted FansSim(su, user_i);
B6. finally obtaining the similarity between user su and each user_i ∈ EEN(su), denoted Sim(user_i).
Compared with the prior art, the method of the invention has the following effects: the method introduces visitor type users, and increases the comprehensiveness and diversity of similar users; dynamic division of time is introduced, so that the dynamics of the microblog can be better reflected, and found similar users are more accurate.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a flowchart of obtaining the user set EEN(su) of the extended ego network in step 101 of fig. 1;
fig. 3 is a flowchart of finding the user set SimUser(su) similar to su from EEN(su) in step 102 of fig. 1.
Detailed Description
The following describes the implementation of the present invention in further detail with reference to the accompanying drawings and the detailed description.
Embodiment 1, a method for monitoring similar users of a specific user in a microblog, comprising the following steps:
A. obtaining the user set of an extended ego network EEN (Extended Ego Network) according to an input specific user su, denoted EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the followee set of su, FansCS(su) is the fan set of su, and VisitorCS(su) is the visitor set of su;
the specific steps are as follows:
A1. acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded and commented microblogs;
A2. acquiring the followee set FollowerCS(su) and the fan set FansCS(su) of su;
A3. extracting the visitor set VisitorCS(su) from the microblog set MB-su of su, and recording the three types of users as EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su);
B. finding, from EEN(su), a user set SimUser(su) similar to su, based on the similarity of users' followees, fans and dynamic microblogs and on the dynamic interaction correlation among users, with the following specific steps:
B1. for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within TimeSpan, the followee set FollowerCS(user_i) and the fan set FansCS(user_i);
B2. calculating the dynamic microblog similarity between user su and each user_i ∈ EEN(su), denoted MBSim(su, user_i), which combines the per-time-slice similarities with an exponential decay, wherein T_j is a time slice, T_j − T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter. The microblogs of user_i in time slice T_j are represented as a vector KW-user_i^{T_j} of feature terms, wherein the weight of each feature term is calculated with the TF-IDF method. Within time slice T_j, the microblog similarity of the two users su and user_i is calculated as the cosine of the angle between their vectors:

$$\mathrm{MBSim}\left(su^{T_j},\, user_i^{T_j}\right) = \frac{KW\text{-}su^{T_j} \cdot KW\text{-}user_i^{T_j}}{\left\lVert KW\text{-}su^{T_j} \right\rVert \cdot \left\lVert KW\text{-}user_i^{T_j} \right\rVert};$$
B3. calculating the dynamic interaction correlation between user su and each user_i ∈ EEN(su), denoted RC(su, user_i), wherein T_j is a time slice, T_j − T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter; within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions, the maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference;
B4. calculating the followee similarity between user su and each user_i ∈ EEN(su), denoted FollowerSim(su, user_i);
B5. calculating the fan similarity between user su and each user_i ∈ EEN(su), denoted FansSim(su, user_i);
B6. finally obtaining the similarity between user su and each user_i ∈ EEN(su), denoted Sim(user_i).
Embodiment 2, referring to fig. 1, a method for discovering similar users of a specific user in a microblog includes the following steps:
Step 101, obtaining the user set EEN(su) of the extended ego network; referring to fig. 2, the specific steps are as follows:
Step 201, acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded and commented microblogs. An HtmlUnit package is used to simulate a browser and acquire all microblogs of user su; if a single search returns too many microblogs to display at once, they are acquired in successive batches using time constraints;
Step 202, acquiring the followee set FollowerCS(su) and the fan set FansCS(su) of su; similarly, the HtmlUnit package is used to simulate a browser and obtain the followees and fans of user su;
Step 203, extracting the visitor set VisitorCS(su) from the microblog set MB-su of user su: if a user user_p has forwarded or commented on a microblog of user su, and user_p is neither a followee nor a fan of su, then user_p is added to the visitor set VisitorCS(su). Finally, EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su) is obtained.
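The visitor rule of step 203 and the union that forms EEN(su) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `build_een` and the input format (a flat list of the user ids that forwarded or commented on su's microblogs) are assumptions made for the example.

```python
def build_een(followees, fans, interactions):
    """Return (visitors, een) for a specific user su.

    followees, fans: sets of user ids, standing in for FollowerCS(su)
    and FansCS(su).
    interactions: iterable of user ids that forwarded or commented on
    su's microblogs in MB-su (hypothetical input format).
    """
    known = followees | fans
    # A visitor interacted with su's microblogs but is neither a
    # followee nor a fan of su.
    visitors = {u for u in interactions if u not in known}
    een = followees | fans | visitors
    return visitors, een


followees = {"a", "b"}
fans = {"b", "c"}
interactions = ["c", "d", "e", "d"]
visitors, een = build_een(followees, fans, interactions)
print(sorted(visitors))  # ['d', 'e']
print(sorted(een))       # ['a', 'b', 'c', 'd', 'e']
```

User "c" interacted with su but is already a fan, so only "d" and "e" become visitors.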
Step 102, finding from EEN(su) a user set SimUser(su) similar to su; referring to fig. 3, the specific steps are as follows:
Step 301, for each user_i ∈ EEN(su), acquiring the microblog set MB-user_i within TimeSpan, the followee set FollowerCS(user_i) and the fan set FansCS(user_i); similarly, the microblogs, followees and fans are obtained using the HtmlUnit package in browser-simulation mode.
Step 302, calculating the dynamic microblog similarity between user su and each user_i ∈ EEN(su), denoted MBSim(su, user_i).
Many users' microblogs are very short, with contents such as 'praise', 'good', 'like' or 'expect'. The colloquial words commonly used in microblogs were therefore compiled; the colloquial-word lexicon compiled so far contains 173 words. Microblog content is filtered against this lexicon, and the filtered microblogs no longer take part in the later feature extraction and microblog similarity calculation, but they still count as interaction behaviors among users when calculating the interaction correlation among users.
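A minimal Python sketch of this filtering step follows. The lexicon excerpt, the whitespace tokenization and the rule "a microblog is filtered when every token is in the lexicon" are all assumptions made for illustration; the text does not list the 173 entries or state the exact matching rule, and Chinese text would need a word segmenter.

```python
# Hypothetical excerpt of the colloquial-word lexicon; the full lexicon
# of 173 words is not listed in the text.
COLLOQUIAL = {"praise", "good", "like", "expect"}


def split_posts(posts, lexicon=COLLOQUIAL):
    """Split posts into (kept, filtered).

    Assumed rule: a post is filtered when all of its whitespace-separated
    tokens belong to the lexicon.
    """
    kept, filtered = [], []
    for post in posts:
        tokens = post.lower().split()
        if tokens and all(t in lexicon for t in tokens):
            filtered.append(post)
        else:
            kept.append(post)
    return kept, filtered


posts = ["Good", "like like", "new retrieval model released"]
kept, filtered = split_posts(posts)
print(kept)        # ['new retrieval model released']
print(filtered)    # ['Good', 'like like']
print(len(posts))  # 3: every post still counts as an interaction
```

Only `kept` would go on to feature extraction, while the interaction count is taken over all posts.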
Because the features of microblog samples are sparse, a mutual information method is used to select the most representative features from the microblogs for the later microblog similarity calculation.
All microblogs published by a user user_i within time slice T_j are taken together, and microblog feature words are extracted from them based on mutual information as follows:
● The microblogs are segmented into words and common words are filtered out, giving a set of candidate feature words.
● The mutual information of each pair of words is calculated from the frequency with which the two words co-occur within a window; because microblogs are short, the window is defined as a single microblog. For x feature words, the pairwise calculation yields a mutual information matrix MIM (a symmetric matrix; the mutual information of a feature word with itself is not calculated and its value is set to 0).
● The y words with the largest mutual information are selected from MIM as the final features of the user's microblogs in that time slice.
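The feature-selection bullets above can be illustrated with the following Python sketch. Several choices here are assumptions rather than the patent's definitions: mutual information is taken as pointwise mutual information over microblog-level co-occurrence, word pairs that never co-occur get 0, and words are ranked by their total mutual information with all other words. The function names are hypothetical.

```python
import math
from itertools import combinations


def mutual_information_matrix(posts):
    """posts: list of token lists, one per microblog (the co-occurrence window)."""
    n = len(posts)
    post_sets = [set(p) for p in posts]
    vocab = sorted(set().union(*post_sets))
    # Number of microblogs containing each word.
    df = {w: sum(1 for p in post_sets if w in p) for w in vocab}
    # Diagonal entries stay 0: a word is not compared with itself.
    mim = {a: {b: 0.0 for b in vocab} for a in vocab}
    for a, b in combinations(vocab, 2):
        co = sum(1 for p in post_sets if a in p and b in p)
        if co:
            # Assumed definition: log( P(a, b) / (P(a) * P(b)) ).
            mi = math.log(n * co / (df[a] * df[b]))
            mim[a][b] = mim[b][a] = mi  # symmetric matrix
    return vocab, mim


def top_features(vocab, mim, y):
    # Assumed ranking: total mutual information of a word with all others.
    return sorted(vocab, key=lambda w: (-sum(mim[w].values()), w))[:y]


posts = [["search", "index"], ["search", "index"], ["weather"], ["search", "rank"]]
vocab, mim = mutual_information_matrix(posts)
print(top_features(vocab, mim, 1))  # ['search']
```

In this toy data "search" co-occurs with two other words, so it accumulates the largest total and is selected first.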
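The microblogs of user_i in time slice T_j can then be vectorized as a vector KW-user_i^{T_j} of feature terms, where the weight of each feature term is calculated using the TF × IDF method.
Within time slice T_j, the microblog similarity of the two users su and user_i is calculated with the classical cosine similarity:

$$\mathrm{MBSim}\left(su^{T_j},\, user_i^{T_j}\right) = \frac{KW\text{-}su^{T_j} \cdot KW\text{-}user_i^{T_j}}{\left\lVert KW\text{-}su^{T_j} \right\rVert \cdot \left\lVert KW\text{-}user_i^{T_j} \right\rVert}$$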
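As a rough illustration of the per-time-slice similarity, the sketch below builds TF × IDF vectors and compares them with cosine similarity. The exact TF × IDF variant is not given in the text, so the weighting used here (raw term count times log(1 + N / df), with one document per user per time slice) is an assumption, as are the function names.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict mapping a user name to that user's feature words in one time slice."""
    n = len(docs)
    df = Counter(w for words in docs.values() for w in set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        # Assumed variant: raw term frequency times log(1 + N / df).
        vectors[name] = {w: c * math.log(1 + n / df[w]) for w, c in tf.items()}
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no features in this slice is treated as dissimilar
    return dot / (norm_u * norm_v)


docs = {
    "su": ["retrieval", "ranking", "retrieval"],
    "user_1": ["retrieval", "ranking"],
    "user_2": ["football"],
}
vecs = tfidf_vectors(docs)
sim_1 = cosine(vecs["su"], vecs["user_1"])
sim_2 = cosine(vecs["su"], vecs["user_2"])
print(sim_1 > sim_2, sim_2)  # True 0.0
```

Users who share feature words with su score above zero, while a user with no shared words scores exactly 0.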
Step 303, calculating the dynamic interaction correlation between user su and each user_i ∈ EEN(su), denoted RC(su, user_i).
Within time slice T_j, the interaction correlation of the two users su and user_i is their number of interactions. The maximum number of interactions over the m time slices is denoted RC_max, and the users' interaction correlation is normalized with RC_max as the reference.
Drawing on the short-term smoothness of a microblog user's circle, exponential decay is introduced to describe this relation when calculating the dynamic interaction correlation of users, where T_j is a time slice, T_j − T_1 is the number of time slices between T_j and T_1, and λ is the exponential decay parameter.
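The aggregation formula itself is not reproduced in the text, so the following Python sketch only shows one plausible reading, under stated assumptions: each time slice contributes its interaction count divided by RC_max, weighted by exp(−λ · (T_j − T_1)), and RC_max is taken per user pair. The list is read in order starting from T_1; which end of the time span T_1 denotes is not stated. The function name is hypothetical.

```python
import math


def dynamic_interaction(counts, lam=0.3):
    """counts[k] is the number of interactions in the slice k steps after T_1.

    Assumed form: sum of (count / RC_max) * exp(-lam * offset), where
    offset is the slice distance T_j - T_1 named in the text.
    """
    rc_max = max(counts, default=0)
    if rc_max == 0:
        return 0.0  # no interactions in any slice
    return sum(
        (c / rc_max) * math.exp(-lam * offset)
        for offset, c in enumerate(counts)
    )


# Three weekly slices with 4, 0 and 2 interactions.
rc = dynamic_interaction([4, 0, 2], lam=0.3)
print(round(rc, 3))  # 1.274
```

With λ = 0.3, the slice two steps away from T_1 keeps about 55% of its normalized weight.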
Step 304, calculating the followee similarity between user su and each user_i ∈ EEN(su), denoted FollowerSim(su, user_i), using the Jaccard method.
Step 305, calculating the fan similarity between user su and each user_i ∈ EEN(su), denoted FansSim(su, user_i), likewise using the Jaccard method.
Step 306, finally obtaining the similarity between user su and each user_i ∈ EEN(su), denoted Sim(user_i).
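Steps 304 and 305 both rely on the Jaccard coefficient, the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows; returning 0 when both sets are empty is an assumption, since the text does not cover that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections of user ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # both sets empty: similarity defined as 0 (assumption)
    return len(a & b) / len(union)


# FollowerSim: compare the followee sets of su and user_i.
follower_sim = jaccard({"x", "y", "z"}, {"y", "z", "w"})
# FansSim: compare the fan sets of su and user_i.
fans_sim = jaccard({"p"}, {"q"})
print(follower_sim, fans_sim)  # 0.5 0.0
```

The first pair shares two of four distinct followees, giving 0.5; the second pair shares no fans.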
The accuracy and distribution of similar-user discovery are compared across four different similar-user discovery methods. The four methods are as follows:
(1) Method 1, SUDByText, calculates user similarity based on users' background and microblogs. In line with the characteristics of the Sina microblog, the selected background information comprises the profile introduction, tags, education and occupation. The similarity of background information is calculated with the Jaccard method. The microblog similarity calculation does not consider the dynamics obtained by dividing microblogs into time slices. In the linear integration, the weights of background similarity and microblog similarity are 0.3 and 0.7 respectively.
(2) Method 2, SUDBySN, calculates user similarity based on the social network of followees and fans. Followee similarity and fan similarity are calculated with the Jaccard method. The final similarity linearly integrates followee similarity and fan similarity, with a weight of 0.6 for followee similarity and 0.4 for fan similarity. The interaction correlation calculation between users does not take the dynamics of interaction into account.
(3) Method 3, SUDByTSN, the existing hybrid method, calculates user similarity based on users' text information and social network. The text information comprises microblogs and background information such as profile introduction, tags, education and occupation; the social network uses only followees and fans and does not consider visitors. The similarity of background information and microblogs is calculated as in method 1, SUDByText; the similarity of followees and fans is calculated as in method 2, SUDBySN.
(4) Method 4, SUDByTSN-Zhong, the hybrid method provided by the invention, uses only the user's microblog information as text and uses three types of users (followees, fans and visitors) when constructing the social network. To reduce the amount of statistical analysis, for the three types of users in a user's extended ego network, the basic information of followees and fans and all microblogs published from January to May 2015 are collected in the layer-1 expansion, and visitors are extracted from those microblogs; in the layer-2 expansion only the user names of followees and fans are collected. The exponential decay parameter of the time slices is λ = 0.3; when calculating user similarity, the weights of followee, fan and microblog similarity are λ1 = 0.5, λ2 = 0.2 and λ3 = 0.3; time slices are divided by week.
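The weights quoted for method 4 can be applied as a simple weighted sum, sketched below in Python. That the three similarities are combined linearly is an assumption carried over from the "linear integration" described for methods 1 and 2; how the interaction correlation RC enters the final score is not stated in this passage, so it is left out. The function names are hypothetical.

```python
def combined_similarity(follower_sim, fans_sim, mb_sim, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of followee, fan and microblog similarity.

    The default weights are the values 0.5, 0.2 and 0.3 quoted for method 4;
    the linear form is an assumption.
    """
    w1, w2, w3 = weights
    return w1 * follower_sim + w2 * fans_sim + w3 * mb_sim


def rank_similar(scores, n=10):
    """Return the n users with the highest score (ties broken by name)."""
    return sorted(scores, key=lambda u: (-scores[u], u))[:n]


scores = {
    "u1": combined_similarity(0.5, 0.0, 0.8),  # about 0.49
    "u2": combined_similarity(0.2, 0.5, 0.1),  # about 0.23
}
print(rank_similar(scores, n=1))  # ['u1']
```

Swapping in the weights (0.6, 0.4, 0.0) would reproduce the followee/fan weighting quoted for method 2.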
Taking the Sina microblog as an example, 50 seed users in 5 fields (academic research, enterprise management, education, culture and military) are selected for experimental data acquisition and analysis.
Field keywords are entered in the Sina microblog search box, the 'find people' button is clicked, the two user types 'personal authentication' and 'ordinary user' are selected, and the results are collected with HtmlUnit. In some fields users have too many followees or fans, exceeding tens of thousands or even millions; for ease of analysis, the users obtained are screened so that the number of followees or fans is within 5000. 10 seed users are then randomly selected from each field for experimental analysis, and microblog collection is limited to the period from 1 January 2015 to 28 May 2015, about 5 months in total. The numbers of authenticated and ordinary users obtained in the 5 fields are shown in Table 1.
Table 1. The 5 fields selected for the experiment
Serial number | Field | Keyword | Number of authenticated and ordinary users |
1 | Academic research | Information retrieval | 490 |
2 | Enterprise management | Internet executives | 45 |
3 | Education | Preschool education | 6049 |
4 | Culture | Spy war | 876 |
5 | Military | J-20 | 728 |
At present, to prevent others from harvesting a user's followees and fans for malicious following or advertising harassment, the Sina microblog limits access to the followee and fan lists of users other than oneself: only the first 5 pages can be obtained, about 100 followees and 100 fans. From a statistical analysis perspective, samples of 100 followees and 100 fans are still representative.
The numbers of followees, fans, visitors and microblogs of the 50 users in the 5 fields are shown in Table 2.
Table 2. Numbers of followees, fans, visitors and microblogs of the 50 users
To calculate the similarity between a specific user and each of its followees, fans and visitors, a further layer must be expanded to collect the followees, fans and microblogs of those followee, fan and visitor users. As before, 100 followees and 100 fans are collected for each user, and microblog collection is limited to the period from 1 January 2015 to 28 May 2015.
A user's microblog content consists of original posts on the one hand and forwarded/commented posts on the other. Forwarded/commented microblogs are also treated as the user's microblog content, but when the same microblog is forwarded/commented multiple times, it is counted only once.
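The counting rule for repeated forwards and comments can be sketched as follows. The record format `(kind, source_id, text)` is hypothetical, and treating a forward and a comment on the same source microblog as one occurrence is an assumption about how "the same microblog" is meant.

```python
def microblog_content(posts):
    """Collect a user's microblog content, counting each forwarded or
    commented source microblog only once.

    posts: list of (kind, source_id, text) tuples, where kind is
    'original', 'forward' or 'comment', and source_id identifies the
    microblog being forwarded or commented on (None for originals).
    """
    seen_sources = set()
    content = []
    for kind, source_id, text in posts:
        if kind == "original":
            content.append(text)
        elif source_id not in seen_sources:
            seen_sources.add(source_id)
            content.append(text)
    return content


posts = [
    ("original", None, "a"),
    ("forward", 7, "b"),
    ("comment", 7, "c"),  # same source microblog as the forward above
    ("forward", 9, "d"),
]
print(microblog_content(posts))  # ['a', 'b', 'd']
```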
The totals finally obtained for experimental analysis are 2,157,843 followee-class users, 2,086,613 fan-class users and 932,531 microblogs.
● Accuracy comparison of similar-user discovery
Because microblog users are so numerous, the evaluation index commonly used for similar-user discovery is Pn: the top n ranked similar users are taken and the proportion of truly similar users among them is judged. For microblog users, the information related to each user is complicated, including followees, fans, microblogs, interactions and other factors, so manual judgment is very difficult. Pn is therefore modified into the evaluation index Sn, which scores the top n similar users obtained by each method.
Suppose there are m methods to be evaluated, and the set of the top n similar users obtained by Method_i (1 ≤ i ≤ m) is Method_i = {user_i1, user_i2, …, user_in}. The total number of times user_i1 appears in the similar-user sets obtained by the methods is denoted Count(user_i1), and the Sn of Method_i is computed from these counts. This evaluation needs no manual intervention, is easy to implement and is relatively objective.
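The Sn formula itself is not reproduced in the text. The Python sketch below shows one reading that is consistent with the description, under the assumption that the Sn of a method is the sum of Count(user) over that method's own top-n list; the function name and sample data are hypothetical.

```python
def sn_scores(top_lists):
    """top_lists: dict mapping a method name to its list of top-n similar users.

    Count(user) is the number of methods whose top-n list contains the user.
    Assumed score: Sn(method) = sum of Count(user) over the method's list.
    """
    count = {}
    for users in top_lists.values():
        for u in set(users):
            count[u] = count.get(u, 0) + 1
    return {m: sum(count[u] for u in users) for m, users in top_lists.items()}


top_lists = {
    "M1": ["a", "b"],
    "M2": ["a", "c"],
    "M3": ["a", "b"],
}
print(sn_scores(top_lists))  # {'M1': 5, 'M2': 4, 'M3': 5}
```

Under this reading a method scores higher when the other methods agree with its top-n users, which is why no manual judgment is needed.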
Note that, because the proposed method SUDByTSN-Zhong extends to visitor-class users, some of its similar users are not available in method 1, method 2 and method 3. Therefore, when calculating the Sn index of SUDByTSN-Zhong, visitors are handled as follows: for each of the other three methods, Method_j ∈ {SUDByText, SUDBySN, SUDByTSN}, the similarity between the visitor and the specified user is calculated; if the similarity value of visitor_i would place it in the top n, then visitor_i is considered to be present in the similar-user set of Method_j.
The average Sn obtained by the four methods for the 50 microblog users is shown in Table 3.
Table 3. Sn of the 50 microblog users obtained by the four methods
As can be seen from Table 3, for the average Sn over the 50 users, SUDByTSN-Zhong scores highest at 34.8 and SUDByText scores lowest at 29.4. SUDByTSN-Zhong has a higher Sn score than SUDByTSN because it introduces a dynamic constraint on time, making the discovered users more accurate. At the same time, both SUDByTSN and SUDByTSN-Zhong score high, which further verifies the advantages of hybrid social network analysis. SUDByText uses only users' background and microblog information, and SUDBySN uses only the microblog's social network information (followees and fans); each has certain shortcomings. Between SUDByText and SUDBySN, SUDBySN is superior, which further verifies that a user's social information is more valuable than the user's other information.
Across the 5 fields, academic research and military score high, mainly because the keywords "information retrieval" and "J-20" used to obtain users in those fields are narrowly scoped: the seed users' circles of friends are narrow, their published microblogs are professional, and each user's similar-user scores are stable. For users in the other three fields (enterprise management, education and culture), the circle of friends is often too large, fans can number in the hundreds of thousands, and daily microblogs are scattered, which strongly interferes with the calculation of similar users. This indicates that the narrower the field a user belongs to and the higher its degree of specialization, the better the results when finding similar users.
In addition, the activity of the 500 similar users found for the 50 users (the top 10 ranked similar users of each) was counted: over 95% of the 500 users forwarded, commented on or posted microblogs more than 100 times within the 5-month period, and only 5% were not very active. The less active users rank high because of high followee and fan similarity scores. The method provided by the invention is therefore more conducive to discovering active users in the microblog.
● distribution comparison of finding similar users
Similar users found by the existing method are only distributed in two categories of attention and fans, and similar users found by the SUDByTSN-Zhong method provided by the invention are distributed in three categories of attention, fans and visitors.
The distribution of similar users is evaluated with three indexes:
(1) the proportion of attention-class users, $p_{follower}$; (2) the proportion of fan-class users, $p_{fans}$; (3) the proportion of visitor-class users, $p_{visitor}$.
For the users in the 5 fields, the $p_{follower}$, $p_{fans}$ and $p_{visitor}$ values obtained by the four methods are shown in Table 4. In Table 4, a discovered similar user may belong to several classes at the same time (for example, both attention and fans), in which case the user is counted repeatedly when the indexes are calculated. For example, a similar user who is both an attention user and a fan is counted once in the attention proportion and once in the fan proportion.
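The repeated counting described above can be illustrated with a minimal sketch. It assumes that each index is the share of the discovered similar users that fall in the corresponding class; that denominator, the function name `distribution_ratios` and its parameters are assumptions made for illustration, not definitions taken from the patent.

```python
from typing import Dict, Iterable, Set


def distribution_ratios(
    similar_users: Iterable[str],
    follower_cs: Set[str],
    fans_cs: Set[str],
    visitor_cs: Set[str],
) -> Dict[str, float]:
    """Share of discovered similar users found in each class.

    Assumption: each ratio is taken over the number of similar users.
    A user who belongs to several classes is counted once per class,
    so the three ratios may sum to more than 1.
    """
    users = list(dict.fromkeys(similar_users))  # de-duplicate, keep order
    if not users:
        return {"p_follower": 0.0, "p_fans": 0.0, "p_visitor": 0.0}
    n = len(users)
    return {
        "p_follower": sum(u in follower_cs for u in users) / n,
        "p_fans": sum(u in fans_cs for u in users) / n,
        "p_visitor": sum(u in visitor_cs for u in users) / n,
    }


# User "b" is both an attention user and a fan, so it is counted twice.
ratios = distribution_ratios(
    ["a", "b", "c", "d"],
    follower_cs={"a", "b"},
    fans_cs={"b", "c"},
    visitor_cs={"d"},
)
```

Under this assumption the ratios need not sum to 100%, which is consistent with the SUDByTSN-Zhong averages reported for Table 4.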
TABLE 4 distribution of similar users obtained by four methods
As can be seen from Table 4, the method SUDByTSN-Zhong introduces visitor-class users by extending the conventional ego network, which increases the diversity of the similar users obtained. Because the obtained users are ranked by similarity, more similar users are obtained once visitors are introduced. Across the four methods, $p_{follower}$ is generally the largest: the average $p_{follower}$ is 74% for SUDByText, 75% for SUDBySN, 78% for SUDByTSN and 56% for SUDByTSN-Zhong, which indicates that attention-class users account for the largest share of a microblog user's similar users. For SUDByTSN-Zhong, the proportion of visitors (32%) is slightly greater than the proportion of fans (30%). In the experiments, visitor-class users were often ranked near the top by similarity, mainly because the microblog similarity between the users is relatively high: many users forward or comment on the microblogs of a user $user_i$ without belonging to the attention set or fan set of $user_i$. This further illustrates the advantage of visitor-class users in discovering similar users. In addition, some microblog platforms (such as Sina Weibo) have begun to limit how many attention and fan users can be retrieved for accounts other than one's own, so the idea of finding similar users through visitors is worth considering.
For the fan-class and visitor-class similar users found in the 5 fields, the fields of academic research and military are narrower: the proportions of fan-class users for academic research and military are 34% and 36% respectively, and the proportions of visitor-class users are 36% and 38% respectively. That fans and visitors forward or comment on the microblogs of a user in a narrow field also shows that they are highly similar to that user in circle of friends or microblog topics.
The method of the present invention is not limited to the examples described in the specific embodiments; other embodiments derived by those skilled in the art from the technical solutions of the present invention also fall within the scope of technical innovation of the present invention.
Claims (1)
1. A method for monitoring similar users of a specific user in a microblog is characterized by comprising the following steps:
A. obtaining the user set of an extended ego network EEN (ExtendedEgoNet) according to an input specific user su, recorded as EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su), wherein FollowerCS(su) is the attention set of su, FansCS(su) is the fan set of su, and VisitorCS(su) is the visitor set of su;
the specific steps are as follows:
A1, acquiring the set MB-su of all microblogs of user su within the time span TimeSpan, including original, forwarded and commented microblogs;
A2, acquiring the attention set FollowerCS(su) and the fan set FansCS(su) of su;
A3, extracting the visitor set VisitorCS(su) from the microblog set MB-su of su, and recording the three types of users as EEN(su) = FollowerCS(su) ∪ FansCS(su) ∪ VisitorCS(su);
B. based on the similarity of the users' attention sets, fan sets and dynamic microblogs, and on the dynamic interaction among users, finding from EEN(su) the user set SimUser(su) similar to su, with the following specific steps:
B1, for each user $user_i$ in EEN(su), acquiring the microblog set MB-$user_i$ within TimeSpan, the attention set FollowerCS($user_i$) and the fan set FansCS($user_i$);
B2, calculating the dynamic microblog similarity between su and $user_i$, recorded as MBSim(su, $user_i$), wherein $T_j$ is a certain time slice, the calculation result of $T_j - T_1$ is the number of time slices by which the two differ, and $l$ is an exponential decay parameter; the microblogs of $user_i$ in time slice $T_j$ are represented as a vector whose feature weights are calculated by the TF-IDF method; within time slice $T_j$, the microblog similarity of the two users su and $user_i$ is calculated as the cosine of the angle between their vectors;
B3, calculating the interaction relevance between su and $user_i$, recorded as RC(su, $user_i$), wherein $T_j$ is a certain time slice, the calculation result of $T_j - T_1$ is the number of time slices by which the two differ, and $l$ is the exponential decay parameter; within time slice $T_j$, the interaction relevance of the two users su and $user_i$ is their number of interactions; the maximum number of interactions over the $m$ time slices is recorded as $RC_{max}$, and the users' interaction relevance is normalized with $RC_{max}$ as the reference;
B4, calculating the attention similarity between su and $user_i$, recorded as FollowerSim(su, $user_i$);
B5, calculating the fan similarity between su and $user_i$, recorded as FansSim(su, $user_i$);
B6, finally obtaining the overall similarity between user su and $user_i$.
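Steps A3, B2 and B3 can be sketched as follows. This is only an illustration under stated assumptions, not the claimed formulas: it assumes that visitors are users who forwarded or commented on su's microblogs without being in the attention or fan set, that per-slice values are combined as a sum weighted by $e^{-l(T_j - T_1)}$ with slices ordered from $T_1$, and that $RC_{max}$ is the largest per-slice interaction count for the user pair. All function names and the `(actor, target)` interaction format are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple


def build_een(
    su: str,
    followers: Set[str],
    fans: Set[str],
    interactions: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Step A3 (assumed reading): each pair (actor, target) means that
    actor forwarded or commented on a microblog of target. Visitors are
    actors on su's microblogs that are neither attention users nor fans."""
    visitors = {
        actor for actor, target in interactions
        if target == su and actor != su
    } - followers - fans
    return {
        "FollowerCS": followers,
        "FansCS": fans,
        "VisitorCS": visitors,
        "EEN": followers | fans | visitors,
    }


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors,
    for example TF-IDF weights of one time slice (step B2)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def decayed_sum(per_slice: List[float], decay: float) -> float:
    """Assumed combination: sum_j exp(-decay * (T_j - T_1)) * value_j,
    where list index j stands for the slice difference T_j - T_1."""
    return sum(math.exp(-decay * j) * v for j, v in enumerate(per_slice))


def interaction_relevance(counts: List[int], decay: float) -> float:
    """Step B3 (assumed form): per-slice interaction counts normalized
    by the largest per-slice count, then combined with time decay."""
    rc_max = max(counts, default=0)
    if rc_max <= 0:
        return 0.0
    return decayed_sum([c / rc_max for c in counts], decay)


een = build_een(
    "su",
    followers={"a"},
    fans={"b"},
    interactions=[("c", "su"), ("a", "su"), ("su", "d")],
)
rc = interaction_relevance([2, 0, 4], decay=math.log(2))
```

Under the same assumption, the dynamic microblog similarity of step B2 would be `decayed_sum` applied to the per-slice `cosine` values of the two users' vectors.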
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510363990.XA CN105045822A (en) | 2015-06-26 | 2015-06-26 | Method for monitoring similar users of specific user in microblog |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105045822A (en) | 2015-11-11 |
Family
ID=54452369
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510363990.XA Pending CN105045822A (en) | 2015-06-26 | 2015-06-26 | Method for monitoring similar users of specific user in microblog |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105045822A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617289A (en) * | 2013-12-12 | 2014-03-05 | 北京交通大学长三角研究院 | Micro-blog recommendation method based on user characteristics and network relations |
CN104239399A (en) * | 2014-07-14 | 2014-12-24 | 上海交通大学 | Method for recommending potential friends in social network |
Non-Patent Citations (1)
Title |
---|
袁树仑: "《微博社会网络中人物与团体信息挖掘》", 《中国优秀硕士学位论文全文数据库》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106097113A (en) * | 2016-06-21 | 2016-11-09 | 仲兆满 | A kind of social network user sound interest digging method |
CN106097113B (en) * | 2016-06-21 | 2020-11-27 | 江苏海洋大学 | Social network user dynamic and static interest mining method |
CN109508380A (en) * | 2018-03-25 | 2019-03-22 | 哈尔滨工程大学 | A kind of method that combination user structure similarity carries out microblog emotional analysis |
CN109508380B (en) * | 2018-03-25 | 2021-07-16 | 哈尔滨工程大学 | Method for analyzing microblog emotion by combining user structure similarity |
CN108920479A (en) * | 2018-04-16 | 2018-11-30 | 国家计算机网络与信息安全管理中心 | For two micro- across information source account recommended methods in one end |
CN108920479B (en) * | 2018-04-16 | 2022-06-17 | 国家计算机网络与信息安全管理中心 | Cross-information-source account recommendation method for two micro terminals |
CN108984609A (en) * | 2018-06-09 | 2018-12-11 | 天津大学 | The quantization method that network-oriented safety discipline frontier occurs |
CN108984609B (en) * | 2018-06-09 | 2021-11-02 | 天津大学 | Quantification method for new field of network security subject |
CN109597924A (en) * | 2018-09-14 | 2019-04-09 | 湖北大学 | A kind of microblogging social circle method for digging and system based on artificial immune network |
CN109597924B (en) * | 2018-09-14 | 2020-02-07 | 湖北大学 | Microblog social circle mining method and system based on artificial immune network |
CN112732981A (en) * | 2021-01-08 | 2021-04-30 | 深圳力维智联技术有限公司 | Data processing method, device and system and computer readable storage medium |
CN112818258A (en) * | 2021-03-08 | 2021-05-18 | 珠海市蜂巢数据技术有限公司 | Social network user searching method based on keywords, computer device and computer-readable storage medium |
CN112818258B (en) * | 2021-03-08 | 2024-05-10 | 珠海市蜂巢数据技术有限公司 | Social network user searching method based on keywords, computer device and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhao et al. | A personalized hashtag recommendation approach using LDA-based topic model in microblog environment | |
Oliveira et al. | Can social media reveal the preferences of voters? A comparison between sentiment analysis and traditional opinion polls | |
CN106980692B (en) | Influence calculation method based on microblog specific events | |
Salloum et al. | Mining social media text: extracting knowledge from Facebook | |
CN105045822A (en) | Method for monitoring similar users of specific user in microblog | |
Baatarjav et al. | Group recommendation system for facebook | |
Kywe et al. | On recommending hashtags in twitter networks | |
McCorriston et al. | Organizations are users too: Characterizing and detecting the presence of organizations on twitter | |
Chua et al. | So fast so good: An analysis of answer quality and answer speed in community Question-answering sites | |
Zhang et al. | Your age is no secret: Inferring microbloggers' ages via content and interaction analysis | |
Amancio et al. | Comparing intermittency and network measurements of words and their dependence on authorship | |
CN105723402A (en) | Systems and methods for determining influencers in a social data network | |
Bhattacharya et al. | Deep twitter diving: Exploring topical groups in microblogs at scale | |
Chen et al. | Influencerank: An efficient social influence measurement for millions of users in microblog | |
CN103745000A (en) | Hot topic detection method of Chinese micro-blogs | |
Almquist et al. | Using radical environmentalist texts to uncover network structure and network features | |
Song et al. | RT^2M: Real-time twitter trend mining system | |
Noro et al. | Twitter user rank using keyword search | |
Zhao et al. | Exploring the choice under conflict for social event participation | |
Zhai et al. | Innovation adoption: Broadcasting versus virality | |
Ali et al. | Towards the discovery of influencers to follow in micro-blogs (twitter) by detecting topics in posted messages (tweets) | |
CN114461879B (en) | Semantic social network multi-view community discovery method based on text feature integration | |
CN107729455A (en) | A kind of social network opinion leader sort algorithm based on multidimensional characteristic analysis | |
Kang et al. | A hybrid approach for paper recommendation | |
Hamzehei et al. | Collaborative topic regression for predicting topic-based social influence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20151111 |