CN109447833A

CN109447833A - A method for discovering interest communities among large-scale microblog users

Info

Publication number: CN109447833A
Application number: CN201811124489.8A
Authority: CN
Inventors: 申彦
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2019-03-08

Abstract

The invention discloses a method for discovering interest communities among large-scale microblog users, belonging to the technical field of data mining. The method comprises: (1) acquiring data to serve as the data source for microblog user interest community discovery; (2) performing data testing and preprocessing; (3) standardizing the data, converting unstructured data into structured data suitable for cluster analysis; (4) performing cluster analysis with SSLOK-means improved by the Calinski-Harabasz function; (5) determining the best number of clusters k with the CH validity function, completing the interest community discovery model. The invention allows cluster analysis of microblog user big data within limited memory while automatically determining the number of clusters, supporting the optimization of personalized microblog services and the improvement of marketing revenue.

Description

A method for discovering interest communities among large-scale microblog users

Technical field

The invention belongs to the technical field of data mining, and in particular relates to a method for discovering interest communities among large-scale microblog users.

Background art

As an open network platform, microblogging gives users a broad space for sharing and communication. Thanks to its real-time, concise, and open nature, it has attracted a huge user base. According to the 2017 Sina Weibo user development report, as of September 30, 2017, Sina Weibo's monthly active users had reached a new high of 376 million, a 27% increase over the same period of 2016. Faced with a user base that grows by the day, how to provide users with more accurate personalized services is a major problem that microblog operators urgently need to solve. The massive data that microblog users generate on the platform contains rich behavioral information; analyzing this user data to find groups of users with similar interest preferences can support the optimization of personalized services on microblog platforms.

A review of research by domestic and foreign scholars shows that current work on data mining in microblogging concentrates on information propagation, user characteristics, and user network structure, while research on segmenting microblog users into interest groups is comparatively scarce. Some researchers have segmented microblog users with clustering algorithms and confirmed that doing so is feasible, but they used only the plain K-means algorithm. On the one hand, this does not address the shortcoming that K-means requires the value of k to be set manually; on the other hand, as the microblog user data sets to be clustered keep growing, the execution of K-means is limited by the memory actually available to the system.

Summary of the invention

In view of the deficiencies of the prior art, the present invention proposes a method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with the Calinski-Harabasz function, supporting microblog operators in mining user big data, optimizing personalized microblog services, and improving marketing revenue.

A method for discovering interest communities among large-scale microblog users, comprising the following steps:

Step 1), acquiring the information of microblog users;

Step 2), data testing and preprocessing: including data testing, user filtering, classification of the accounts followed by microblog users, and representation of microblog user interests;

Step 3), data standardization: calculating the interest preference degree of microblog users and vectorizing the microblog data;

Step 4), clustering the microblog data with SSLOK-means improved by the Calinski-Harabasz function, automatically determining the number of clusters.

Further, the information of the microblog users includes basic user information and microblog account information; the basic user information includes user name, gender, region, and registration time, and the microblog account information includes the account's name, verification status, profile, follower count, and following count.

Further, the data testing includes a data availability test and a correlation test.

Further, the user filtering method is specifically: microblog users whose number of followed accounts is less than one tenth of the mean number of followed accounts across all microblog users are labeled "silent users" and removed from the data table.

Further, the method for classifying followed accounts is specifically: the "profile" and "verification" fields of the accounts a microblog user follows are used to identify different categories of account, and the accounts in the following list are classified accordingly.

Further, the representation of microblog user interests includes determining the interest set, removing invalid accounts, and mapping to the interest set. Determining the interest set means classifying the interests of microblog users, with reference to the classification systems of mainstream microblog platforms and the domain classification of "big V" accounts, to form the interest set. Removing invalid accounts means removing the accounts that cannot reflect a user's interest preferences and keeping those that clearly do. Mapping to the interest set means that for any account in the account set there is always an interest in the interest set to which that account corresponds.

Further, the interest preference degree of the microblog users is $P(h_i) = \mathrm{Count}(h_i) / \mathrm{Count}(L)$, where $\mathrm{Count}(h_i)$ ($i = 1, 2, \dots, 10$) is the number of accounts in the followed-account set that are mapped to interest $h_i$, $\mathrm{Count}(L)$ is the number of accounts in each microblog user's followed-account set, and $P(h_i) \in [0, 1]$; to improve numerical accuracy in the subsequent clustering stage, the user's preference degree $P(h_i)$ is multiplied by 100.

Further, the microblog data vectorization is specifically: microblog data is treated as documents, so that a microblog user's data corresponds to a text document, the interest set corresponds to the feature terms, and the user's final interest preference degree corresponds to the weight; the vector space model is used to convert the microblog data into a two-dimensional table of numerical values, completing the vectorization.

Further, step 4) is specifically:

1. constructing a label set and keeping it resident in main memory;

2. executing the SSLOK-means clustering algorithm for each value of the cluster-number parameter k from 2 up to an upper bound determined by num, where num is the total number of data points;

3. evaluating the quality of the clustering results with the Calinski-Harabasz function.

Further, executing the SSLOK-means clustering algorithm is specifically: data points are read sequentially from the data set to be clustered into memory until the memory space is full; semi-supervised clustering in limited memory is carried out on the data points in memory, the information of compressed data SS, the information of discarded data OUTS, and the label set labels, until convergence; the data points in memory are then compressed or discarded according to their membership and aggregation; finally, it is judged whether data points remain in the data set to be clustered: if so, reading continues, otherwise the clustering process ends.

The beneficial effects of the invention are as follows:

The present invention draws on the idea of semi-supervised learning and performs cluster analysis with the SSLOK-means clustering algorithm improved by Calinski-Harabasz. First, the Calinski-Harabasz validity function solves the problem that the SSLOK-means clustering algorithm requires k to be set manually in advance. Second, for large-scale data sets the SSLOK-means algorithm uses a label set resident in main memory to guide the clustering process, which further improves clustering efficiency and the quality of the clustering results compared with K-means. The invention applies this technique to a microblog user sample data set, performs cluster analysis on it, divides the microblog users into groups by interest preference according to the clustering results, and proposes targeted personalized recommendation suggestions for each interest community, which has theoretical guiding significance and practical value for improving personalized microblog services.

Brief description of the drawings

Fig. 1 is a flow chart of the method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with Calinski-Harabasz;

Fig. 2 is a diagram of the SSLOK-means clustering process;

Fig. 3 shows the clustering result after SSLOK-means clusters once more and converges;

Fig. 4 compares the age distribution of the sample data with that of the standard data.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and embodiments, but the scope of protection of the invention is not limited thereto.

The present invention applies a method for discovering interest communities among large-scale microblog users, based on SSLOK-means improved with Calinski-Harabasz, to a microblog user sample data set, performs cluster analysis on it, divides the microblog users into groups by interest preference according to the clustering results, and proposes targeted personalized recommendation suggestions for each interest community.

As shown in Fig. 1, the method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with Calinski-Harabasz comprises the following steps:

Step 1), data acquisition

Microblog user information is crawled with a web crawler tool, including user name, gender, region, and registration time, as well as the name, verification status, profile, follower count, and following count of each microblog account in the user's following list.

Step 2), data detection and pretreatment

Step 2.1), data detection

Before cluster analysis, it is necessary to test whether the sample data can represent the population. Three indicators (gender, age, and region) are selected, and the sample data is compared with the standard data.

Before clustering, it is also essential to detect whether the variables are correlated. If variables are strongly correlated, the weight of the meaning they represent is inflated, and clustering on such variables is not meaningful. The Pearson correlation coefficient, the Spearman correlation coefficient, or similar measures can be used for the test.
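By way of illustration only, the correlation test may be sketched in Python as follows. The function names, the column layout, and the sample values are assumptions introduced for this example; the 0.3 threshold is the figure the embodiment reports its ten interest variables as staying below, not a general rule.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (dev_x * dev_y)

def correlated_pairs(columns, threshold=0.3):
    """Return the pairs of variables whose |r| reaches the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Hypothetical preference columns (one value per user), not data from the embodiment.
columns = {
    "music": [10.0, 20.0, 30.0, 40.0],
    "sports": [40.0, 30.0, 20.0, 10.0],
    "food": [5.0, 5.0, 5.0, 5.0],
}
print(correlated_pairs(columns))
```

An empty result would indicate that no pair of variables needs to be merged or dropped before clustering.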

Step 2.2), user filtering

Among the crawled microblog users there are "silent users", whose main characteristic is that their following list contains very few microblog accounts; such a list cannot truly reflect the user's interest preferences, so these users must be removed. Therefore, microblog users whose number of followed accounts is less than one tenth of the average number of followed accounts across all microblog users are labeled "silent users" and removed from the data table.
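By way of illustration only, the one-tenth-of-the-mean rule may be sketched as follows. The function name `filter_silent_users`, the dictionary layout, and the sample counts are assumptions introduced for this example.

```python
from statistics import mean

def filter_silent_users(follow_counts, fraction=0.1):
    """Split users into (kept, silent) by the one-tenth-of-the-mean rule."""
    threshold = mean(follow_counts.values()) * fraction
    kept = {user: n for user, n in follow_counts.items() if n >= threshold}
    silent = sorted(user for user, n in follow_counts.items() if n < threshold)
    return kept, silent

# Hypothetical followed-account counts per user (not data from the embodiment).
counts = {"u1": 200, "u2": 150, "u3": 5, "u4": 245}
kept, silent = filter_silent_users(counts)
print(silent)
```

In this sample the mean is 150, so the threshold is 15 and only the user following 5 accounts is removed.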

Step 2.3), classification of the accounts followed by microblog users

The "profile" and "verification" fields of the accounts a microblog user follows are used to identify different categories of account, and the accounts in the following list are classified accordingly. According to the mainstream classification, the followed accounts are divided into three categories: relatives and friends, celebrities, and functional accounts. Relatives-and-friends accounts are those of the user's friends and relatives; celebrity accounts are those of representative well-known figures in a particular field; functional accounts are accounts with a certain social function, generally including officially verified accounts of various industries and information accounts of news media.

Step 2.4), representation of microblog user interests

On the one hand, not every account in the following list reflects the user's interest preferences, so accounts that cannot reflect user interest must be removed. On the other hand, functionally similar accounts reflect the same kind of interest preference, so the accounts must be consolidated and classified.

Step 2.4.1), determining the interest set

With reference to the classification systems of mainstream microblog platforms such as Sina, Tencent, and Sohu microblog, and to the domain classification of "big V" accounts, the interests of microblog users are divided into ten categories forming the interest set H: fashion shopping, food, travel photography, sports, film and TV entertainment, music, games and animation, literary reading, industry/work, and IT/digital.

Step 2.4.2), removing invalid accounts

Let the set of accounts in a microblog user's following list be $P = (p_1, p_2, p_3, \dots, p_n)$ ($n \in \mathbb{N}^+$), where $p_i$ ($1 \le i \le n$) is one account in the user's following list. Accounts classified as "relatives and friends", which cannot reflect the user's interest preferences, are removed, and the accounts that clearly reflect user interest are kept to form the followed-account set $L = (l_1, l_2, l_3, \dots, l_m)$ ($m \in \mathbb{N}^+$), where $L \subseteq P$. The number $\mathrm{Count}(L)$ of accounts in each microblog user's followed-account set L is recorded.

Step 2.4.3), mapping to the interest set

In the interest set H, an interest $h_i$ can be found for any account $l_j$ in the account set L such that $l_j$ corresponds to $h_i$, which determines one interest of the microblog user. Several followed accounts may correspond to the same interest. For each interest $h_i$ ($i = 1, 2, \dots, 10$), the number $\mathrm{Count}(h_i)$ of accounts in L mapped to it is recorded. At this point the account information in each microblog user's account set L has been converted into interest variables.
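By way of illustration only, steps 2.4.2) and 2.4.3) may be sketched as follows. The English interest names are renderings of the ten categories listed above; the function name, the `account_to_interest` lookup, and the sample accounts are assumptions introduced for this example, since the source does not specify how the account-to-interest mapping is stored.

```python
from collections import Counter

# The ten interest categories of the interest set H.
INTERESTS = [
    "fashion shopping", "food", "travel photography", "sports",
    "film and TV entertainment", "music", "games and animation",
    "literary reading", "industry/work", "IT/digital",
]

def count_interests(followed, account_to_interest):
    """Return (Count(L), [Count(h_1), ..., Count(h_10)]) for one user.

    Accounts absent from the lookup (for example relatives and friends)
    are treated as invalid and dropped, which yields the set L.
    """
    valid = [account for account in followed if account in account_to_interest]
    per_interest = Counter(account_to_interest[account] for account in valid)
    return len(valid), [per_interest.get(h, 0) for h in INTERESTS]

# Hypothetical following list and lookup table.
lookup = {"band_a": "music", "singer_b": "music", "club_c": "sports"}
total, counts = count_interests(["band_a", "singer_b", "club_c", "cousin"], lookup)
print(total, counts)
```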

Step 3), data standardization

After data preprocessing, the number Count(L) of accounts in each microblog user's followed-account set L and each user's counts Count(h_i) (i = 1, 2, ..., 10) for the interests in interest set H are obtained. Drawing on the vector space model, the existing data is standardized, converting unstructured data into structured data.

Step 3.1), calculating the interest preference degree of microblog users

A microblog user's preference for any interest can be reflected by the ratio of the number of accounts mapped to that interest, $\mathrm{Count}(h_i)$ ($i = 1, 2, \dots, 10$), to the number $\mathrm{Count}(L)$ of accounts in the user's followed-account set L. The user's preference degree for interest $h_i$ is $P(h_i) = \mathrm{Count}(h_i) / \mathrm{Count}(L)$, where $P(h_i) \in [0, 1]$; the larger the value, the more the user prefers that interest. To improve numerical accuracy in the subsequent clustering stage, $P(h_i)$ is multiplied by 100, so the final interest preference degree is $G(h_i) = P(h_i) \times 100$, where $G(h_i) \in [0, 100]$. In this way a user's preference for each interest is converted into a numerical value. Proceeding likewise, the preference degrees $G(h_1), G(h_2), \dots, G(h_{10})$ for every interest in $H = (h_1, h_2, h_3, \dots, h_{10})$ are calculated for the user, and then for all users.
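By way of illustration only, the calculation of G(h_i) may be sketched as follows. The function name and the sample counts are assumptions introduced for this example.

```python
def preference_degrees(interest_counts, total):
    """Return [G(h_1), ..., G(h_n)] with G(h_i) = 100 * Count(h_i) / Count(L)."""
    if total <= 0:
        raise ValueError("Count(L) must be positive; silent users are filtered earlier")
    return [100.0 * count / total for count in interest_counts]

# Hypothetical Count(h_i) values for one user who follows 20 valid accounts.
counts = [5, 0, 0, 3, 0, 2, 0, 0, 0, 10]
g = preference_degrees(counts, sum(counts))
print(g)
```

Because every account in L maps to exactly one interest, the ten values of G sum to 100 for each user.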

Step 3.2), microblog data vectorization

The vector space model can be described as follows: for a text document $d_i$, the feature terms $t_j$ ($j = 1, \dots, n$) are mutually distinct words, and the weight of $t_j$ in document $d_i$ is denoted $w_{ij}$, so that document $d_i$ can be expressed as $V(d_i) = ((t_1, w_{i1}), (t_2, w_{i2}), \dots, (t_n, w_{in}))$. A microblog user's data is made to correspond to the text document $d_i$, the interests $h_i$ ($i = 1, 2, \dots, 10$) to the feature terms $t_j$ ($j = 1, \dots, n$), and the user's final interest preference degree $G(h_i)$ to the weight $w_{ij}$. With this method the microblog user data is vectorized, and the users' preferences for the interests finally become a two-dimensional table of numerical values.
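By way of illustration only, the conversion into a two-dimensional numerical table may be sketched as follows. The function name, the per-user dictionary layout, and the sample values are assumptions introduced for this example.

```python
def to_table(user_prefs, interests):
    """Build (header, rows): one row per user, one column per interest.

    user_prefs maps a user id to {interest: G value}; interests with no
    followed account get weight 0.0, mirroring a zero term weight in the
    vector space model.
    """
    header = ["user"] + list(interests)
    rows = [
        [user] + [prefs.get(interest, 0.0) for interest in interests]
        for user, prefs in sorted(user_prefs.items())
    ]
    return header, rows

# Hypothetical preference degrees for two users over two interests.
prefs = {"u2": {"music": 40.0, "sports": 60.0}, "u1": {"music": 100.0}}
header, rows = to_table(prefs, ["music", "sports"])
print(header)
print(rows)
```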

Step 4), clustering the microblog data with SSLOK-means improved by the Calinski-Harabasz function

The CH function is defined as $CH(k) = \dfrac{\operatorname{tr}B(k)/(k-1)}{\operatorname{tr}W(k)/(n-k)}$, where $\operatorname{tr}B(k) = \sum_{j=1}^{k} n_j \lVert u_j - u \rVert^2$ is the trace of the between-cluster scatter matrix, $n_j$ is the number of tuples in the j-th cluster, $u$ is the mean of the entire data set, and $u_j$ is the mean of the j-th cluster; $\operatorname{tr}W(k) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_i - u_j \rVert^2$ is the trace of the within-cluster scatter matrix, $n_j$ is the number of tuples in the j-th cluster, $k$ is the number of clusters, $n$ is the total number of tuples, and $x_i$ is a tuple in cluster j. The CH function is the ratio of between-cluster dispersion to within-cluster dispersion; the larger the CH value, the better the clustering quality, so the number of clusters at which CH is largest is the best.
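By way of illustration only, the CH value of a labeled data set may be computed as in the following sketch, which uses the standard Calinski-Harabasz form (between-cluster trace over k - 1, divided by within-cluster trace over n - k). The function name and the four sample points are assumptions introduced for this example.

```python
def calinski_harabasz(points, labels):
    """CH = (trB / (k - 1)) / (trW / (n - k)) for points grouped by labels."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH needs 2 <= k < n")
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    tr_b = tr_w = 0.0
    for members in clusters.values():
        n_j = len(members)
        u_j = [sum(p[d] for p in members) / n_j for d in range(dim)]
        # Between-cluster part: n_j * ||u_j - u||^2
        tr_b += n_j * sum((a - b) ** 2 for a, b in zip(u_j, overall))
        # Within-cluster part: sum of ||x - u_j||^2 over the cluster
        tr_w += sum(sum((a - b) ** 2 for a, b in zip(p, u_j)) for p in members)
    if tr_w == 0:
        return float("inf")  # every point coincides with its cluster mean
    return (tr_b / (k - 1)) / (tr_w / (n - k))

# Two tight, well-separated groups (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])
bad = calinski_harabasz(points, [0, 1, 0, 1])
print(good, bad)
```

The grouping that matches the two visible groups scores far higher than the one that splits them, which is the property step 4.3) relies on when selecting k.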

Step 4.1), constructing the label set, resident in main memory

Taking the cost of manual labeling into account, a small portion of the data set is labeled manually with the category each record belongs to. This label set resides in main memory and guides the SSLOK-means clustering process, improving clustering efficiency and the quality of the clustering results.

Step 4.2), for each value of the cluster-number parameter k from 2 up to an upper bound determined by num (the total number of data points), the SSLOK-means clustering algorithm is executed; the SSLOK-means procedure is as follows:

Step 4.2.1), data points are read sequentially from the data set into memory until the memory space is full, so that the data is clustered in batches.

Step 4.2.2), semi-supervised clustering in limited memory is carried out on the data points in memory, SS (the information of compressed data), OUTS (the information of discarded data), and labels (the label set) until convergence. The clustering process is shown in Figs. 2 and 3. During this step the membership of an SS may change, or a new SS may appear, such as the "New" set shown in Fig. 3. The membership of OUTS cannot change: it always belongs to its corresponding MC (main cluster) and is used to record that part of the clustering information. During this process OUTS and SS do not store the specific data points; each is instead represented by a triple $(sum_j, sumsq_j, num_j)$, where $sum_j \in \mathbb{R}^n$ and $sumsq_j \in \mathbb{R}^n$ are the per-dimension sum and sum of squares of the data points, $num_j$ is the number of data points in the j-th cluster, and $\mathbb{R}^n$ denotes the real vector space.
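By way of illustration only, a triple of this kind may be maintained as in the following sketch. It assumes, as the names suggest, that sum and sumsq are per-dimension running totals; the class name, its methods, and the sample points are assumptions introduced for this example and are not taken from the source.

```python
class CompressedSet:
    """Summary (sum, sumsq, num) of a set of points; the points are not kept."""

    def __init__(self, dim):
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim
        self.num = 0

    def add(self, point):
        """Fold one data point into the summary."""
        for d, value in enumerate(point):
            self.sum[d] += value
            self.sumsq[d] += value * value
        self.num += 1

    def merge(self, other):
        """Fold another summary into this one (triples add component-wise)."""
        self.sum = [a + b for a, b in zip(self.sum, other.sum)]
        self.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]
        self.num += other.num

    def centroid(self):
        return [s / self.num for s in self.sum]

    def variance(self):
        """Per-dimension population variance, E[x^2] - E[x]^2."""
        return [sq / self.num - (s / self.num) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

summary = CompressedSet(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0)]:
    summary.add(point)
print(summary.centroid(), summary.variance())
```

The centroid and the spread of the set remain available from the triple alone, which is why the underlying points can be released from main memory.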

Assume OUTS ∈ MC and x_j ∈ MC, j = 1, ..., n, where n is the target number of clusters. If part of the data set meets the compression condition (evaluated with the number of clusters set to N times the target number of clusters, where N is generally 4 and β is a preset small value indicating cluster tightness), it can be replaced by a triple. Each main cluster MC may be expressed as:

SS ∈ MC, OUTS ∈ MC, x_j ∈ MC, j = 1, ..., n, where count denotes counting.

Within a single clustering pass, so that the algorithm terminates in a controlled way, a small termination threshold ε can be defined, for example ε = 1e-8. If, between two consecutive clustering iterations, the total displacement of all cluster centers satisfies move ≤ ε, this clustering pass is considered to have converged.
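By way of illustration only, the termination test may be sketched as follows. Measuring the displacement of each center by Euclidean distance is an assumption of this example, as are the function names; the source states only that the total displacement `move` of all cluster centers is compared with ε.

```python
from math import dist

def total_shift(old_centers, new_centers):
    """Sum of the Euclidean displacements of all cluster centers."""
    return sum(dist(old, new) for old, new in zip(old_centers, new_centers))

def has_converged(old_centers, new_centers, eps=1e-8):
    """True when the total displacement between two iterations is <= eps."""
    return total_shift(old_centers, new_centers) <= eps

# Hypothetical centers from two consecutive iterations.
before = [(0.0, 0.0), (10.0, 10.0)]
after = [(3.0, 4.0), (10.0, 10.0)]
print(total_shift(before, after), has_converged(before, after))
```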

Step 4.2.3), compressing and discarding the data points in memory

A set of data points that meets the compression condition is replaced by an SS, and the corresponding data points are removed from main memory. Data points that meet the discard condition are moved out of main memory and into the OUTS set. This frees space in main memory for reading in further data points.

Step 4.2.4), it is judged whether data points remain in the data set to be clustered. If so, the procedure jumps to step 4.2.1); otherwise the clustering process ends, the algorithm terminates, and the CH value is calculated.
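By way of illustration only, the read, cluster, and repeat loop of steps 4.2.1) to 4.2.4) may be sketched as follows. A fixed batch size stands in for "until the memory space is full", and the `cluster_batch` callback is a placeholder for the clustering and compression of steps 4.2.2) and 4.2.3); both are assumptions of this example, as are the function names.

```python
from itertools import islice

def read_batches(points, batch_size):
    """Yield successive lists of at most batch_size points, in input order."""
    iterator = iter(points)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # no data points remain: the clustering pass ends
        yield batch

def process_stream(points, batch_size, cluster_batch):
    """Feed every batch to cluster_batch and return the number of batches."""
    batches = 0
    for batch in read_batches(points, batch_size):
        cluster_batch(batch)  # placeholder for steps 4.2.2) and 4.2.3)
        batches += 1
    return batches

# 567 records read 100 at a time, matching the sizes quoted in Embodiment 1.
sizes = []
n_batches = process_stream(range(567), 100, lambda batch: sizes.append(len(batch)))
print(n_batches, sizes)
```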

Step 4.3), the quality of the clustering results is evaluated with the Calinski-Harabasz function. The k with the largest CH value is selected as the best number of clusters, and the corresponding clustering result is the optimal division of microblog users into interest communities.

Implementation conditions: the experimental environment is a ThinkPad 20FWA00VCD with a Core i7-6700 CPU @ 2.6 GHz, 8 GB of memory, and a 64-bit Windows 10 operating system, using a web crawler tool and the Matlab platform.
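By way of illustration only, the selection of k in step 4.3) may be sketched as follows. The `cluster_fn` and `score_fn` parameters stand in for SSLOK-means and the CH function respectively; the toy stand-ins and the score values in the demonstration are placeholders invented for this example, not results from the embodiment.

```python
def select_cluster_count(data, cluster_fn, score_fn, k_values):
    """Cluster once per candidate k and return (best_k, {k: score})."""
    scores = {}
    for k in k_values:
        labels = cluster_fn(data, k)
        scores[k] = score_fn(data, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy stand-ins so the sketch runs on its own.
placeholder_scores = {2: 1.0, 3: 4.5, 4: 2.0}
toy_cluster = lambda data, k: [i % k for i in range(len(data))]
toy_score = lambda data, labels: placeholder_scores[len(set(labels))]

best_k, scores = select_cluster_count(list(range(6)), toy_cluster, toy_score, [2, 3, 4])
print(best_k, scores)
```

Passing the clustering and scoring routines as parameters keeps the selection loop independent of how SSLOK-means itself is implemented.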

Embodiment 1:

Cluster analysis is performed on Sina Weibo user big data to discover microblog user interest communities and provide personalized services for the different communities, supporting the optimization of personalized microblog services and the improvement of marketing revenue.

1. Using the "Octopus Collector" tool, the Sina Weibo information of 627 ordinary users was captured. Filtering the collected information of the 627 microblog users removed 60 "silent users", leaving the information of 567 valid microblog users and completing the user filtering process. The preprocessed Sina Weibo data of this embodiment is shown in Table 1, which takes the first 5 rows as an example.

Table 1 User interest preference vectors

2. Testing the sample data:

1) Before cluster analysis, it is necessary to test whether the sample data can represent the population; otherwise clustering is not meaningful. Based on the "Research Report on Microblog Media Characteristics and User Usage" officially released in 2010 (hereinafter "the Report"), three indicators (gender, age, and region) are selected, and the sample data is compared with the standard data. The results are as follows:

1. Gender. Of the 567 Sina Weibo users collected in this experiment, 243 are male and 324 are female, accounting for 42.9% and 57.1% respectively. In the Report, the standard proportions of microblog users are 39% male and 61% female. The gender ratio of the sample differs little from the standard data, so the data is credible.

2. Age. Of the 567 users collected in this experiment, 39 are under 18 (6.88%), 364 are aged 18-30 (64.20%), and 164 are over 30 (28.92%). Fig. 4 compares these with the standard age proportions in the Report. The sample data is essentially consistent with the standard data.
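By way of illustration only, the gender and age proportions quoted above can be reproduced from the stated head counts as follows. The function name is an assumption introduced for this example; the counts are those given in the text.

```python
def percentage_shares(counts, digits=2):
    """Return each category's share of the total, in percent, rounded."""
    total = sum(counts.values())
    return {name: round(100 * value / total, digits) for name, value in counts.items()}

age = percentage_shares({"under 18": 39, "18-30": 364, "over 30": 164})
gender = percentage_shares({"male": 243, "female": 324}, digits=1)
print(age)
print(gender)
```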

3. Region. The sample data collected in this experiment covers Jiangsu, Hebei, Liaoning, Heilongjiang, Beijing, Henan, Tianjin, Hainan, Xinjiang, Guangxi, Guangdong, Shenzhen, Shandong, Fujian, Jilin, Hong Kong, Zhejiang, Hubei, overseas, Ningxia, Sichuan, Chongqing, and Yunnan; the sample is broadly distributed and credible.

Therefore, the sample data selected in this embodiment can represent the overall situation of domestic microblog users and passes the availability test.

2) Before clustering, it is essential to detect whether the variables are correlated. If variables are strongly correlated, the weight of the meaning they represent is inflated, and clustering on such variables is not meaningful. This embodiment applies the Pearson correlation test to the ten interests; the results are shown in Table 2.

Table 2 Pearson correlation test results

As Table 2 shows, the correlations among the ten interest variables are all below 0.3; the ten interests are therefore only weakly correlated, and the sample data passes the variable correlation test.

The data analysis shows that the sample data has passed the availability test and the variable correlation test and can be used for cluster analysis.

3. Cluster analysis with CH-improved SSLOK-means

The Semisupervised Labels Onescan K-means (SSLOK-means) algorithm is implemented on the MATLAB platform. Guided by the label set resident in main memory, 100 sample records are read and clustered at a time; the range of the cluster number k is set to 2 ≤ k ≤ 10, giving 9 sets of results. The Calinski-Harabasz (CH) validity function is applied to these 9 sets of results to obtain the CH value for each k; the intermediate output of the algorithm is shown in Table 3.

Table 3 Comparison of CH values

As Table 3 shows, the validity function CH is largest when k = 6. The SSLOK-means clustering algorithm improved by Calinski-Harabasz therefore automatically selects k = 6 and completes the clustering, giving the division of Sina Weibo users into interest communities.

4. Description of the Sina Weibo user interest community division results

After SSLOK-means cluster analysis and evaluation with the CH criterion function, the most reasonable number of clusters is 6. The number of samples in each cluster is shown in Table 4, the distances between cluster centers in Table 5, and the final cluster centers in Table 6.

Table 4 Number of records in each cluster (6 clusters)

Table 5 Distances between cluster centers (6 clusters)

Table 5 shows that the cluster centers differ markedly: the distance between the centers of clusters 2 and 3 is the largest, 55.5, and the distance between the centers of clusters 5 and 6 is the smallest, 26.8.

Table 6 Final cluster centers (6 clusters)

To describe more intuitively how much the individuals in each cluster prefer each interest, the cluster-center values are discretized: values in [0, 2) indicate very weak interest, [2, 10) weak interest, [10, 15) medium interest, [15, 25) fairly strong interest, and [25, 100] strong interest. The processed cluster centers are shown in Table 7.

Table 7 Interest discretization results
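By way of illustration only, the discretization intervals above may be applied as in the following sketch. The function name and the English level labels are assumptions introduced for this example; the interval boundaries are those stated in the text.

```python
from bisect import bisect_right

# Upper boundaries of the first four intervals; the fifth runs up to 100.
BOUNDS = [2, 10, 15, 25]
LEVELS = ["very weak", "weak", "medium", "fairly strong", "strong"]

def interest_level(value):
    """Map a cluster-center value in [0, 100] to its interest level."""
    if not 0 <= value <= 100:
        raise ValueError("preference degrees lie in [0, 100]")
    # bisect_right counts the boundaries <= value, so a value exactly on a
    # boundary falls into the interval that starts there, e.g. 2 -> [2, 10).
    return LEVELS[bisect_right(BOUNDS, value)]

print([interest_level(v) for v in (1.9, 2, 12.5, 15, 25, 100)])
```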

5. Analysis of the Sina Weibo user interest community division results and personalized recommendation suggestions

Cluster analysis of the microblog user information with the SSLOK-means algorithm improved by the CH function finally divides the users into 6 interest communities with different interest preferences, each with its own characteristics. Looking within the 6 clusters, that is, reading the interest discretization results (Table 7) column by column, the distinct characteristics of the 6 classes of users emerge. In the order cluster 1 to cluster 6, the 6 classes of users can be summarized as: knowledge type, material type, IT type, career type, network type, and balanced type. For each type of interest community, personalized recommendation suggestions are given according to Table 7.

(1) Knowledge type. The sample size is medium, 90 users. Users in this interest community have a strong interest in "literary reading", a fairly strong interest in "industry/work", and a medium interest in "fashion shopping", but weaker interest in the other categories, with especially weak interest in sports. These users like to read, often follow news in their industry, and are also fairly interested in fashion shopping. Their needs therefore center on reading and on obtaining practical information; they value improving their quality of life and their own abilities, and pursue a life of quality.

For these users, the microblog platform should regularly push accounts offering in-depth literary and reading content, at about 40% of each push (Table 6). It should also push some practical, high-quality industry-related accounts, at about 20% of each push, and accounts on fashion matching and trendy brands, at about 12% of each push. In addition, since these users' interest in sports is very weak, a small number of sports accounts can be pushed to them after several regular pushes, as an attempt to induce interest.

(2) Material type. The sample size is relatively large, 123 users. Users in this interest community have a strong interest in "fashion shopping", a medium interest in "food" and "literary reading", and weaker interest in other fields. These users follow trends and pursue material life; a fashionable lifestyle is very attractive to them. They also have some interest in food and reading.

For these users, the platform should regularly push in-depth microblog accounts on fashion brands, trends, clothing matching, and celebrity trendsetters, at about 50% of each push. Once the user's location is known, local food and well-reviewed restaurants can also be recommended, at about 12% of each push; this share can be adjusted according to when the user is on the platform. For example, it can be raised if the user is on the platform at mealtimes and lowered outside mealtimes. Likewise, since these users' interest in sports is very weak, a small number of sports accounts can be pushed to them after several regular pushes to induce interest.

(3) IT class.Sample size is less, is 55.User in such community of interest has " IT is digital " dense emerging Interest, while having medium interest to " literature works reading " and " industry work ".This kind of user belongs to poly-talented user, likes paying close attention to Internet dynamic, the forward position knowledge of IT field have keen interest to electronic product, the relevent information of Concerned Industry work often, Focus on the promotion of self-skill.

For such user, it should which the high-quality account for recommending IT internet area to it periodically pushes newest electricity to them The information of sub- digital product, the push ratio of such account about once push 42%；Periodically recommend good text to them Word read account, the push ratio of such account about once push 11%；It is excellent that industry Zone Information correlation periodically is pushed to them Matter account, the push ratio of such account about once push 11%；In addition, since such user is extremely weak to interest in music, The push that can carry out a small amount of music account after interval repeatedly push to such user, carries out interest induction and attempts.In addition, This kind of poly-talented user often familiar internet the relevant technologies can the problem of if there is related internet the relevant technologies To ask for help to them.

(4) cause type.Sample size is most, is 142.User in such community of interest is to " industry work " class microblogging There is strong interest, have medium interest to " fashionable shopping " and " word read ", to " sport ", " video display amusement " and " game is dynamic It is unrestrained " interest is extremely weak.They pay close attention to Domestic News class relevant to specific industry, the information of Practical Skill class.Such user Belong to cause type user, they focus on promoting self-skill, it is desirable to be able to obtain trade information relevant to oneself field.

For such users, once the specific field selected by the user is known, in-depth knowledge accounts in that field should be the main recommendation, making up about 50% of each push. High-quality text reading accounts should be pushed regularly, making up about 15% of each push. Some light fashion shopping information can also be recommended regularly to meet their need to improve their quality of life, making up about 13% of each push. In addition, since these users have a very weak interest in "sports", "film and TV entertainment" and "games and animation", a small amount of related information can be pushed to them once every several pushes for interest induction.

(5) Network type. The sample size is relatively small, at 37 users. Users in this interest community have a strong interest in the "sports" field: they are keen to follow sports stars, official event sites and sports microblog accounts. They also have a fairly strong interest in "text reading" and a medium interest in "film and TV entertainment". These users are mostly university students with ample free time; they are quite interested in learning-related knowledge and have some interest in entertainment information.

For such users, information such as event previews and sports stars for the specific sports selected by the user should be recommended in a targeted way, making up about 30% of each push. High-quality accounts for text reading, examination information and skill improvement should be recommended regularly, making up about 17% of each push. In addition, since these users have some interest in film and TV entertainment, messages related to TV series and films can be pushed to them regularly, making up about 14% of each push.

(6) Balanced type. The sample size is relatively large, at 120 users. Users in this interest community have no particular preference for any single interest; their interest in "food", "fashion shopping", "travel and photography", "film and TV entertainment", "text reading" and "industry and workplace" information is roughly balanced. These balanced users have broad interests, and it can be inferred that they tend to be extroverted and readily accept new things.

For such users, information from every interest category should be recommended, with each category of account making up about 15% of each push. In addition, messages about new things and microblog accounts in the social category, such as Today's tops and association, can also be recommended to them.

The above analyzes the users of the 6 different interest communities in detail, describes the typical characteristics of each of the 6 user types, and gives targeted suggestions on the content of personalized recommendations for each type. Personalized recommendation based on the characteristics of different user groups can improve the user experience, increase user stickiness, and in turn improve the economic benefit of microblog operation.
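The push shares quoted above for the IT, career and network types can be turned into per-batch counts with simple arithmetic. The following is a minimal sketch only: the category keys, the function name `allocate_push` and the "other" bucket for the unallocated remainder are illustrative assumptions and do not appear in the original text.

```python
# Push shares (percent of one push) quoted in the analysis above.
# Category keys are illustrative labels, not identifiers from the source.
PUSH_SHARES = {
    "IT": {"it_digital": 42, "text_reading": 11, "industry_workplace": 11},
    "career": {"industry_workplace": 50, "text_reading": 15, "fashion_shopping": 13},
    "network": {"sports": 30, "text_reading": 17, "film_tv": 14},
}


def allocate_push(community, batch_size):
    """Return how many recommended accounts of each category fit in one push.

    Counts are rounded down; whatever is left over is reported under the
    hypothetical key "other" (e.g. occasional interest-induction content).
    """
    shares = PUSH_SHARES[community]
    counts = {cat: batch_size * pct // 100 for cat, pct in shares.items()}
    counts["other"] = batch_size - sum(counts.values())
    return counts
```

For example, with a push of 100 recommended accounts, the IT community would receive 42 IT and digital accounts, 11 text reading accounts and 11 industry and workplace accounts, leaving 36 slots for other content.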

The experimental results show that the big data clustering technique that improves SSLOK-means based on Calinski-Harabasz is an effective new technique for discovering microblog user interest communities, and that the discovered interest communities provide support for optimizing microblog personalized services and increasing marketing revenue.

The above embodiments merely illustrate the design ideas and features of the present invention, so that those skilled in the art can understand its content and implement it accordingly; the protection scope of the present invention is not limited to these embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed herein fall within the protection scope of the present invention.

Claims

1. A method for discovering interest communities of large-scale microblog users, characterized by comprising the following steps:

Step 1), obtaining the information of microblog users;

Step 2), data inspection and preprocessing: including data inspection, user filtering, classification of the accounts followed by microblog users, and representation of microblog user interests;

Step 4), clustering the microblog data using SSLOK-means improved with the Calinski-Harabasz function, automatically determining the number of clusters.

2. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the information of the microblog users includes basic user information and microblog account information; the basic user information includes user name, gender, region and registration time, and the microblog account information includes the account name, microblog verification, profile, number of followers and number of accounts followed.

3. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the data inspection includes data availability and relevance checks.

4. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the user filtering method is specifically: microblog users whose number of followed microblog accounts is less than one tenth of the average number of accounts followed by all microblog users are marked as "silent users" and removed from the data table.
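The filtering rule in claim 4 can be sketched as follows. This is an illustration under assumptions: the function name `drop_silent_users` and the dictionary-based input are hypothetical, and the source only states the one-tenth-of-average threshold.

```python
def drop_silent_users(followed_counts):
    """Split users into kept users and "silent users".

    followed_counts: dict mapping a user id to the number of microblog
    accounts that user follows (an assumed input format).
    A user is silent when that number is below one tenth of the average.
    """
    if not followed_counts:
        return {}, []
    average = sum(followed_counts.values()) / len(followed_counts)
    threshold = average / 10
    silent = sorted(u for u, n in followed_counts.items() if n < threshold)
    kept = {u: n for u, n in followed_counts.items() if n >= threshold}
    return kept, silent
```

With followed-account counts of 100, 200, 5 and 95, the average is 100 and the threshold is 10, so only the user following 5 accounts is removed.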

5. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the method for classifying the accounts followed by microblog users is specifically: the "profile" and "verification" fields of the microblog accounts followed by a microblog user are used to identify accounts of different categories, and the accounts in the following list are classified accordingly.

6. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the representation of microblog user interests includes determining the interest set, rejecting invalid accounts, and mapping the interest set; determining the interest set means classifying the interests of microblog users with reference to the classification systems of mainstream microblog platforms and the domain classification of microblog "big V" accounts, thereby forming the interest set; rejecting invalid accounts means removing accounts that cannot reflect a user's interest preference and retaining accounts that clearly reflect user interest; mapping the interest set means that, for any account in the account set, there is always an interest in the interest set to which that account corresponds.

7. The method for discovering interest communities of large-scale microblog users according to claim 1 or 2, characterized in that the interest preference degree of a microblog user is $P(h_i) = \dfrac{\mathrm{Count}(h_i)}{\mathrm{Count}(L)}$, where $\mathrm{Count}(h_i)$ $(i = 1, 2, \ldots, 10)$ is the number of accounts in the followed-account set that are mapped to interest $h_i$, $\mathrm{Count}(L)$ is the number of accounts in the followed-account set of each microblog user, and $P(h_i) \in [0, 1]$; to improve numerical accuracy in the subsequent clustering stage, the user's preference degree $P(h_i)$ for each interest is scaled up by a factor of 100.
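As a minimal sketch of the preference degree in claim 7, the ratio of the number of followed accounts mapped to an interest to the total number of followed accounts can be computed and scaled by 100 as shown below. The function name `interest_preferences`, the list-of-labels input and the example interest names are assumptions made for illustration.

```python
from collections import Counter


def interest_preferences(followed_interests, interests, scale=100):
    """Compute scale * Count(h_i) / Count(L) for every interest h_i.

    followed_interests: one interest label per valid followed account,
    i.e. the result of mapping the account set onto the interest set.
    interests: the full interest set, so absent interests get 0.
    """
    total = len(followed_interests)  # Count(L)
    if total == 0:
        return {h: 0.0 for h in interests}
    counts = Counter(followed_interests)  # Count(h_i) for each interest
    return {h: scale * counts[h] / total for h in interests}
```

A user who follows four valid accounts, three mapped to one interest and one to another, gets scaled preference degrees of 75 and 25 for those interests and 0 for all others.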

8. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that the microblog data vectorization is specifically: the microblog data are treated as documents, so that the data of each microblog user correspond to a text document, the interest set corresponds to the feature terms, and the user's final interest preference degrees correspond to the weights; the vector space model is used to convert the microblog data into a two-dimensional table of numerical values, completing the vectorization process.
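The vectorization in claim 8 amounts to building one numeric row per user with one column per interest. The sketch below assumes the preference degrees are already available as nested dictionaries; the name `to_vector_table` and this input layout are hypothetical.

```python
def to_vector_table(user_prefs, interests):
    """Convert per-user preference degrees into a two-dimensional table.

    user_prefs: dict mapping a user id to a dict {interest: weight}.
    interests: ordered interest set; it fixes the column order.
    Returns the row order (user ids) and the numeric rows.
    """
    users = sorted(user_prefs)
    table = [[float(user_prefs[u].get(h, 0.0)) for h in interests] for u in users]
    return users, table
```

Each row can then be passed to the clustering stage as the vector of one "document" (user), with interests missing from a user's record treated as weight 0.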

9. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that step 4) is specifically:

(1) constructing the label set and keeping it resident in memory;

(2) for each value of the cluster-number parameter k, from 2 up to the upper limit defined in terms of num, executing the SSLOK-means clustering algorithm, where num is the total number of data points;
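The idea of scanning k and keeping the value with the highest Calinski-Harabasz (CH) score can be sketched as below. This is not the SSLOK-means algorithm: plain in-memory k-means with a deterministic farthest-first initialisation is swapped in as a stand-in, there is no label set or memory limit, and the upper bound `k_max` is left to the caller because this text does not reproduce it. All function names are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iters=50):
    """Plain k-means (stand-in for SSLOK-means), farthest-first init."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels, centroids


def ch_index(points, labels, centroids):
    """CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n, k = len(points), len(centroids)
    overall = mean(points)
    within = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    between = sum(labels.count(j) * sq_dist(centroids[j], overall)
                  for j in range(k))
    if within == 0 or n <= k:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def best_k(points, k_max):
    """Run the clustering for k = 2..k_max and return the k with max CH."""
    scores = {}
    for k in range(2, k_max + 1):
        labels, centroids = kmeans(points, k)
        scores[k] = ch_index(points, labels, centroids)
    return max(scores, key=scores.get), scores
```

A larger CH value indicates compact, well-separated clusters, which is why the k with the maximum score is kept.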

10. The method for discovering interest communities of large-scale microblog users according to claim 1, characterized in that executing the SSLOK-means clustering algorithm is specifically: data points are read sequentially from the data set to be clustered into memory until the memory space is full; semi-supervised clustering is performed within the limited memory on the data points in memory, the information SS of the compressed data, the information OUTS of the removed data, and the label set labels, until convergence; the data points in memory are compressed or discarded according to their cluster membership and aggregation; it is then determined whether the data set to be clustered still contains data points: if so, reading of data continues; otherwise, the clustering process ends.
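The read-until-full, cluster, compress, continue loop of claim 10 can be illustrated with a much simplified single-pass sketch. It is an assumption-laden stand-in rather than SSLOK-means: there is no label set, no semi-supervision, no outlier information (OUTS) and no iteration to convergence inside each buffer; every buffered point is simply absorbed into per-cluster summary statistics (a count and a linear sum, playing the role of SS) and then discarded. The names `cluster_in_chunks` and `capacity` are hypothetical.

```python
def cluster_in_chunks(stream, initial_centroids, capacity):
    """Single-pass clustering with a bounded in-memory buffer.

    stream: any iterable of points (tuples), read sequentially.
    initial_centroids: starting centroids, one tuple per cluster.
    capacity: maximum number of raw points held in memory at once.
    """
    centroids = list(initial_centroids)
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k                         # points absorbed per cluster
    sums = [[0.0] * dim for _ in range(k)]   # linear sum per cluster
    buffer = []

    def flush():
        # Assign each buffered point to its nearest centroid, fold it into
        # the compressed statistics, then discard the raw points.
        for p in buffer:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            counts[nearest] += 1
            for d in range(dim):
                sums[nearest][d] += p[d]
        for c in range(k):
            if counts[c]:
                centroids[c] = tuple(s / counts[c] for s in sums[c])
        buffer.clear()

    for point in stream:
        buffer.append(point)
        if len(buffer) == capacity:  # memory is "full": cluster and compress
            flush()
    if buffer:                       # remaining points after the stream ends
        flush()
    return centroids, counts
```

Because only the counts and sums survive each flush, memory use depends on the buffer capacity and the number of clusters rather than on the size of the data set.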