CN109447833A - Method for discovering interest communities among large-scale microblog users - Google Patents
- Publication number
- CN109447833A (application CN201811124489.8A)
- Authority
- CN
- China
- Prior art keywords
- interest
- data
- account
- microblog
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention discloses a method for discovering interest communities among large-scale microblog users, belonging to the technical field of data mining. The method comprises: (1) acquiring data to serve as the data source for microblog user interest community discovery; (2) performing data inspection and preprocessing; (3) standardizing the data, converting unstructured data into structured data so that it can be clustered; (4) performing cluster analysis with SSLOK-means improved by the Calinski-Harabasz (CH) function; (5) determining the best number of clusters k with the CH validity function, which completes the interest community discovery model. The invention makes it possible to cluster large-scale microblog user data in limited memory and to determine the number of clusters automatically, supporting the optimization of personalized microblog services and the growth of marketing revenue.
Description
Technical field
The invention belongs to the technical field of data mining, and in particular relates to a method for discovering interest communities among large-scale microblog users.
Background art
As an open network platform, microblogging offers users a broad space for sharing and communication. Thanks to its real-time, concise, and open nature, it has attracted a huge user base. According to the 2017 Sina Weibo user development report, as of September 30, 2017, Sina Weibo had 376 million monthly active users, a new record and a 27% increase over the same period of 2016. Faced with a user base that grows by the day, how microblog operators can provide users with more accurate personalized services is a major problem in urgent need of a solution. The massive data that microblog users generate on the platform contain rich information about user behavior; analyzing these data to find user groups with similar interest preferences can support microblog platforms in optimizing personalized services.
A review of research by scholars at home and abroad shows that current work on applying data mining to microblogs concentrates on information propagation, user characteristics, and user network structure, while research on segmenting microblog users into interest groups is comparatively scarce. Although some researchers have segmented microblog users with clustering algorithms and confirmed that doing so is feasible, they used only the plain K-means algorithm. On the one hand, this does not address the shortcoming that K-means requires k to be chosen manually; on the other hand, as the microblog user data sets to be clustered keep growing, the execution of K-means is limited by the memory actually available to the system.
Summary of the invention
In view of the deficiencies of the prior art, the invention proposes a method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with the Calinski-Harabasz function. It supports microblog operators in mining large-scale user data, optimizing personalized microblog services, and increasing marketing revenue.
A method for discovering interest communities among large-scale microblog users comprises the following steps:
Step 1), obtaining the information of microblog users;
Step 2), data inspection and preprocessing: including data inspection, user filtering, classification of the accounts followed by microblog users, and representation of microblog user interests;
Step 3), data standardization: calculating the interest preference degree of microblog users and vectorizing the microblog data;
Step 4), clustering the microblog data with SSLOK-means improved by the Calinski-Harabasz function, which determines the number of clusters automatically.
Further, the information of the microblog users includes basic user information and microblog account information. The basic user information includes user name, gender, region, and registration time; the microblog account information includes the account name, microblog verification, profile, number of followers, and number of accounts followed.
Further, the data inspection includes a data usability test and a correlation test.
Further, the user filtering method is specifically: microblog users who follow fewer accounts than one tenth of the average number of accounts followed by all microblog users are labeled "silent users" and removed from the data table.
Further, the method for classifying followed accounts is specifically: the "profile" and "verification" fields of the microblog accounts that a user follows are used to identify accounts of different categories, and the accounts in the following list are classified accordingly.
Further, the representation of microblog user interests includes determining the interest set, rejecting invalid accounts, and mapping to the interest set. Determining the interest set means classifying the interests of microblog users by reference to the classification systems of mainstream microblog platforms and the domain classification of microblog "big V" accounts, thereby forming the interest set. Rejecting invalid accounts means removing accounts that cannot reflect the user's interest preferences and keeping the accounts that clearly reflect user interests. Mapping to the interest set means that, for any account in the account set, there is always an interest in the interest set to which that account corresponds.
Further, the interest preference degree of a microblog user is $P(h_i) = \mathrm{Count}(h_i) / \mathrm{Count}(L)$, where Count(h_i) (i = 1, 2, ..., 10) is the number of accounts in the followed-account set that are mapped to interest h_i, Count(L) is the number of accounts in the user's followed-account set, and P(h_i) ∈ [0, 1]. To improve numerical precision in the subsequent clustering stage, the preference degree P(h_i) is scaled up by a factor of 100.
Further, the microblog data vectorization is specifically: microblog data are treated as documents, so that a microblog user's data correspond to a text document, the interests in the interest set correspond to feature terms, and the user's final interest preference degrees correspond to weights; the vector space model is then used to convert the microblog data into a two-dimensional numeric table, completing the vectorization.
Further, step 4) is specifically:
1. constructing a label set that stays resident in memory;
2. executing the SSLOK-means clustering algorithm for each value of the cluster-number parameter k from 2 up to an upper bound defined in terms of num, where num is the total number of data points;
3. evaluating the quality of the clustering results with the Calinski-Harabasz function.
Further, executing the SSLOK-means clustering algorithm is specifically: data points are read sequentially from the data set to be clustered into memory until the memory space is full; semi-supervised clustering in limited memory is performed over the data points in memory, SS (the summaries of compressed data), OUTS (the summaries of removed data), and the label set labels, until convergence; the data points in memory are compressed or discarded according to their membership and how tightly they aggregate; it is then checked whether data points remain in the data set to be clustered: if so, reading continues, otherwise the clustering process ends.
The beneficial effects of the invention are as follows:
The invention adopts the idea of semi-supervised learning and performs cluster analysis with the SSLOK-means clustering algorithm improved by the Calinski-Harabasz function. First, the Calinski-Harabasz validity function removes the need to determine k manually in advance for SSLOK-means. Second, SSLOK-means targets large-scale data sets and uses a label set resident in main memory to guide the clustering process; compared with K-means, both the efficiency of clustering and the quality of the clustering results are further improved. The invention applies this technique to a sample data set of microblog users, clusters it, divides the users into groups by interest preference according to the clustering result, and proposes targeted personalized recommendations for each interest community, which has theoretical guiding significance and practical value for improving personalized microblog services.
Description of the drawings
Fig. 1 is a flow chart of the method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with Calinski-Harabasz;
Fig. 2 is a diagram of the SSLOK-means clustering process;
Fig. 3 shows the clustering result after a further SSLOK-means clustering pass has converged;
Fig. 4 compares the age distribution of the sample data with that of the reference data.
Specific embodiments
The invention is further described below with reference to the drawings and embodiments, but the scope of protection of the invention is not limited thereto.
The invention applies a method for discovering interest communities among large-scale microblog users, based on SSLOK-means improved with Calinski-Harabasz, to a sample data set of microblog users: the data are clustered, the users are divided into groups by interest preference according to the clustering result, and targeted personalized recommendations are proposed for each interest community.
As shown in Fig. 1, the method for discovering interest communities among large-scale microblog users based on SSLOK-means improved with Calinski-Harabasz comprises the following steps:
Step 1), data acquisition
Using web crawlers tool crawl microblog users information, including user name, gender, area, registion time and
Pay close attention to the title of microblog account in list, microblogging certification, brief introduction, bean vermicelli quantity, concern quantity.
Step 2), data inspection and preprocessing
Step 2.1), data inspection
Before cluster analysis, it must be checked whether the sample data can represent the population. Three indicators (gender, age, and region) are chosen, and the sample data are compared with the reference data.
Before clustering it is also essential to check whether the variables are correlated. If variables are strongly correlated, the weight of what they represent is inflated, and clustering on such variables is not meaningful. The Pearson correlation coefficient, the Spearman correlation coefficient, and similar measures can be used for this test.
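The correlation check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `pearson` and `strongly_correlated_pairs`, the sample columns, and the 0.8 threshold are all introduced here for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(columns, threshold):
    """Return (name_a, name_b, r) for every variable pair with |r| >= threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Illustrative preference values for four users (made up for this sketch).
columns = {
    "food": [10.0, 20.0, 30.0, 40.0],
    "music": [12.0, 19.0, 33.0, 41.0],
    "sports": [30.0, 10.0, 40.0, 20.0],
}
flagged = strongly_correlated_pairs(columns, threshold=0.8)
```

Any pair returned in `flagged` would be a candidate for merging or dropping before clustering; an empty list means the variables pass the check at the chosen threshold.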
Step 2.2), user filtering
Among the crawled microblog users there are "silent users". Their main characteristic is that they follow very few microblog accounts, so their following lists cannot truly reflect their interest preferences, and they need to be removed. Therefore, microblog users who follow fewer accounts than one tenth of the average number of accounts followed by all microblog users are labeled "silent users" and removed from the data table.
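The one-tenth-of-the-mean rule can be sketched as follows. The function name `filter_silent_users`, the user identifiers, and the counts are hypothetical and serve only to illustrate the rule.

```python
def filter_silent_users(following_counts):
    """Split users into (kept, silent) using the one-tenth-of-the-mean rule.

    following_counts maps a user id to the number of accounts that user follows.
    """
    mean_count = sum(following_counts.values()) / len(following_counts)
    threshold = mean_count / 10
    kept = {u: c for u, c in following_counts.items() if c >= threshold}
    silent = {u: c for u, c in following_counts.items() if c < threshold}
    return kept, silent


# Made-up example: the mean is 100 followed accounts, so the cut-off is 10.
following_counts = {"user_a": 120, "user_b": 3, "user_c": 200, "user_d": 77}
kept, silent = filter_silent_users(following_counts)
```

Note that the mean in this sketch is computed once over all crawled users, including those that end up being removed, which matches the wording of the rule.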
Step 2.3), classification of the accounts followed by microblog users
The "profile" and "verification" fields of the microblog accounts that a user follows are used to identify accounts of different categories and to classify the accounts in the following list. Following the mainstream classification, followed accounts are divided into three categories: relatives and friends, celebrities, and functional accounts. Relatives-and-friends accounts are the microblogs of the user's friends and relatives; celebrity accounts are those of representative well-known figures in a specific field; functional accounts are accounts that serve some social function, generally including officially verified accounts of various industries and the information accounts of news media.
Step 2.4), representation of microblog user interests
On the one hand, not every account in the following list reflects the user's interest preferences, so accounts that cannot reflect user interests must be removed. On the other hand, accounts with similar functions reflect the same kind of interest preference, so the accounts must be consolidated and classified.
Step 2.4.1), determining the interest set
By reference to the classification systems of mainstream microblog platforms, such as Sina, Tencent, and Sohu microblogs, and to the domain classification of microblog "big V" accounts, the interests of microblog users are divided into ten categories that form the interest set H: fashion and shopping, food, travel and photography, sports, film and TV entertainment, music, games and animation, text reading, industry and work, and IT and digital.
Step 2.4.2), rejecting invalid accounts
Let the set of accounts in a microblog user's following list be P(p1, p2, p3, ..., pn) (n ∈ N+), where p_i (1 ≤ i ≤ n) represents one account in the list. Accounts classified as "relatives and friends", which cannot reflect the user's interest preferences, are removed, and the accounts that clearly reflect user interests are kept, forming the followed-account set L(l1, l2, l3, ..., lm) (m ∈ N+), where $L \subseteq P$. The number of accounts Count(L) in each user's followed-account set L is recorded.
Step 2.4.3), mapping to the interest set
For any account l_j in the account set L, an interest h_i can be found in the interest set H to which l_j corresponds; this determines one interest of the microblog user. Several followed accounts may correspond to the same interest. For each interest h_i (i = 1, 2, ..., 10), the number Count(h_i) of accounts in L that are mapped to it is recorded. At this point, the account information in each user's account set L has been converted into interest variables.
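Steps 2.4.2 and 2.4.3 can be sketched together as a filter followed by a count. Everything named here is an assumption made for illustration: the interest identifiers in `INTERESTS`, the `category_of` and `interest_of` lookup tables, and the choice to treat an account without a known interest as invalid.

```python
from collections import Counter

# Hypothetical identifiers for the ten interests in the interest set H.
INTERESTS = [
    "fashion_shopping", "food", "travel_photography", "sports", "film_tv",
    "music", "games_animation", "text_reading", "industry_work", "it_digital",
]


def count_interests(followed, category_of, interest_of):
    """Return (Count(L), {interest: Count(h_i)}) for one user's following list.

    followed    -- account names in the user's following list (the set P)
    category_of -- account name -> category from step 2.3
    interest_of -- account name -> interest in INTERESTS (assumed given)
    """
    # Step 2.4.2: drop relatives-and-friends accounts; accounts with no known
    # interest are also dropped (an assumption of this sketch).
    valid = [a for a in followed
             if category_of.get(a) != "relatives_friends" and a in interest_of]
    # Step 2.4.3: count how many valid accounts map to each interest.
    counts = Counter(interest_of[a] for a in valid)
    return len(valid), {h: counts[h] for h in INTERESTS}


followed = ["old_classmate", "nba_official", "home_chef", "sports_daily"]
category_of = {
    "old_classmate": "relatives_friends",
    "nba_official": "functional",
    "home_chef": "celebrity",
    "sports_daily": "functional",
}
interest_of = {"nba_official": "sports", "home_chef": "food", "sports_daily": "sports"}
count_L, interest_counts = count_interests(followed, category_of, interest_of)
```

How an account is assigned to an interest in the first place (the content of `interest_of`) is not specified at this level of detail in the text, so the sketch takes that mapping as an input.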
Step 3), data standardization
Data preprocessing yields, for each microblog user, the number Count(L) of accounts in the followed-account set L and the counts Count(h_i) (i = 1, 2, ..., 10) for the corresponding interests in the interest set H. Drawing on the vector space model, the existing data are standardized, converting unstructured data into structured data.
Step 3.1), calculating the interest preference degree of microblog users
A microblog user's preference for any interest can be reflected by the ratio of the number Count(h_i) (i = 1, 2, ..., 10) of accounts mapped to that interest to the number Count(L) of accounts in the user's followed-account set L. Let the user's preference degree for any interest h_i be $P(h_i) = \mathrm{Count}(h_i) / \mathrm{Count}(L)$, where P(h_i) ∈ [0, 1]; the larger the value, the more the user prefers that interest. To preserve numerical precision in the subsequent clustering stage, P(h_i) is scaled up by a factor of 100, so the final interest preference degree is G(h_i) = P(h_i) × 100, where G(h_i) ∈ [0, 100]. In this way the user's preference for each interest is converted into a number. By analogy, the user's preference degrees G(h_1), G(h_2), ..., G(h_10) for every interest in the interest set H(h_1, h_2, h_3, ..., h_10) are calculated, and then the preference degrees of all users for all interests.
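As a worked example of the two formulas, the following sketch computes G(h_i) = 100 × Count(h_i) / Count(L) for one user. The function name `preference_degrees` and the counts are illustrative assumptions.

```python
def preference_degrees(interest_counts, count_L):
    """Return {interest: G(h_i)} with G(h_i) = 100 * Count(h_i) / Count(L)."""
    if count_L <= 0:
        raise ValueError("Count(L) must be positive")
    return {h: 100.0 * c / count_L for h, c in interest_counts.items()}


# Made-up user: 4 valid followed accounts, 2 of them mapped to sports.
interest_counts = {"sports": 2, "food": 1, "music": 1, "it_digital": 0}
G = preference_degrees(interest_counts, count_L=4)
```

When every account in L maps to exactly one interest, the values of G for one user add up to 100, which gives a quick sanity check on the preprocessing.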
Step 3.2), microblog data vectorization
The vector space model can be described as follows: for a text document d_i, the feature terms t_j (j = 1, ..., n) are mutually distinct words, and the weight of t_j in d_i is denoted w_ij, so the document can be expressed as V(d_i) = ((t_1, w_i1), (t_2, w_i2), ..., (t_n, w_in)). A microblog user's data are made to correspond to the text document d_i, the interests h_i (i = 1, 2, ..., 10) in the interest set to the feature terms t_j (j = 1, ..., n), and the user's final interest preference degrees G(h_i) to the weights w_ij. This completes the vectorization of the microblog user data: users' preferences for the interests are finally vectorized into a two-dimensional numeric table.
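The resulting two-dimensional table has one row per user and one column per interest. A minimal sketch, assuming each user's preference degrees are held in a dictionary (the function name `vectorize`, the user ids, and the three-interest column list are hypothetical):

```python
def vectorize(users_G, interests):
    """Build a user-by-interest table of preference degrees.

    users_G   -- {user_id: {interest: G value}}
    interests -- fixed column order (the feature terms t_1..t_n)
    Returns (user_ids, table), where table[r][c] is the weight w_rc.
    """
    user_ids = sorted(users_G)
    table = [[float(users_G[u].get(h, 0.0)) for h in interests] for u in user_ids]
    return user_ids, table


interests = ["food", "music", "sports"]  # shortened column list for the example
users_G = {
    "u2": {"sports": 50.0, "food": 50.0},
    "u1": {"music": 100.0},
}
user_ids, table = vectorize(users_G, interests)
```

Fixing the column order once and filling missing interests with 0 keeps every row the same length, which is what a clustering routine expects.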
Step 4), clustering the microblog data with SSLOK-means improved by the Calinski-Harabasz function
The CH function is defined as
$$CH(k) = \frac{\operatorname{tr}B(k)/(k-1)}{\operatorname{tr}W(k)/(n-k)},$$
where $\operatorname{tr}B(k) = \sum_{j=1}^{k} n_j \lVert u_j - u \rVert^2$ is the trace of the between-cluster scatter matrix, n_j is the number of tuples in the j-th cluster, u is the mean of the entire data set, and u_j is the mean of the j-th cluster; $\operatorname{tr}W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - u_j \rVert^2$ is the trace of the within-cluster scatter matrix, n is the total number of tuples, k is the number of clusters, and x_i denotes a tuple in cluster j (written C_j). The CH function is the ratio of between-cluster separation to within-cluster dispersion; the larger the CH value, the better the clustering quality, so the number of clusters at which CH is largest is the best.
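The CH value can be computed directly from a labeling of the data. The sketch below uses the standard Calinski-Harabasz definition (between-cluster scatter over k - 1, divided by within-cluster scatter over n - k); the function names and the four sample points are assumptions for illustration.

```python
def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """CH = (trB / (k - 1)) / (trW / (n - k)) for the given cluster labels.

    Raises ZeroDivisionError when every point coincides with its cluster mean.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k, n = len(clusters), len(points)
    if k < 2 or k >= n:
        raise ValueError("CH needs 2 <= k < n")
    u = mean_vector(points)
    tr_b = 0.0  # between-cluster scatter
    tr_w = 0.0  # within-cluster scatter
    for members in clusters.values():
        u_j = mean_vector(members)
        tr_b += len(members) * sq_dist(u_j, u)
        tr_w += sum(sq_dist(x, u_j) for x in members)
    return (tr_b / (k - 1)) / (tr_w / (n - k))


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
ch_good = calinski_harabasz(points, [0, 0, 1, 1])  # left pair vs right pair
ch_bad = calinski_harabasz(points, [0, 1, 0, 1])   # bottom pair vs top pair
```

On these four points the natural left/right split scores far higher than the bottom/top split, which is the behavior the selection of k relies on.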
Step 4.1), constructing a label set that stays resident in memory
Taking the cost of manual labeling into account, a small portion of the data set is labeled manually with the class each record belongs to. This label set stays resident in main memory and guides the SSLOK-means clustering process, improving clustering efficiency and the quality of the clustering results.
Step 4.2), for each value of the cluster-number parameter k from 2 up to an upper bound defined in terms of num (the total number of data points), the SSLOK-means clustering algorithm is executed as follows:
Step 4.2.1), data points are read sequentially from the data set into memory until the memory space is full, so that the data are clustered in batches.
Step 4.2.2), semi-supervised clustering in limited memory is performed over the data points in memory, SS (the summaries of compressed data), OUTS (the summaries of removed data), and labels (the label set), until convergence. The clustering process is shown in Fig. 2 and Fig. 3. During this step, the membership of an SS may change, and new SS may appear, such as the New set shown in Fig. 3. The membership of an OUTS cannot change: it always belongs to its corresponding MC (main class) and records that class's clustering information in bulk. In this process OUTS and SS do not store the individual data points but replace them with a triple (sum_j, sumsq_j, num_j), where sum_j ∈ R^n is the sum of the data points summarized for the j-th cluster, sumsq_j ∈ R^n is the sum of their squares, num_j is the number of those data points, and R^n denotes the real vector space.
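The triple can be sketched as a small summary object that is updated as points are absorbed, so the points themselves never need to stay in memory. This sketch assumes that sum_j and sumsq_j are the per-dimension sum and sum of squares, as is usual for sufficient statistics of this kind; the class name `ClusterSummary` and its methods are hypothetical.

```python
class ClusterSummary:
    """Triple (sum_j, sumsq_j, num_j) standing in for points no longer kept."""

    def __init__(self, dim):
        self.sum = [0.0] * dim    # per-dimension sum of the absorbed points
        self.sumsq = [0.0] * dim  # per-dimension sum of squares
        self.num = 0              # number of absorbed points

    def add(self, point):
        """Absorb one data point into the summary."""
        for d, value in enumerate(point):
            self.sum[d] += value
            self.sumsq[d] += value * value
        self.num += 1

    def merge(self, other):
        """Absorb another summary (the triples simply add up)."""
        for d in range(len(self.sum)):
            self.sum[d] += other.sum[d]
            self.sumsq[d] += other.sumsq[d]
        self.num += other.num

    def centroid(self):
        """Mean of the absorbed points, recovered from the triple alone."""
        return [s / self.num for s in self.sum]

    def variance(self):
        """Per-dimension variance, recovered from the triple alone."""
        return [sq / self.num - (s / self.num) ** 2
                for s, sq in zip(self.sum, self.sumsq)]


summary = ClusterSummary(dim=2)
summary.add([1.0, 2.0])
summary.add([3.0, 4.0])
```

Because the centroid and the spread can both be recovered from the triple, a compressed group can still take part in later center updates as if its points were present.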
Suppose OUTS ∈ MC and x_j ∈ MC, j = 1, ..., n, where n is the target number of clusters. If part of the data set satisfies the compression condition (evaluated with the number of clusters set to N times the target number of clusters, where N is usually 4 and β is a small preset value expressing cluster tightness), it can be replaced by a triple. Each main class MC is then described by its members, with SS ∈ MC, OUTS ∈ MC, x_j ∈ MC, j = 1, ..., n, where count denotes a count.
Within a single clustering pass, so that the algorithm terminates at a controlled final convergence, a small termination threshold ε can be defined, for example ε = 1e-8. If, between two consecutive clustering iterations, the total shift of all cluster centers satisfies move <= ε, the pass is considered to have converged.
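The stopping test can be sketched as follows. The text does not say how the shift of a center is measured, so Euclidean distance is assumed here; the function names are hypothetical.

```python
from math import dist


def total_center_shift(old_centers, new_centers):
    """Sum of the Euclidean distances each cluster center moved (assumed metric)."""
    return sum(dist(a, b) for a, b in zip(old_centers, new_centers))


def has_converged(old_centers, new_centers, eps=1e-8):
    """True when the total shift 'move' between two iterations is <= eps."""
    return total_center_shift(old_centers, new_centers) <= eps


old_centers = [[0.0, 0.0], [1.0, 1.0]]
unchanged = [[0.0, 0.0], [1.0, 1.0]]
moved = [[3.0, 4.0], [1.0, 1.0]]
```

With these values, `has_converged(old_centers, unchanged)` holds, while `moved` shifts the first center by a distance of 5 and does not pass the test.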
Step 4.2.3), compressing and discarding the data points in memory
Sets of data points that satisfy the compression condition are replaced by SS, and the corresponding data points are removed from main memory. Data points that satisfy the discard condition are moved out of main memory into the OUTS set. This frees space in main memory for reading further data points.
Step 4.2.4), checking whether data points remain in the data set to be clustered: if so, return to step 4.2.1); otherwise the clustering process ends, the algorithm terminates, and the CH value is calculated.
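The outer read loop of steps 4.2.1 and 4.2.4 amounts to consuming the data set once, one memory-sized batch at a time. A minimal sketch, in which a fixed `batch_size` stands in for "until the memory space is full" and the function name `read_batches` is an assumption:

```python
from itertools import islice


def read_batches(data_iter, batch_size):
    """Yield successive lists of at most batch_size points from a single scan."""
    it = iter(data_iter)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:  # step 4.2.4: no data points remain, so the scan ends
            return
        yield batch    # step 4.2.1: one buffer of points to cluster


# 567 records read 100 at a time give five full batches and one of 67.
batch_sizes = [len(b) for b in read_batches(range(567), 100)]
```

Each yielded batch would be clustered together with the retained summaries and the label set, then compressed or discarded before the next batch is read.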
Step 4.3), the quality of the clustering results is evaluated with the Calinski-Harabasz function. The k with the largest CH value is selected as the best number of clusters, and the corresponding clustering result is the optimal division of microblog users into interest communities.
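The selection of k can be sketched as a loop over candidate values that keeps the labeling with the highest score. The clustering routine and the scoring function are passed in as parameters; the two toy stand-ins below, and the scores they return, are made up so that the sketch runs on its own and are not SSLOK-means or real CH values.

```python
def select_best_k(points, k_values, cluster_fn, score_fn):
    """Run cluster_fn for every k and return the k with the largest score."""
    scores = {}
    labelings = {}
    for k in k_values:
        labels = cluster_fn(points, k)
        labelings[k] = labels
        scores[k] = score_fn(points, labels)
    best_k = max(scores, key=scores.get)  # on ties, the first k tried wins
    return best_k, labelings[best_k], scores


def round_robin_clusterer(points, k):
    """Placeholder for the clustering algorithm: point i goes to cluster i mod k."""
    return [i % k for i in range(len(points))]


MADE_UP_SCORES = {2: 1.5, 3: 4.0, 4: 2.5}  # invented values, not real CH results


def lookup_score(points, labels):
    """Placeholder for the CH function: looks up an invented score by cluster count."""
    return MADE_UP_SCORES[len(set(labels))]


points = list(range(12))
best_k, best_labels, scores = select_best_k(
    points, range(2, 5), round_robin_clusterer, lookup_score
)
```

Keeping the labeling alongside each score means the winning partition is available immediately, without rerunning the clustering for the chosen k.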
Implementation conditions: the experimental environment is a ThinkPad 20FWA00VCD with a Core i7-6700 CPU @ 2.6GHz, 8GB of memory, the 64-bit Win10 operating system, a web crawler tool, and the Matlab platform.
Embodiment 1:
Cluster analysis is performed on Sina Weibo user data to discover interest communities of microblog users and to provide personalized services for the different communities, supporting the optimization of personalized microblog services and the growth of marketing revenue.
1. The "Octopus Collector" tool was used to crawl the Sina Weibo information of 627 ordinary users. Filtering the collected information of these 627 microblog users removed 60 "silent users", leaving the information of 567 valid microblog users and completing the user filtering. The preprocessed Sina Weibo data of this embodiment are shown in Table 1, which lists the first 5 rows as an example.
Table 1. User interest preference vectors
2. Inspecting the sample data:
1) Before cluster analysis, it must be checked whether the sample data can represent the population; otherwise clustering is meaningless. Based on the "Microblog media characteristics and user usage research report" (hereinafter "the report") officially issued in 2010, three indicators (gender, age, and region) are chosen and the sample data are compared with the reference data, with the following results:
① Gender. Of the 567 Sina Weibo users collected in this experiment, 243 are male and 324 are female, accounting for 42.9% and 57.1% respectively. In the report, the reference proportions of microblog users are 39% male and 61% female. The gender ratio of the sample differs little from the reference data, so the data are credible.
② Age. Of the 567 users collected in this experiment, 39 are under 18 (6.88%), 364 are aged 18 to 30 (64.20%), and 164 are over 30 (28.92%). Fig. 4 compares these with the reference age proportions in the report. The sample data are essentially consistent with the reference data.
③ Region. The sample data collected in this experiment cover Jiangsu, Hebei, Liaoning, Heilongjiang, Beijing, Henan, Tianjin, Hainan, Xinjiang, Guangxi, Guangdong, Shenzhen, Shandong, Fujian, Jilin, Hong Kong, Zhejiang, Hubei, overseas, Ningxia, Sichuan, Chongqing, and Yunnan. The sample data are therefore general and credible.
Therefore, the sample data chosen in this embodiment can represent the overall situation of domestic microblog users and pass the usability test.
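The sample proportions quoted above can be re-derived from the raw counts. The helper name `shares` is hypothetical; the counts and the 39% and 61% reference proportions are the ones stated in the text.

```python
def shares(counts):
    """Percentage share of each category, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


gender = shares({"male": 243, "female": 324})
age = shares({"under_18": 39, "18_to_30": 364, "over_30": 164})

# Gap between the sample and the report's reference proportions, in percentage points.
reference_gender = {"male": 39.0, "female": 61.0}
gender_gap = {name: round(gender[name] - reference_gender[name], 2) for name in gender}
```

The male share comes out about 3.9 percentage points above the reference and the female share about 3.9 points below, which is the size of difference the text treats as acceptable.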
2) Before clustering it is essential to check whether the variables are correlated. If variables are strongly correlated, the weight of what they represent is inflated, and clustering on such variables is not meaningful. This embodiment applies the Pearson correlation test to the ten interests; the results are shown in Table 2.
Table 2. Pearson correlation test results
As Table 2 shows, the correlations between the ten interest variables are all below 0.3, so the correlation among the ten interests is weak and the sample data pass the variable correlation test.
The data analysis shows that the sample data have passed the usability and variable correlation tests and can be used for cluster analysis.
3. Cluster analysis with SSLOK-means improved by CH
The Semisupervised Labels Onescan K-means (SSLOK-means) algorithm was implemented on the MATLAB platform. Guided by the label set resident in main memory, 100 sample records were read and clustered at a time. The range of the number of clusters k was set to 2 ≤ k ≤ 10, which yields 9 groups of results. The validity of these 9 groups of results was evaluated with the Calinski-Harabasz (CH) function, giving the CH validity value for each number of clusters k; the intermediate output of the algorithm is shown in Table 3.
Table 3. Comparison of CH values
As Table 3 shows, the validity function CH is largest when k is 6. The SSLOK-means clustering algorithm improved by Calinski-Harabasz therefore automatically chooses k = 6 and completes the clustering, giving the division of Sina Weibo users into interest communities.
4. Description of the Sina Weibo user interest community division
After SSLOK-means cluster analysis and the CH criterion function, the most reasonable number of clusters is 6. The number of samples in each cluster is shown in Table 4, the distances between the cluster centers in Table 5, and the final cluster centers in Table 6.
Table 4. Number of records in each cluster when the number of clusters is 6
Table 5. Distances between cluster centers when the number of clusters is 6
Table 5 shows that the cluster centers differ significantly: the distance between the centers of cluster 2 and cluster 3 is the largest, at 55.5, and the distance between the centers of cluster 5 and cluster 6 is the smallest, at 26.8.
Table 6. Final cluster centers when the number of clusters is 6
To describe more intuitively how strongly the individuals in each cluster prefer each interest, the cluster center values are discretized: values in [0, 2) denote extremely weak interest, [2, 10) weak interest, [10, 15) medium interest, [15, 25) fairly strong interest, and [25, 100] strong interest. The cluster centers after this processing are shown in Table 7.
Table 7. Interest discretization results
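The discretization rule is a lookup over half-open intervals. A minimal sketch with the interval bounds from the text; the function name `interest_level` and the label strings are chosen here for illustration.

```python
# (exclusive upper bound, label) pairs for the half-open intervals, in order.
LEVELS = [
    (2, "extremely weak"),
    (10, "weak"),
    (15, "medium"),
    (25, "fairly strong"),
]


def interest_level(value):
    """Map a cluster-center value in [0, 100] to its interest level."""
    if not 0 <= value <= 100:
        raise ValueError("cluster-center values must lie in [0, 100]")
    for upper, label in LEVELS:
        if value < upper:
            return label
    return "strong"  # the closed interval [25, 100]


levels = [interest_level(v) for v in (0, 2, 9.99, 10, 15, 25, 100)]
```

Applying `interest_level` to every entry of a cluster center turns the numeric center into the kind of row shown in the discretized table.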
5. Analysis of the Sina Weibo user interest community division and personalized recommendation suggestions
The information of microblog users was clustered with the SSLOK-means algorithm improved by the CH function, and the users were finally divided into 6 interest communities with different interest preferences, each with its own characteristics. Analyzing within each of the 6 clusters, that is, reading the interest discretization results (Table 7) column by column, gives the distinct characteristics of the 6 classes of users. In the order of cluster 1 to cluster 6, they can be summarized as: knowledge type, material type, IT type, career type, network type, and balanced type. For each type of interest community, personalized recommendation suggestions are given according to Table 7.
(1) Knowledge type. The sample size is medium, at 90. Users in this interest community have a strong interest in "text reading", a fairly strong interest in "industry and work", and a medium interest in "fashion and shopping", but weak interest in the other categories; in particular, their interest in sports is extremely weak. Such users like reading, often follow news in their industry, and are also fairly interested in fashion and shopping. Their needs therefore center on reading and on obtaining practical information; they value improving their quality of life and their own cultivation, and pursue a life of quality.
For such users, the microblog platform should regularly push accounts offering in-depth reading material, at about 40% of each push (Table 6). It should also push some high-quality accounts related to practical industry information, at about 20% of each push, and accounts about fashion matching and trendy brands, at about 12% of each push. In addition, since such users have extremely weak interest in sports, a small number of sports accounts can be pushed to them after an interval of several pushes, as an attempt at interest induction.
(2) Material type. The sample size is fairly large, at 123. Users in this interest community have a strong interest in "fashion and shopping", a medium interest in "food" and "text reading", and weaker interest in other fields. Such users follow fashion trends and pursue material life; a fashionable lifestyle is very attractive to them. They also have some interest in food and reading.
For such users, microblog accounts offering in-depth content on fashion brands, trends, clothing matching, and trendsetting celebrities should be pushed regularly, at about 50% of each push. In addition, once the user's location is known, local food and well-reviewed restaurants can be recommended, at about 12% of each push; this ratio can be adjusted according to when the user uses the microblog. For example, if the user uses the microblog at mealtimes, the push ratio can be increased appropriately, and outside mealtimes it can be reduced. Likewise, since such users have extremely weak interest in sports, a small number of sports accounts can be pushed to them after an interval of several pushes, as interest induction.
(3) IT type. The sample size is small, at 55. Users in this interest community have a strong interest in "IT and digital" and a medium interest in "text reading" and "industry and work". Such users are versatile: they like to follow internet news and frontier knowledge in the IT field, have a keen interest in electronic products, often follow information related to their industry, and value improving their own skills.
For such users, high-quality accounts in the IT and internet field should be recommended, and information on the latest electronic and digital products pushed regularly, at about 42% of each push; good reading accounts should be recommended regularly, at about 11% of each push; and high-quality accounts related to industry information should be pushed regularly, at about 11% of each push. In addition, since such users have extremely weak interest in music, a small number of music accounts can be pushed to them after an interval of several pushes, as an attempt at interest induction. Moreover, such users are often familiar with internet technologies, so questions about those technologies can be referred to them for help.
(4) Career type. The sample size is the largest, at 142. Users in this community of interest have a strong interest in "industry work" microblogs, a medium interest in "fashionable shopping" and "text reading", and very weak interest in "sport", "film and television entertainment", and "games and animation". They follow domestic news and practical-skills information related to specific industries. Such users are career-oriented: they focus on improving their own skills and want to obtain industry information relevant to their own field.
For such users, once the specific field selected by the user is known, in-depth knowledge accounts in that field should be the main recommendation, making up about 50% of each push. High-quality text reading accounts should be pushed to them regularly, making up about 15% of each push. Light fashionable-shopping information can also be recommended regularly to meet their wish to improve their quality of life, making up about 13% of each push. In addition, since such users have very weak interest in "sport", "film and television entertainment", and "games and animation", a small amount of related information can be pushed to them at intervals of several pushes as an interest-inducing attempt.
(5) Network type. The sample size is small, at 37. Users in this community of interest have a strong interest in the "sport" field: they are keen to follow sports stars, official event websites, and sports microblog accounts. They also have a fairly strong interest in "text reading" and a medium interest in "film and television entertainment". Such users are mostly university students with relatively ample free time; they have considerable interest in study-related knowledge and some interest in entertainment information.
For such users, information about the specific sports selected by the user, such as event previews and sports stars, should be recommended in a targeted way, making up about 30% of each push. High-quality text reading accounts and accounts on examination information and skill improvement should be recommended regularly, making up about 17% of each push. In addition, since these users have some interest in film and television entertainment, messages related to TV series and films can be pushed to them regularly, making up about 14% of each push.
(6) Balanced type. The sample size is large, at 120. Users in this community of interest have no particular preference among the interests; their interest in "cuisine", "fashionable shopping", "travel photography", "film and television entertainment", "text reading", and "industry" information is roughly balanced. Such users have wide-ranging interests, and it can be inferred that they tend to be extroverted and readily accept new things.
For such users, information from every interest category should be recommended, with each category of account making up about 15% of each push. In addition, news about new things and microblog accounts in the social category, such as Today's Headlines and association accounts, can be recommended to them.
The users of the six different communities of interest have been analysed above: the typical characteristics of the six types of user are described in detail, and targeted suggestions are made for the personalized content recommended to each type. Personalized recommendation based on the characteristics of different user groups can improve the user experience, increase user stickiness, and in turn improve the economic benefit of microblog operation.
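The push ratios given for communities (3) to (6) above can be read as a simple allocation table. The following Python sketch is only an illustration: the key names, the function `allocate_push`, and the floor rounding are assumptions, and only the percentages come from the text. The share of each push that the text does not assign is left unallocated.

```python
# Percent of one push per interest category, as stated for communities (3)-(6).
# Key names are illustrative labels, not identifiers from the original method.
PUSH_RATIOS = {
    "type_3": {"it_digital": 42, "reading": 11, "industry": 11},
    "type_4": {"industry": 50, "reading": 15, "fashion_shopping": 13},
    "type_5": {"sport": 30, "reading": 17, "film_tv": 14},
    "type_6": {"cuisine": 15, "fashion_shopping": 15, "travel_photo": 15,
               "film_tv": 15, "reading": 15, "industry": 15},
}


def allocate_push(community, batch_size):
    """Split one push of `batch_size` items across interest categories."""
    ratios = PUSH_RATIOS[community]
    # Floor each share so the allocated total never exceeds the batch size.
    return {cat: batch_size * pct // 100 for cat, pct in ratios.items()}


if __name__ == "__main__":
    print(allocate_push("type_3", 100))
```

A time-dependent adjustment, such as raising the cuisine share at mealtimes, could be layered on top by editing the ratios before calling `allocate_push`.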
The experimental results show that the big-data clustering technique of SSLOK-means improved by Calinski-Harabasz is an effective new technique for discovering microblog user communities of interest, and that the discovered communities support the optimization of personalized microblog services and the improvement of marketing revenue.
The above embodiments merely illustrate the design ideas and features of the present invention; their purpose is to enable those skilled in the art to understand the content of the invention and implement it accordingly, and the protection scope of the invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made in accordance with the principles and design ideas disclosed by the present invention fall within its protection scope.
Claims (10)
1. A large-scale microblog user community-of-interest discovery method, characterized by comprising the following steps:
Step 1): obtain the information of microblog users;
Step 2): data inspection and preprocessing, including data inspection, user filtering, classification of the accounts followed by microblog users, and representation of microblog user interests;
Step 3): data standardization, including calculating the interest preference degree of microblog users and vectorizing the microblog data;
Step 4): cluster the microblog data using SSLOK-means improved by the Calinski-Harabasz function, automatically determining the number of clusters.
2. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the information of microblog users includes basic user information and microblog account information; the basic user information includes user name, gender, region, and registration time; the microblog account information includes the account name, microblog certification, brief introduction, number of followers, and number of accounts followed.
3. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the data inspection includes data availability and correlation tests.
4. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the user filtering method is specifically: microblog users whose number of followed microblog accounts is less than one tenth of the average number of accounts followed by all microblog users are marked as "silent users" and removed from the data table.
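The silent-user rule in claim 4 can be sketched in a few lines of Python. The function name and the dictionary layout (user id mapped to number of followed accounts) are assumptions made for illustration; only the "one tenth of the average" threshold comes from the claim.

```python
def filter_silent_users(follow_counts):
    """Drop users whose followed-account count is below one tenth of the mean.

    `follow_counts` maps a user id to the number of accounts that user follows.
    """
    if not follow_counts:
        return {}
    mean_count = sum(follow_counts.values()) / len(follow_counts)
    threshold = mean_count / 10
    # Users strictly below the threshold are the "silent users" to remove.
    return {user: n for user, n in follow_counts.items() if n >= threshold}


if __name__ == "__main__":
    users = {"u1": 200, "u2": 150, "u3": 5, "u4": 245}
    print(filter_silent_users(users))  # u3 falls below 150 / 10 = 15
```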
5. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the method for classifying the accounts followed by microblog users is specifically: the "brief introduction" and "certification" fields of the microblog accounts that the users follow are used to identify accounts of different categories, and the accounts in the follow list are classified accordingly.
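One plausible way to use the "brief introduction" and "certification" fields from claim 5 is keyword matching. The sketch below is an assumption-laden illustration: the source does not say how the fields are matched, and the keyword lists, category names, and function name are all hypothetical.

```python
# Hypothetical keyword lists; the source does not enumerate the keywords it uses.
CATEGORY_KEYWORDS = {
    "sport": ["football", "basketball", "athlete"],
    "it_digital": ["developer", "smartphone", "internet"],
    "cuisine": ["food", "restaurant", "chef"],
}


def classify_account(brief_intro, certification):
    """Return the first category whose keyword occurs in either field, else None."""
    text = f"{brief_intro} {certification}".lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return None  # no category matched; such an account carries no interest label


if __name__ == "__main__":
    print(classify_account("Official account of a basketball club", "Verified organisation"))
```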
6. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the representation of microblog user interests includes determining an interest set, rejecting invalid accounts, and mapping to the interest set; determining the interest set means classifying the interests of microblog users by reference to the classification systems of mainstream microblog platforms and the domain classification of microblog "big V" accounts, thereby forming the interest set; rejecting invalid accounts means removing accounts that cannot reflect the user's interest preferences and retaining accounts that clearly reflect user interest; mapping to the interest set means that, for any account in the account set, there is always an interest in the interest set to which that account corresponds.
7. The large-scale microblog user community-of-interest discovery method according to claim 1 or 2, characterized in that the interest preference degree of microblog users is $P(h_i) = \dfrac{\mathrm{Count}(h_i)}{\mathrm{Count}(L)}$, where $\mathrm{Count}(h_i)$ $(i = 1, 2, \ldots, 10)$ is the number of accounts in the followed-account set that are mapped to interest $h_i$, $\mathrm{Count}(L)$ is the number of accounts in each microblog user's followed-account set, and $P(h_i) \in [0, 1]$; to improve numerical accuracy in the subsequent clustering stage, the user's preference degree $P(h_i)$ for each interest is multiplied by 100.
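A minimal Python sketch of the preference-degree computation in claim 7, assuming the preference degree is the ratio of Count(h_i) to Count(L), which is what the definitions in the claim and the stated range [0,1] suggest. The function name and the input layout (one interest label per followed account) are illustrative assumptions.

```python
from collections import Counter


def interest_preference(followed_interests, interests):
    """Return 100 * Count(h_i) / Count(L) for every interest h_i of one user.

    `followed_interests` holds one interest label per followed account (the
    account set L after mapping); `interests` is the ordered interest set.
    """
    total = len(followed_interests)       # Count(L)
    counts = Counter(followed_interests)  # Count(h_i) for each interest
    if total == 0:
        return [0.0] * len(interests)
    # Scale by 100, as the claim does, to keep more precision for clustering.
    return [100 * counts[h] / total for h in interests]


if __name__ == "__main__":
    followed = ["sport", "sport", "reading", "it"]
    print(interest_preference(followed, ["sport", "reading", "it", "cuisine"]))
```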
8. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that the vectorization of microblog data is specifically: microblog data are treated as documents, so that a microblog user's data correspond to a text document, the interest set corresponds to the specific topics, and the user's final interest preference degrees correspond to the weights; the vector space model is then used to convert the microblog data into a two-dimensional table of numerical values, completing the vectorization.
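The two-dimensional table produced by the vectorization in claim 8 can be pictured as one row per user and one column per interest. The sketch below assumes each user's preference degrees are already available as a dictionary; the function name and data layout are illustrative, not taken from the source.

```python
def build_interest_table(user_preferences, interests):
    """Convert {user: {interest: weight}} into (user_ids, rows).

    Each row is one user's vector in the vector space model, with one column
    per interest in `interests`; missing interests get weight 0.0.
    """
    user_ids = sorted(user_preferences)
    rows = [
        [float(user_preferences[user].get(h, 0.0)) for h in interests]
        for user in user_ids
    ]
    return user_ids, rows


if __name__ == "__main__":
    prefs = {"u2": {"sport": 50.0}, "u1": {"reading": 25.0, "sport": 75.0}}
    ids, table = build_interest_table(prefs, ["sport", "reading"])
    for user, row in zip(ids, table):
        print(user, row)
```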
9. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that step 4) is specifically:
(1) construct the label set and keep it resident in memory;
(2) execute the SSLOK-means clustering algorithm with the cluster number parameter k ranging from 2 to [upper bound not reproduced in this text], where num is the total number of data points;
(3) evaluate the quality of the clustering results using the Calinski-Harabasz function.
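The model-selection loop in claim 9 can be sketched as follows. This is not the patented procedure: a plain k-means with deterministic farthest-point initialization stands in for SSLOK-means, and the upper bound on k is assumed to be the integer square root of num because the expression is not legible in this text. The Calinski-Harabasz index is computed in its standard form, between-cluster dispersion over (k - 1) divided by within-cluster dispersion over (n - k).

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=100):
    """Plain k-means (stand-in for SSLOK-means) with farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            updated.append(mean(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def calinski_harabasz(points, labels):
    """CH = [B / (k - 1)] / [W / (n - k)] over the non-empty clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    n, k, overall = len(points), len(clusters), mean(points)
    if k < 2 or n <= k:
        return 0.0  # index undefined; treated as the worst score in this sketch
    between = within = 0.0
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    return math.inf if within == 0 else (between / (k - 1)) / (within / (n - k))


def best_k(points):
    """Try k = 2 .. isqrt(num) (assumed bound) and keep the k with the highest CH."""
    k_max = max(2, math.isqrt(len(points)))
    scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
    print(best_k(data))
```

Where scikit-learn is available, its `calinski_harabasz_score` function computes the same index and could replace the hand-written version.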
10. The large-scale microblog user community-of-interest discovery method according to claim 1, characterized in that executing the SSLOK-means clustering algorithm is specifically: read data points sequentially from the data set to be clustered into memory until the memory space is full; perform semi-supervised clustering within the limited memory on the data points in memory, the information SS of the compressed data, the information OUTS of the discarded data, and the label set labels, until convergence; compress or discard the data points in memory according to their cluster membership and aggregation; then determine whether data points remain in the data set to be clustered: if so, continue reading data; otherwise, the clustering process ends.
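The read, cluster, compress, repeat loop in claim 10 can be illustrated with a heavily simplified single-pass k-means. The sketch is not SSLOK-means itself: it keeps only one (count, sum) summary per cluster in place of the separate SS and OUTS structures, it discards every raw point after each buffer is processed, and the function names and the `memory_size` parameter are assumptions. The labelled seed points play the role of the resident label set and fix the initial centres.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chunks(stream, size):
    """Yield lists of at most `size` points, mimicking a memory buffer that fills up."""
    buf = []
    for point in stream:
        buf.append(point)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def centers_of(counts, sums):
    return {l: [s / counts[l] for s in sums[l]] for l in counts}


def single_pass_kmeans(stream, seeds, memory_size, iters=20):
    """Cluster a stream buffer by buffer, keeping per-cluster (count, sum) summaries.

    `seeds` maps a cluster label to a non-empty list of labelled points.
    """
    labels = sorted(seeds)
    counts = {l: len(seeds[l]) for l in labels}
    sums = {l: [float(sum(col)) for col in zip(*seeds[l])] for l in labels}
    for chunk in chunks(stream, memory_size):
        centers = centers_of(counts, sums)
        trial_counts, trial_sums = counts, sums
        for _ in range(iters):
            # Assign buffered points to the nearest centre, then fold them into
            # a trial copy of the compressed statistics to update the centres.
            assign = [min(labels, key=lambda l: sq_dist(p, centers[l])) for p in chunk]
            trial_counts = dict(counts)
            trial_sums = {l: list(s) for l, s in sums.items()}
            for p, a in zip(chunk, assign):
                trial_counts[a] += 1
                trial_sums[a] = [s + x for s, x in zip(trial_sums[a], p)]
            updated = centers_of(trial_counts, trial_sums)
            if updated == centers:
                break
            centers = updated
        # Compression step: keep the summaries, drop the raw points of this buffer.
        counts, sums = trial_counts, trial_sums
    return centers_of(counts, sums), counts


if __name__ == "__main__":
    data = [(0, 1), (1, 0), (10, 9), (9, 10), (1, 1), (9, 9)]
    seeds = {"a": [(0, 0)], "b": [(10, 10)]}
    print(single_pass_kmeans(iter(data), seeds, memory_size=3))
```

Because raw points are never revisited, memory use is bounded by the buffer size plus one summary per cluster, which is the property the claim relies on for large data sets.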
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811124489.8A CN109447833A (en) | 2018-09-26 | 2018-09-26 | A kind of extensive microblog users community of interest discovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109447833A (en) | 2019-03-08 |
Family
ID=65544480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811124489.8A (Pending, CN109447833A (en)) | A kind of extensive microblog users community of interest discovery method | 2018-09-26 | 2018-09-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109447833A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112712116A (en) * | 2020-12-29 | 2021-04-27 | 山西大学 | User community structure dividing method and system of microblog network |
CN112712115A (en) * | 2020-12-29 | 2021-04-27 | 山西大学 | Network user group division method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103823890A (en) * | 2014-03-10 | 2014-05-28 | 中国科学院信息工程研究所 | Microblog hot topic detection method and device aiming at specific group |
CN106484764A (en) * | 2016-08-30 | 2017-03-08 | 江苏名通信息科技有限公司 | User's similarity calculating method based on crowd portrayal technology |
US20180052991A1 (en) * | 2013-12-04 | 2018-02-22 | Plentyoffish Media Ulc | Apparatus, method and article to facilitate automatic detection and removal of fraudulent user information in a network environment |
WO2018085401A1 (en) * | 2016-11-03 | 2018-05-11 | Thomson Reuters Global Resources Unlimited Company | Systems and methods for event detection and clustering |
- 2018-09-26: CN application CN201811124489.8A filed, published as CN109447833A, status Pending
Non-Patent Citations (1)
Title |
---|
Yan Shen et al.: "The Discovery of Micro-blog Users' Interest Groups Based on SSLOK-means Clustering Focus on Big Data Improved by Calinski-Harabasz", 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA) * |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190308 |