CN106202480B - A network behavior habit clustering method based on bidirectional verification of K-means and LDA - Google Patents
- Publication number
- CN106202480B (application CN201610565749.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a network behavior habit clustering method based on bidirectional verification of K-means and LDA. The method uses the web page attributes, keywords, and frequencies in people's internet access records, and combines the K-means algorithm, the LDA document topic extraction model, and an annealing algorithm. K-means clustering is first run on the person-label-frequency set and the LDA document topic extraction model is built from the browsing record-person-keyword set, and the intermediate results are stored. The annealing algorithm then performs bidirectional verification between K-means and LDA to compute the globally best topic-category label sequence, on the basis of which the network behavior habit clustering result is optimized. The bidirectional verification of K-means and LDA improves sensitivity to the person-category label assignment, and the annealing algorithm improves the efficiency of optimizing the clustering result, which in turn improves clustering accuracy.
Description
Technical field
The invention belongs to the fields of cluster analysis and optimization algorithms, and in particular relates to a network behavior habit clustering method based on bidirectional verification of K-means and Latent Dirichlet Allocation (LDA). It is used to optimize clustering results and thereby improve clustering accuracy, which increases the practical value of people's internet access records.
Background art
Clustering methods for network behavior habit data play an important role in studying people's online habits. As the internet becomes ever more widespread, more and more people choose to obtain the information they are interested in through the network. The volume of content that people browse online is huge; analyzing this data purely by hand is inefficient and also not very accurate. Cluster analysis, combined with a second clustering method for bidirectional verification, can improve both the efficiency and the accuracy of the analysis. Common clustering algorithms include K-means clustering and the LDA document topic extraction model; common optimization algorithms include simulated annealing and genetic algorithms.
LDA document topic extraction model:
LDA (Latent Dirichlet Allocation) is a generative document topic model, also described as a three-layer Bayesian probability model with a word layer, a topic layer, and a document layer. "Generative" means that each word of a document is regarded as the result of the process "choose a topic with a certain probability, then choose a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial. LDA is an unsupervised machine learning technique that can be used to identify the topic information hidden in a large document collection or corpus. It uses the bag-of-words approach, which treats each document as a word frequency vector and thereby converts text into numerical information that is easy to model. The bag-of-words approach ignores the order of words, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over words.
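The generative story described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the LDA modeling tool used by the invention; the function names, the toy topic-word table, and the value 0.5 for the Dirichlet parameter are assumptions made for the example.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet sample obtained by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one bag-of-words document: topic first, then word."""
    # Document-to-topic distribution for this document.
    theta = sample_dirichlet(alpha, len(topic_word), rng)
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)          # choose a topic
        w = sample_categorical(topic_word[z], rng)  # choose a word from that topic
        words.append(w)
    return theta, words

# Hypothetical toy model: 2 topics over a 3-word vocabulary.
rng = random.Random(0)
topic_word = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
theta, words = generate_document(topic_word, 0.5, 50, rng)
```

Fitting an LDA model runs this story in reverse: given only the word counts, it estimates the document-to-topic and topic-to-word distributions.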
K-means clustering algorithm:
K-means is a clustering method that originated as a vector quantization method in signal processing and is now more popular in the field of data mining. The goal of k-means clustering is to partition n points (each an observation or sample instance) into k clusters so that each point belongs to the cluster whose mean (the cluster center) is nearest to it; this nearest-mean rule is the clustering criterion. The problem amounts to partitioning the data space into Voronoi cells. It is computationally hard (NP-hard), but efficient heuristic algorithms exist. In practice these efficient heuristics are generally used, and they converge quickly to a local optimum. They resemble the expectation-maximization (EM) algorithm for Gaussian mixture distributions, which is also an iterative refinement method. Both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.
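The iterative heuristic mentioned above is commonly implemented as alternating assignment and update steps (Lloyd's algorithm). The following is a minimal standard-library sketch under that assumption; the function name `kmeans`, the random choice of initial centers, and the toy points are illustrative and are not taken from the patent.

```python
import random

def kmeans(points, k, iterations, seed=0):
    """Alternate nearest-center assignment and mean update for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, 2, 10)
```

The result is a local optimum that depends on the initial centers, which is why practical tools usually restart from several initializations.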
Simulated annealing:
The idea behind simulated annealing (SA) was first proposed by N. Metropolis et al. in 1953. In 1983, S. Kirkpatrick et al. successfully introduced the annealing idea into combinatorial optimization. It is a stochastic optimization algorithm based on a Monte Carlo iterative solution strategy, motivated by the similarity between the annealing of solids in physics and general combinatorial optimization problems. Starting from a relatively high initial temperature, simulated annealing searches the solution space at random for the global optimum of the objective function as the temperature parameter decreases, using probabilistic jumps; that is, it can probabilistically escape a local optimum and eventually tend toward the global optimum. Simulated annealing is a general-purpose optimization algorithm that in theory has probabilistic global optimization performance, and it is widely applied in engineering, for example in VLSI, production scheduling, control engineering, machine learning, neural networks, and signal processing. By giving the search process a time-varying jump probability that eventually tends to zero, simulated annealing is a serial optimization algorithm that can effectively avoid becoming trapped in local minima and eventually tends toward the global optimum.
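As an illustration of the probabilistic jump described above, here is a minimal generic sketch of simulated annealing for a minimization problem with the usual Metropolis acceptance rule. The function name `anneal`, the default temperature schedule, and the toy cost function are assumptions chosen for the example, not values from the patent.

```python
import math
import random

def anneal(cost, start, neighbor, temperature=100.0, cool=0.95, floor=0.1, seed=0):
    """Minimize cost by accepting worse moves with a probability that shrinks as T falls."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    while temperature > floor:
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse move with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cool  # cooling schedule
    return best, best_cost

# Hypothetical toy problem: walk along the integers toward the minimum of (x - 3)^2.
best, best_cost = anneal(
    lambda x: (x - 3) ** 2,
    50,
    lambda x, rng: x + rng.choice([-1, 1]),
    temperature=1000.0,
    cool=0.99,
)
```

At high temperature almost every move is accepted; as the temperature approaches the floor, the search behaves more and more like plain descent.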
Heuristic search:
Two basic goals of computer science are to find algorithms that provably run efficiently and that provably obtain optimal or near-optimal solutions. A heuristic algorithm gives up one or both of these goals. For example, it often finds very good solutions, but there is no proof that it will not produce worse ones; it usually finds an answer in reasonable time, but there is no guarantee that it will always do so at that speed. In some special cases a heuristic algorithm gives very poor answers or runs very inefficiently, but the data structures that cause those cases may never occur in the real world. Heuristic algorithms are therefore very commonly used to solve real-world problems, and for many practical problems they usually obtain good answers in reasonable time. Common heuristic algorithms include ant colony algorithms, genetic algorithms, and simulated annealing.
Summary of the invention
To help administrators understand and improve the online habits of the people they manage, reduce the adverse effects of the internet, and mine the general relationships between those people's internet data and their interests, the invention comprehensively analyzes the managed people's internet access records. Using a document clustering algorithm based on the LDA model together with the K-means clustering algorithm, it designs and implements a network behavior habit clustering method based on bidirectional verification of K-means and LDA, providing a system model of good reference value for analyzing and managing the managed people's online behavior.
To aid understanding of the theoretical basis of the invention, the difference between the approach of the invention and the traditional approach is described as follows:
In traditional clustering methods, the initial data is usually clustered in a single way and the result is then verified by manual analysis. Building on the traditional method, the invention instead uses two clustering methods, verifies the accuracy of the clustering algorithms with a custom verification method, and uses simulated annealing to improve the efficiency of optimizing the clustering result.
The technical solution of the invention is as follows. Using the web page attributes, keywords, and frequencies in people's internet access records, and combining the K-means algorithm, the LDA document topic extraction model, and the annealing algorithm, K-means clustering is first run on the person-label-frequency set and the LDA document topic extraction model is built from the browsing record-person-keyword set, and the intermediate results are stored. The annealing algorithm is then used to perform bidirectional verification between K-means and LDA and to compute the globally best topic-category label sequence, on the basis of which the network behavior habit clustering result is optimized. The method comprises the simulated annealing main flow (step A) and the cost function flow (step B):
Simulated annealing main flow, steps A1 to A26:
Step A1: Let the person-label-frequency set be PERSONLABELFREQ = {(PERSON_{p1}, LABEL_{p1}, FREQ_{p1}), (PERSON_{p2}, LABEL_{p2}, FREQ_{p2}), …, (PERSON_{pa}, LABEL_{pa}, FREQ_{pa})}, where PERSON_{p1}, PERSON_{p2}, …, PERSON_{pa} are unique person identifiers; LABEL_{p1}, LABEL_{p2}, …, LABEL_{pa} are the overall attributes of the web content a person browses (one person identifier may correspond to several attributes); and FREQ_{p1}, FREQ_{p2}, …, FREQ_{pa} are the weights of those overall attributes. Let the browsing record-person-keyword set be RECORDIDPERSONKEYWORD = {(RECORDID_{r1}, PERSON_{r1}, KEYWORD_{r1}), (RECORDID_{r2}, PERSON_{r2}, KEYWORD_{r2}), …, (RECORDID_{ra}, PERSON_{ra}, KEYWORD_{ra})}, where RECORDID_{r1}, RECORDID_{r2}, …, RECORDID_{ra} are unique record identifiers, each composed of the unique person identifier and the access date; PERSON_{r1}, PERSON_{r2}, …, PERSON_{ra} are unique person identifiers; and KEYWORD_{r1}, KEYWORD_{r2}, …, KEYWORD_{ra} are the keywords contained in the browsed web content. Let the Dirichlet parameter of the topic distribution be ALPHA, the Dirichlet parameter of the keyword distribution be ETA, the number of iterations of the LDA document topic extraction model be ITERLDA, the number of iterations of the K-means clustering algorithm be ITERKMEANS, the total number of category labels be CATEGORYNUM, the total number of topics of the LDA document topic extraction model be TOPICNUM, the simulated annealing temperature be T, the simulated annealing change step size be STEP, and the simulated annealing cooling parameter be COOL;
Step A2: Let the result set of the K-means clustering algorithm be the person-category label set, denoted KMEANSPERSONCATEGORY, where category labels are represented by natural numbers; let the topic-keyword set of the LDA model be LDATOPICWORD, where topics are represented by natural numbers; let the topic-person set of the LDA model be LDATOPICPERSON, where topics are represented by natural numbers; let the globally best topic-category label sequence be FACTOR, the global maximum match count be EA, the current topic-category label sequence be vecb, the current match count be eb, the current index into the topic-category label sequence be index, and the current simulated annealing step be curstep;
Step A3: Call the K-means clustering tool, passing in the total number of category labels CATEGORYNUM, the number of K-means iterations ITERKMEANS, and the person-label-frequency set PERSONLABELFREQ, all from step A1, to obtain the K-means result set, that is, the person-category label set KMEANSPERSONCATEGORY = {(PERSON_1, CATEGORY_{c1}), (PERSON_2, CATEGORY_{c2}), …, (PERSON_a, CATEGORY_{ca})}, where KMEANSPERSONCATEGORY comes from step A2;
Step A4: Call the LDA modeling tool, passing in the total number of topics TOPICNUM of the LDA document topic extraction model, the Dirichlet parameter of the topic distribution ALPHA, the Dirichlet parameter of the keyword distribution ETA, the number of LDA iterations ITERLDA, and the browsing record-person-keyword set RECORDIDPERSONKEYWORD, to obtain the topic-keyword set of the LDA model, LDATOPICWORD = {(TOPIC_{t1}, KEYWORD_1), (TOPIC_{t2}, KEYWORD_2), …, (TOPIC_{tb}, KEYWORD_b)}, and the topic-person set of the LDA model, LDATOPICPERSON = {(TOPIC_{t1}, PERSON_{p1}), (TOPIC_{t2}, PERSON_{p2}), …, (TOPIC_{tc}, PERSON_{pc})}, where LDATOPICWORD and LDATOPICPERSON come from step A2;
Step A5: Initialize the globally best topic-category label sequence FACTOR with random numbers between 0 and CATEGORYNUM-1. The sequence length is the total number of topics TOPICNUM of the LDA document topic extraction model, and each element of the sequence lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels. Initialize the global maximum match count EA to 0; that is, FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_{TOPICNUM}}, EA = 0;
Step A6: If the simulated annealing temperature T from step A1 is greater than 0.1, execute steps A7 to A25; otherwise execute step A26;
Step A7: Assign a random number to index, the current index into the topic-category label sequence from step A2; the random number lies between 0 and TOPICNUM-1, where TOPICNUM is the total number of topics of the LDA document topic extraction model from step A1;
Step A8: Assign a random number to curstep, the current simulated annealing step from step A2; the random number lies between -1 × STEP and STEP, where STEP is the simulated annealing change step size from step A1;
Step A9: Set the current topic-category label sequence vecb from step A2 equal to the globally best topic-category label sequence FACTOR from step A2; that is, vecb = FACTOR;
Step A10: Change the value at position index of the current topic-category label sequence vecb from step A2 by adding curstep to vecb_{index}, where index is the current index into the topic-category label sequence from step A2 and curstep is the current simulated annealing step from step A2; that is, vecb_{index} = vecb_{index} + curstep;
Step A11: If the value at position index of the current topic-category label sequence vecb from step A2 is less than 0, that is, vecb_{index} < 0, execute step A12; otherwise execute step A13;
Step A12: Set the value at position index of the current topic-category label sequence vecb from step A2 to 0, that is, vecb_{index} = 0; go to step A15;
Step A13: If the value at position index of the current topic-category label sequence vecb from step A2 is greater than CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels from step A1, that is, vecb_{index} > CATEGORYNUM-1, execute step A14; otherwise execute step A15;
Step A14: Set the value at position index of the current topic-category label sequence vecb from step A2 to CATEGORYNUM-1, that is, vecb_{index} = CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels from step A1;
Step A15: Obtain the globally best topic-category label sequence FACTOR from step A2;
Step A16: Execute step B;
Step A17: Assign the result of step B to the global maximum match count EA from step A2;
Step A18: Obtain the current topic-category label sequence vecb from step A2;
Step A19: Execute step B;
Step A20: Assign the result of step B to the current match count eb from step A2;
Step A21: If the current match count eb from step A2 is greater than the global maximum match count EA from step A2, that is, eb > EA, execute step A22; otherwise execute step A25;
Step A22: Generate a random number random in the range 0 to 1;
Step A23: If the random number random from step A22 is less than e^{(eb-EA)/T}, that is, random < e^{(eb-EA)/T}, where eb is the current match count from step A2 and EA is the global maximum match count from step A2, execute step A24; otherwise execute step A25;
Step A24: Set the globally best topic-category label sequence FACTOR from step A2 to the current topic-category label sequence vecb from step A2, and set the global maximum match count EA from step A2 to the current match count eb from step A2; that is, FACTOR = vecb, EA = eb;
Step A25: Lower the simulated annealing temperature T from step A1 using the simulated annealing cooling parameter COOL from step A1, that is, T = T × COOL; then return to step A6;
Step A26: Return the globally best topic-category label sequence from step A2, that is, FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_{TOPICNUM}}, and return the global maximum match count EA from step A2;
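Steps A5 to A26 can be sketched as a single loop. The sketch below follows the step order as written; the function name `anneal_mapping`, the default values for T, STEP, and COOL, the use of integer steps, and the toy cost function are assumptions, since this section does not give them. The cost function stands in for step B.

```python
import math
import random

def anneal_mapping(cost, topic_num, category_num, T=10000.0, step=1, cool=0.95, seed=0):
    """Search for a topic-to-category sequence that maximizes the match count."""
    rng = random.Random(seed)
    # A5: random initial sequence, one category label per topic.
    factor = [rng.randint(0, category_num - 1) for _ in range(topic_num)]
    ea = 0
    while T > 0.1:                                   # A6
        index = rng.randint(0, topic_num - 1)        # A7
        curstep = rng.randint(-step, step)           # A8 (integer step is an assumption)
        vecb = list(factor)                          # A9
        # A10-A14: perturb one position and clamp it to [0, category_num - 1].
        vecb[index] = min(max(vecb[index] + curstep, 0), category_num - 1)
        ea = cost(factor)                            # A15-A17
        eb = cost(vecb)                              # A18-A20
        if eb > ea:                                  # A21
            # Cap the exponent so math.exp cannot overflow at low temperature.
            exponent = min((eb - ea) / T, 700.0)
            if rng.random() < math.exp(exponent):    # A22-A23
                factor, ea = vecb, eb                # A24
        T *= cool                                    # A25
    return factor, ea                                # A26

# Hypothetical stand-in for step B: count positions that agree with a target sequence.
target = [1, 0, 2, 1]
toy_cost = lambda seq: sum(1 for a, b in zip(seq, target) if a == b)
factor, ea = anneal_mapping(toy_cost, 4, 3)
```

Because the probability test in step A23 is reached only when eb > EA, the exponent is positive and e^{(eb-EA)/T} exceeds 1, so as written the test always passes and only improving sequences replace FACTOR.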
Cost function flow, steps B1 to B15:
Step B1: Obtain the topic-category label sequence TMPFACTOR passed in from step A;
Step B2: Let the person-category label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the set of unique person identifiers be LDAPERSON, the total number of persons be LDAPERSONNUM, the set of all category labels be CATEGORY, the match count be SUM, the set of topics of a single person be singlepersontopic, the set of persons in the current category under the LDA document topic extraction model be ldacurcategoryperson, the set of persons in the current category under the K-means clustering algorithm be kmeanscurcategoryperson, the set of overlapping persons be unionperson, and the number of overlapping persons be unionpersonnum;
Step B3: From the topic-person set LDATOPICPERSON of the LDA model from step A2, select the set of unique person identifiers LDAPERSON from step B2 and remove duplicates; that is, LDAPERSON = Π_2(LDATOPICPERSON) = {PERSON_{p1}, PERSON_{p2}, …, PERSON_{pd}};
Step B4: From the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2, select the set of all category labels CATEGORY from step B2 and remove duplicates; that is, CATEGORY = Π_2(KMEANSPERSONCATEGORY) = {CATEGORY_{c1}, CATEGORY_{c2}, …, CATEGORY_{cd}};
Step B5: Let the loop variable be i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons from step B2;
Step B6: From the topic-person set LDATOPICPERSON of the LDA model from step A2, select the set of topics singlepersontopic corresponding to LDAPERSON_i; that is, singlepersontopic = {TOPIC_{t1}, TOPIC_{t2}, …, TOPIC_{tc}}, where singlepersontopic comes from step B2;
Step B7: For each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding category label in the topic-category label sequence TMPFACTOR from step B1: category_{t1} = TMPFACTOR_{TOPIC_{t1}}, category_{t2} = TMPFACTOR_{TOPIC_{t2}}, …, category_{tc} = TMPFACTOR_{TOPIC_{tc}}, where category_{t1}, category_{t2}, …, category_{tc} are category labels, different variables may denote the same category label, and singlepersontopic comes from step B2. Count the number of occurrences of each category label, denoted categorysnum_1, categorysnum_2, …, categorysnum_{CATEGORYNUM}, and find the category label category with the most occurrences. Update the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model from step B2; that is, LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: If the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons from step B2, execute step B9; otherwise increment i by 1, that is, i = i + 1, and execute steps B6 to B7;
Step B9: Let the loop variable be j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of category labels from step A1, and set the match count SUM from step B2 to 0, that is, SUM = 0;
Step B10: From the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2, select the set of persons kmeanscurcategoryperson corresponding to CATEGORY_j; that is, kmeanscurcategoryperson = {PERSON_{kmeans1}, PERSON_{kmeans2}, …, PERSON_{kmeansc}}, where kmeanscurcategoryperson comes from step B2;
Step B11: From the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model from step B2, select the set of persons ldacurcategoryperson corresponding to CATEGORY_j; that is, ldacurcategoryperson = {PERSON_{lda1}, PERSON_{lda2}, …, PERSON_{ldac}}, where ldacurcategoryperson comes from step B2;
Step B12: Compute the intersection unionperson of the person set ldacurcategoryperson from step B2 and the person set kmeanscurcategoryperson from step B2; that is, unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSON_{union1}, PERSON_{union2}, …, PERSON_{unionc}};
Step B13: Count the number of persons in the overlap set unionperson from step B2, assign it to the overlap count unionpersonnum from step B2, and add it to the match count SUM from step B2; that is, SUM = SUM + unionpersonnum;
Step B14: If the loop variable j is greater than the total number of category labels CATEGORYNUM from step A1, execute step B15; otherwise increment j by 1, that is, j = j + 1, and execute steps B10 to B13;
Step B15: Return the match count SUM from step B2.
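The cost function of steps B1 to B15 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `match_count`, the representation of KMEANSPERSONCATEGORY as a dictionary from person to category, and the toy data are assumptions. The steps do not state how ties in the majority vote of step B7 are broken; here the label seen first wins.

```python
from collections import Counter, defaultdict

def match_count(tmpfactor, lda_topic_person, kmeans_person_category):
    """Count persons that K-means and the LDA-derived labeling put in the same category."""
    # B3, B5-B6: collect the topics of each person from the (topic, person) pairs.
    topics_by_person = defaultdict(list)
    for topic, person in lda_topic_person:
        topics_by_person[person].append(topic)
    # B7-B8: map each topic to its category label and keep the most frequent label.
    lda_person_category = {}
    for person, topics in topics_by_person.items():
        votes = Counter(tmpfactor[t] for t in topics)
        lda_person_category[person] = votes.most_common(1)[0][0]
    # B4, B9-B14: per category, intersect the two person sets and sum the sizes.
    total = 0
    for category in set(kmeans_person_category.values()):
        kmeans_people = {p for p, c in kmeans_person_category.items() if c == category}
        lda_people = {p for p, c in lda_person_category.items() if c == category}
        total += len(kmeans_people & lda_people)
    return total  # B15

# Hypothetical toy data: 3 topics, 3 persons, 2 categories.
lda_topic_person = [(0, "a"), (1, "a"), (2, "a"), (2, "b"), (0, "c")]
kmeans_person_category = {"a": 1, "b": 0, "c": 0}
result = match_count([0, 1, 1], lda_topic_person, kmeans_person_category)
```

In this toy case, person a's topics map to labels 0, 1, 1 and person a is therefore assigned category 1, which agrees with K-means; person c also agrees, and person b does not.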
Here, clustering means that the K-means clustering algorithm and the LDA model are each used to cluster people's internet access records, the two clustering results are then verified against each other, and simulated annealing is used to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Steps A3 to A4 provide the initial data needed by simulated annealing. Steps A7 to A12 change the value at a random position of the current solution sequence in simulated annealing. Steps B5 to B8 associate the topic-category label sequence TMPFACTOR from step B1 with the topic-person set LDATOPICPERSON of the LDA model from step A2, yielding the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model from step B2. Steps B9 to B14 count, for each category, the persons who appear in that category in both the K-means clustering result and the LDA model result, sum these counts over all categories, and return the sum at the end of the flow as the cost of the current sequence. Steps A21 to A24 compare eb with EA, and the random number random of step A22 with e^{(eb-EA)/T}, where eb is the current match count from step A2 and EA is the global maximum match count from step A2; when eb > EA and random < e^{(eb-EA)/T}, the value of the globally best topic-category label sequence FACTOR and the value of the global maximum match count EA are updated. The cost values eb and EA are obtained through step B above. The final result returns the global maximum match count EA and the globally best topic-category label sequence FACTOR.
In step A3, the number of K-means iterations ITERKMEANS is 300 and the total number of category labels CATEGORYNUM is 3. In step A4, the Dirichlet parameter of the topic distribution ALPHA is 0.1, the Dirichlet parameter of the keyword distribution ETA is 0.01, and the number of iterations of the LDA document topic extraction model ITERLDA is 2000. The total number of topics TOPICNUM of the LDA document topic extraction model from step A1 is 20.
The invention makes creative use of the web page attributes, keywords, and frequencies in people's internet access records, combined with the K-means algorithm, the LDA document topic extraction model, and the annealing algorithm. K-means clustering is first run on the person-label-frequency set and the LDA document topic extraction model is built from the browsing record-person-keyword set, and the intermediate results are stored. The annealing algorithm is then used to perform bidirectional verification between K-means and LDA and to compute the globally best topic-category label sequence, on the basis of which the network behavior habit clustering result is optimized. The bidirectional verification of K-means and LDA improves sensitivity to the person-category label assignment, and the annealing algorithm improves the efficiency of optimizing the clustering result, which in turn improves clustering accuracy.
Description of the drawings
Figure 1 shows the main flow of the simulated annealing algorithm.
Figure 2 shows the cost function process flow.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the drawings:
As shown in Figure 1, the simulated annealing main flow runs from step A1 to step A26:
Step A1: let the person-label-frequency set be PERSONLABELFREQ = {(PERSON_p1, LABEL_p1, FREQ_p1), (PERSON_p2, LABEL_p2, FREQ_p2), …, (PERSON_pa, LABEL_pa, FREQ_pa)}, where PERSON_p1, PERSON_p2, …, PERSON_pa are unique person identifiers, LABEL_p1, LABEL_p2, …, LABEL_pa are the overall attributes of the web content each person browsed (one person identifier may correspond to several attributes), and FREQ_p1, FREQ_p2, …, FREQ_pa are the weights of those attributes; let the web browsing record-person-keyword set be RECORDIDPERSONKEYWORD = {(RECORDID_r1, PERSON_r1, KEYWORD_r1), (RECORDID_r2, PERSON_r2, KEYWORD_r2), …, (RECORDID_ra, PERSON_ra, KEYWORD_ra)}, where RECORDID_r1, RECORDID_r2, …, RECORDID_ra are unique record identifiers, each composed of the unique person identifier and the date of internet access, PERSON_r1, PERSON_r2, …, PERSON_ra are unique person identifiers, and KEYWORD_r1, KEYWORD_r2, …, KEYWORD_ra are the keywords contained in the browsed web content; let the Dirichlet parameter of the topic distribution be ALPHA, the Dirichlet parameter of the keyword distribution be ETA, the number of iterations of the LDA document topic extraction model be ITERLDA, the number of iterations of the K-means clustering algorithm be ITERKMEANS, the total number of category labels be CATEGORYNUM, the total number of topics of the LDA document topic extraction model be TOPICNUM, the simulated annealing temperature be T, the change step of the simulated annealing algorithm be STEP, and the cooling parameter of the simulated annealing algorithm be COOL;
Step A2: let the result set of the K-means clustering algorithm be the person-category label set KMEANSPERSONCATEGORY, in which category labels are represented by natural numbers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-person set of the LDA model be LDATOPICPERSON, in both of which topics are represented by natural numbers; let the global best topic-category label sequence be FACTOR, the global maximum match count be EA, the current topic-category label sequence be vecb, the current match count be eb, the current index into the topic-category label sequence be index, and the current step of the simulated annealing algorithm be curstep;
Step A3: call the K-means clustering tool, passing in the total number of category labels CATEGORYNUM, the number of iterations ITERKMEANS and the person-label-frequency set PERSONLABELFREQ of step A1, to obtain the K-means result set, i.e. the person-category label set KMEANSPERSONCATEGORY = {(PERSON_1, CATEGORY_c1), (PERSON_2, CATEGORY_c2), …, (PERSON_a, CATEGORY_ca)}, where KMEANSPERSONCATEGORY is defined in step A2;
Step A4: call the LDA modeling tool, passing in the total number of topics TOPICNUM of the LDA document topic extraction model, the Dirichlet parameter ALPHA of the topic distribution, the Dirichlet parameter ETA of the keyword distribution, the number of iterations ITERLDA and the web browsing record-person-keyword set RECORDIDPERSONKEYWORD, to obtain the topic-keyword set of the LDA model, LDATOPICWORD = {(TOPIC_t1, KEYWORD_1), (TOPIC_t2, KEYWORD_2), …, (TOPIC_tb, KEYWORD_b)}, and the topic-person set of the LDA model, LDATOPICPERSON = {(TOPIC_t1, PERSON_p1), (TOPIC_t2, PERSON_p2), …, (TOPIC_tc, PERSON_pc)}, where LDATOPICWORD and LDATOPICPERSON are defined in step A2;
Step A5: initialize the global best topic-category label sequence FACTOR with random numbers between 0 and CATEGORYNUM-1; the sequence length is TOPICNUM, the total number of topics of the LDA document topic extraction model, and every element lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels; initialize the global maximum match count EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: if the simulated annealing temperature T of step A1 is greater than 0.1, execute steps A7 to A25; otherwise execute step A26;
Step A7: assign a random number to the current index index of the topic-category label sequence of step A2; the random number lies between 0 and TOPICNUM-1, where TOPICNUM is the total number of topics of the LDA document topic extraction model of step A1;
Step A8: assign a random number to the current step curstep of the simulated annealing algorithm of step A2; the random number lies between -1 × STEP and STEP, where STEP is the change step of the simulated annealing algorithm of step A1;
Step A9: set the current topic-category label sequence vecb of step A2 equal to the global best topic-category label sequence FACTOR of step A2, i.e. vecb = FACTOR;
Step A10: change the value at position index of the current topic-category label sequence vecb of step A2 by adding curstep to vecb_index, where index is the current index into the topic-category label sequence of step A2 and curstep is the current step of the simulated annealing algorithm of step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: if the value at position index of the current topic-category label sequence vecb of step A2 is less than 0, i.e. vecb_index < 0, execute step A12; otherwise execute step A13;
Step A12: set the value at position index of the current topic-category label sequence vecb of step A2 to 0, i.e. vecb_index = 0; go to step A15;
Step A13: if the value at position index of the current topic-category label sequence vecb of step A2 is greater than CATEGORYNUM-1, i.e. vecb_index > CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels of step A1, execute step A14; otherwise execute step A15;
Step A14: set the value at position index of the current topic-category label sequence vecb of step A2 to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels of step A1;
Step A15: take the global best topic-category label sequence FACTOR of step A2;
Step A16: execute step B;
Step A17: assign the result of step B to the global maximum match count EA of step A2;
Step A18: take the current topic-category label sequence vecb of step A2;
Step A19: execute step B;
Step A20: assign the result of step B to the current match count eb of step A2;
Step A21: if the current match count eb of step A2 is greater than the global maximum match count EA of step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: generate a random number random in the range 0 to 1;
Step A23: if the random number random of step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current match count of step A2 and EA is the global maximum match count of step A2, execute step A24; otherwise execute step A25;
Step A24: set the global best topic-category label sequence FACTOR of step A2 to the current topic-category label sequence vecb of step A2, and set the global maximum match count EA of step A2 to the current match count eb of step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: lower the simulated annealing temperature T of step A1 using the cooling parameter COOL of step A1, i.e. T = T × COOL; then execute step A6;
Step A26: return the global best topic-category label sequence of step A2, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}; return the global maximum match count EA of step A2;
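The main flow of steps A5 to A26 can be sketched in Python roughly as follows. This is only an illustrative sketch, not the patented implementation: the function name `anneal`, the 0-based indexing of the sequence, the use of integer random steps, and the default values of `t`, `step` and `cool` are all assumptions, since the text gives no values for T, STEP or COOL. The exponent is capped before calling `math.exp` purely to avoid floating-point overflow; that cap is also an addition of this sketch.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple


def anneal(
    cost: Callable[[Sequence[int]], int],  # step B: returns the match count
    topic_num: int,                        # TOPICNUM
    category_num: int,                     # CATEGORYNUM
    t: float = 10000.0,                    # assumed initial temperature T
    step: int = 1,                         # assumed STEP
    cool: float = 0.95,                    # assumed COOL
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], int]:
    rng = rng or random.Random()
    # Step A5: random initial topic -> category sequence, EA = 0.
    factor = [rng.randint(0, category_num - 1) for _ in range(topic_num)]
    ea = 0
    while t > 0.1:                                 # step A6
        index = rng.randint(0, topic_num - 1)      # step A7
        curstep = rng.randint(-step, step)         # step A8 (integer step assumed)
        vecb = list(factor)                        # step A9
        # Steps A10 to A14: perturb one position, then clamp to [0, CATEGORYNUM-1].
        vecb[index] = min(max(vecb[index] + curstep, 0), category_num - 1)
        ea = cost(factor)                          # steps A15 to A17
        eb = cost(vecb)                            # steps A18 to A20
        if eb > ea:                                # step A21
            rnd = rng.random()                     # step A22
            exponent = min((eb - ea) / t, 700.0)   # cap added to avoid overflow
            if rnd < math.exp(exponent):           # step A23
                factor, ea = vecb, eb              # step A24
        t *= cool                                  # step A25
    return factor, ea                              # step A26
```

Passing the cost function in as a parameter keeps the search loop independent of how the K-means and LDA results are stored.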
As shown in Figure 2, the cost function process flow, step B, runs from step B1 to step B15:
Step B1: obtain the topic-category label sequence TMPFACTOR passed in from step A;
Step B2: let the person-category label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the set of unique person identifiers be LDAPERSON, the total number of persons be LDAPERSONNUM, the set of all category labels be CATEGORY, the match count be SUM, the set of topics corresponding to a single person be singlepersontopic, the set of persons in the current category in the LDA document topic extraction model be ldacurcategoryperson, the set of persons in the current category in the K-means clustering algorithm be kmeanscurcategoryperson, the set of coinciding persons be unionperson, and the number of coinciding persons be unionpersonnum;
Step B3: select the set of unique person identifiers LDAPERSON of step B2 from the topic-person set LDATOPICPERSON of the LDA model of step A2 and remove duplicates, i.e. LDAPERSON = Π_2(LDATOPICPERSON) = {PERSON_p1, PERSON_p2, …, PERSON_pd};
Step B4: select the set of all category labels CATEGORY of step B2 from the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2 and remove duplicates, i.e. CATEGORY = Π_2(KMEANSPERSONCATEGORY) = {CATEGORY_c1, CATEGORY_c2, …, CATEGORY_cd};
Step B5: let the loop variable be i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2;
Step B6: select from the topic-person set LDATOPICPERSON of the LDA model of step A2 the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPIC_t1, TOPIC_t2, …, TOPIC_tc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding category label in the topic-category label sequence TMPFACTOR of step B1, i.e. category_t1 = TMPFACTOR_TOPICt1, category_t2 = TMPFACTOR_TOPICt2, …, category_tc = TMPFACTOR_TOPICtc, where category_t1, category_t2, …, category_tc are category labels, different variables may denote the same category label, and singlepersontopic is defined in step B2; count the number of occurrences of each category label, denoted categorysnum_1, categorysnum_2, …, categorysnum_CATEGORYNUM, and find the category label category with the most occurrences; update the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: if the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2, execute step B9; otherwise increment i by 1, i.e. i = i + 1, and execute steps B6 to B7;
Step B9: let the loop variable be j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of category labels of step A1; set the match count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: select from the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2 the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSON_kmeans1, PERSON_kmeans2, …, PERSON_kmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: select from the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2 the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSON_lda1, PERSON_lda2, …, PERSON_ldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the person set ldacurcategoryperson of step B2 and the person set kmeanscurcategoryperson of step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSON_union1, PERSON_union2, …, PERSON_unionc};
Step B13: count the number of persons in the coinciding person set unionperson of step B2, assign it to the number of coinciding persons unionpersonnum of step B2, and add it to the match count SUM of step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: if the loop variable j is greater than the total number of category labels CATEGORYNUM of step A1, execute step B15; otherwise increment j by 1, i.e. j = j + 1, and execute steps B10 to B13;
Step B15: return the match count SUM of step B2.
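The cost function of steps B1 to B15 can be illustrated with the following Python sketch. It is a sketch under stated assumptions rather than the patented implementation: the function name `match_count`, the representation of the sets as lists of tuples, and 0-based topic numbers used directly as indices into the sequence are assumptions. The text does not say how ties in the majority vote of step B7 are resolved; here the label encountered first wins, which is simply the behavior of `Counter.most_common`.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def match_count(
    tmpfactor: Sequence[int],                                  # TMPFACTOR
    lda_topic_person: Iterable[Tuple[int, Hashable]],          # LDATOPICPERSON
    kmeans_person_category: Iterable[Tuple[Hashable, int]],    # KMEANSPERSONCATEGORY
) -> int:
    kmeans_pairs = list(kmeans_person_category)

    # Steps B3 and B6: collect the topics of each distinct person.
    topics_by_person: Dict[Hashable, List[int]] = defaultdict(list)
    for topic, person in lda_topic_person:
        topics_by_person[person].append(topic)

    # Steps B5 to B8: map each topic to its category label and keep the
    # most frequent label per person (LDAPERSONCATEGORY).
    lda_person_category: Dict[Hashable, int] = {}
    for person, topics in topics_by_person.items():
        votes = Counter(tmpfactor[topic] for topic in topics)
        lda_person_category[person] = votes.most_common(1)[0][0]

    # Step B4: distinct category labels found by K-means.
    categories = {category for _, category in kmeans_pairs}

    # Steps B9 to B14: per category, intersect the two person sets and sum.
    total = 0
    for category in categories:
        kmeans_set = {p for p, c in kmeans_pairs if c == category}
        lda_set = {p for p, c in lda_person_category.items() if c == category}
        total += len(kmeans_set & lda_set)
    return total                                               # step B15
```

For example, with `tmpfactor = [0, 1, 1]`, a person linked to topics 0, 1 and 2 receives label 1 by majority vote and counts as a match only if K-means also placed that person in category 1.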
Wherein, the clustering performs cluster analysis on people's internet access records with both the K-means clustering algorithm and the LDA model, then verifies the two clustering results against each other, and uses the simulated annealing algorithm to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
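Before a K-means tool can be called, the person-label-frequency triples have to be arranged as one numeric vector per person. The text does not describe this preparation, so the following Python sketch is only one plausible arrangement: the function name `build_feature_matrix`, the alphabetical column order and the summing of repeated (person, label) pairs are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def build_feature_matrix(
    person_label_freq: Iterable[Tuple[str, str, float]],  # PERSONLABELFREQ triples
) -> Tuple[List[str], List[str], List[List[float]]]:
    triples = list(person_label_freq)
    # One column per distinct label, in a fixed (sorted) order.
    labels = sorted({label for _, label, _ in triples})
    column = {label: i for i, label in enumerate(labels)}

    rows: Dict[str, List[float]] = {}
    for person, label, freq in triples:
        vector = rows.setdefault(person, [0.0] * len(labels))
        vector[column[label]] += freq  # repeated pairs are summed (assumption)

    persons = sorted(rows)
    return persons, labels, [rows[p] for p in persons]
```

The returned matrix has one row per person and one column per label, so any K-means implementation that accepts a list of equal-length vectors can consume it.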
Steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-category label sequence TMPFACTOR of step B1 with the topic-person set LDATOPICPERSON of the LDA model of step A2, yielding the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2; steps B9 to B14 count, for each category, the persons who appear in that category in both the K-means clustering result and the LDA model result, add up these counts over all categories, and return the total as the cost of the current sequence; steps A14 to A18 compare eb with EA, and compare the random number random of step A15 with e^((eb-EA)/T), where eb is the current match count of step A2 and EA is the global maximum match count of step A2; when eb > EA and random < e^((eb-EA)/T), the values of the global best topic-category label sequence FACTOR and of the global maximum match count EA are updated; the cost values eb and EA are obtained through step B above; the final result returns the global maximum match count EA and the global best topic-category label sequence FACTOR.
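The acceptance test described above can be written out as a small Python sketch. The function names and the cap on the exponent are assumptions added for illustration; the cap only prevents `math.exp` from overflowing when (eb-EA)/T is large. Note that under the stated condition eb > EA the exponent is positive, so e^((eb-EA)/T) is greater than 1 and any random number drawn from the range 0 to 1 falls below it.

```python
import math


def acceptance_threshold(eb: int, ea: int, t: float, max_exponent: float = 700.0) -> float:
    # e^((eb-EA)/T), with the exponent capped to stay within float range.
    return math.exp(min((eb - ea) / t, max_exponent))


def accepts(eb: int, ea: int, t: float, rnd: float) -> bool:
    # Update FACTOR and EA only when eb > EA and random < e^((eb-EA)/T).
    return eb > ea and rnd < acceptance_threshold(eb, ea, t)
```

For instance, with eb = 5, EA = 3 and T = 1.0 the threshold is e^2, roughly 7.39, so the candidate sequence is accepted for every possible random draw.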
Wherein, the number of iterations ITERKMEANS of the K-means clustering algorithm in step A3 is 300 and the total number of category labels CATEGORYNUM is 3; in step A4, the Dirichlet parameter ALPHA of the topic distribution is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, and the number of iterations ITERLDA of the LDA document topic extraction model is 2000; the total number of topics TOPICNUM of the LDA document topic extraction model in step A1 is 20.
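The parameter values listed above can be gathered in one place, for example as a Python dataclass. The class and field names are hypothetical, and the values of `temperature`, `step` and `cool` are placeholders only, because the text does not state values for T, STEP or COOL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringParams:
    # Values stated in the text.
    iter_kmeans: int = 300     # ITERKMEANS
    category_num: int = 3      # CATEGORYNUM
    alpha: float = 0.1         # ALPHA, Dirichlet parameter of the topic distribution
    eta: float = 0.01          # ETA, Dirichlet parameter of the keyword distribution
    iter_lda: int = 2000       # ITERLDA
    topic_num: int = 20        # TOPICNUM
    # Placeholders: T, STEP and COOL are not given in the text.
    temperature: float = 10000.0
    step: int = 1
    cool: float = 0.95

    def search_space_size(self) -> int:
        # Number of distinct topic -> category label sequences of length TOPICNUM.
        return self.category_num ** self.topic_num
```

With 3 category labels and 20 topics there are 3^20 = 3,486,784,401 possible sequences, which suggests why a heuristic search such as simulated annealing is used instead of evaluating the cost of every sequence.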
The present invention can be combined with a computer system to carry out network behavior habit clustering of persons automatically.
The present invention creatively uses the web page attributes, keywords and frequencies in people's internet access records, combined with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm. First, K-means clustering is run on the person-label-frequency set and the LDA document topic extraction model is generated from the browsing record-person-keyword set, and the intermediate results are stored. The annealing algorithm is then used to perform bidirectional verification between K-means and LDA and to compute the global best topic-category label sequence, on the basis of which the network behavior habit clustering result is optimized. The bidirectional verification of K-means and LDA improves the sensitivity of the person-category label assignment, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Claims (4)
1. A network behavior habit clustering method based on bidirectional verification between K-means and Latent Dirichlet Allocation (LDA), characterized in that it uses the web page attributes, keywords and frequencies in people's internet access records, combined with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm; first, K-means clustering is run on the person-label-frequency set and the LDA document topic extraction model is generated from the browsing record-person-keyword set, and the intermediate results are stored; the annealing algorithm is then used to perform bidirectional verification between K-means and LDA and to compute the global best topic-category label sequence, on the basis of which the network behavior habit clustering result is optimized; wherein the method comprises the simulated annealing main flow, step A, and the cost function process flow, step B:
The simulated annealing main flow runs from step A1 to step A26:
Step A1: let the person-label-frequency set be PERSONLABELFREQ = {(PERSON_p1, LABEL_p1, FREQ_p1), (PERSON_p2, LABEL_p2, FREQ_p2), …, (PERSON_pa, LABEL_pa, FREQ_pa)}, where PERSON_p1, PERSON_p2, …, PERSON_pa are unique person identifiers, LABEL_p1, LABEL_p2, …, LABEL_pa are the overall attributes of the web content each person browsed (one person identifier may correspond to several attributes), and FREQ_p1, FREQ_p2, …, FREQ_pa are the weights of those attributes; let the web browsing record-person-keyword set be RECORDIDPERSONKEYWORD = {(RECORDID_r1, PERSON_r1, KEYWORD_r1), (RECORDID_r2, PERSON_r2, KEYWORD_r2), …, (RECORDID_ra, PERSON_ra, KEYWORD_ra)}, where RECORDID_r1, RECORDID_r2, …, RECORDID_ra are unique record identifiers, each composed of the unique person identifier and the date of internet access, PERSON_r1, PERSON_r2, …, PERSON_ra are unique person identifiers, and KEYWORD_r1, KEYWORD_r2, …, KEYWORD_ra are the keywords contained in the browsed web content; let the Dirichlet parameter of the topic distribution be ALPHA, the Dirichlet parameter of the keyword distribution be ETA, the number of iterations of the LDA document topic extraction model be ITERLDA, the number of iterations of the K-means clustering algorithm be ITERKMEANS, the total number of category labels be CATEGORYNUM, the total number of topics of the LDA document topic extraction model be TOPICNUM, the simulated annealing temperature be T, the change step of the simulated annealing algorithm be STEP, and the cooling parameter of the simulated annealing algorithm be COOL;
Step A2: let the result set of the K-means clustering algorithm be the person-category label set KMEANSPERSONCATEGORY, in which category labels are represented by natural numbers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-person set of the LDA model be LDATOPICPERSON, in both of which topics are represented by natural numbers; let the global best topic-category label sequence be FACTOR, the global maximum match count be EA, the current topic-category label sequence be vecb, the current match count be eb, the current index into the topic-category label sequence be index, and the current step of the simulated annealing algorithm be curstep;
Step A3: call the K-means clustering tool, passing in the total number of category labels CATEGORYNUM, the number of iterations ITERKMEANS and the person-label-frequency set PERSONLABELFREQ of step A1, to obtain the K-means result set, i.e. the person-category label set KMEANSPERSONCATEGORY = {(PERSON_1, CATEGORY_c1), (PERSON_2, CATEGORY_c2), …, (PERSON_a, CATEGORY_ca)}, where KMEANSPERSONCATEGORY is defined in step A2;
Step A4: call the LDA modeling tool, passing in the total number of topics TOPICNUM of the LDA document topic extraction model, the Dirichlet parameter ALPHA of the topic distribution, the Dirichlet parameter ETA of the keyword distribution, the number of iterations ITERLDA and the web browsing record-person-keyword set RECORDIDPERSONKEYWORD, to obtain the topic-keyword set of the LDA model, LDATOPICWORD = {(TOPIC_t1, KEYWORD_1), (TOPIC_t2, KEYWORD_2), …, (TOPIC_tb, KEYWORD_b)}, and the topic-person set of the LDA model, LDATOPICPERSON = {(TOPIC_t1, PERSON_p1), (TOPIC_t2, PERSON_p2), …, (TOPIC_tc, PERSON_pc)}, where LDATOPICWORD and LDATOPICPERSON are defined in step A2;
Step A5: initialize the global best topic-category label sequence FACTOR with random numbers between 0 and CATEGORYNUM-1; the sequence length is TOPICNUM, the total number of topics of the LDA document topic extraction model, and every element lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels; initialize the global maximum match count EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: if the simulated annealing temperature T of step A1 is greater than 0.1, execute steps A7 to A25; otherwise execute step A26;
Step A7: assign a random number to the current index index of the topic-category label sequence of step A2; the random number lies between 0 and TOPICNUM-1, where TOPICNUM is the total number of topics of the LDA document topic extraction model of step A1;
Step A8: assign a random number to the current step curstep of the simulated annealing algorithm of step A2; the random number lies between -1 × STEP and STEP, where STEP is the change step of the simulated annealing algorithm of step A1;
Step A9: set the current topic-category label sequence vecb of step A2 equal to the global best topic-category label sequence FACTOR of step A2, i.e. vecb = FACTOR;
Step A10: change the value at position index of the current topic-category label sequence vecb of step A2 by adding curstep to vecb_index, where index is the current index into the topic-category label sequence of step A2 and curstep is the current step of the simulated annealing algorithm of step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: if the value at position index of the current topic-category label sequence vecb of step A2 is less than 0, i.e. vecb_index < 0, execute step A12; otherwise execute step A13;
Step A12: set the value at position index of the current topic-category label sequence vecb of step A2 to 0, i.e. vecb_index = 0; go to step A15;
Step A13: if the value at position index of the current topic-category label sequence vecb of step A2 is greater than CATEGORYNUM-1, i.e. vecb_index > CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels of step A1, execute step A14; otherwise execute step A15;
Step A14: set the value at position index of the current topic-category label sequence vecb of step A2 to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the total number of category labels of step A1;
Step A15: take the global best topic-category label sequence FACTOR of step A2;
Step A16: execute step B;
Step A17: assign the result of step B to the global maximum match count EA of step A2;
Step A18: take the current topic-category label sequence vecb of step A2;
Step A19: execute step B;
Step A20: assign the result of step B to the current match count eb of step A2;
Step A21: if the current match count eb of step A2 is greater than the global maximum match count EA of step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: generate a random number random in the range 0 to 1;
Step A23: if the random number random of step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current match count of step A2 and EA is the global maximum match count of step A2, execute step A24; otherwise execute step A25;
Step A24: set the global best topic-category label sequence FACTOR of step A2 to the current topic-category label sequence vecb of step A2, and set the global maximum match count EA of step A2 to the current match count eb of step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: lower the simulated annealing temperature T of step A1 using the cooling parameter COOL of step A1, i.e. T = T × COOL; then execute step A6;
Step A26: return the global best topic-category label sequence of step A2, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}; return the global maximum match count EA of step A2;
The cost function process flow, step B, runs from step B1 to step B15:
Step B1: obtain the topic-category label sequence TMPFACTOR passed in from step A;
Step B2: let the person-category label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the set of unique person identifiers be LDAPERSON, the total number of persons be LDAPERSONNUM, the set of all category labels be CATEGORY, the match count be SUM, the set of topics corresponding to a single person be singlepersontopic, the set of persons in the current category in the LDA document topic extraction model be ldacurcategoryperson, the set of persons in the current category in the K-means clustering algorithm be kmeanscurcategoryperson, the set of coinciding persons be unionperson, and the number of coinciding persons be unionpersonnum;
Step B3: select the set of unique person identifiers LDAPERSON of step B2 from the topic-person set LDATOPICPERSON of the LDA model of step A2 and remove duplicates, i.e. LDAPERSON = Π_2(LDATOPICPERSON) = {PERSON_p1, PERSON_p2, …, PERSON_pd};
Step B4: select the set of all category labels CATEGORY of step B2 from the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2 and remove duplicates, i.e. CATEGORY = Π_2(KMEANSPERSONCATEGORY) = {CATEGORY_c1, CATEGORY_c2, …, CATEGORY_cd};
Step B5: let the loop variable be i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2;
Step B6: select from the topic-person set LDATOPICPERSON of the LDA model of step A2 the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPIC_t1, TOPIC_t2, …, TOPIC_tc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding category label in the topic-category label sequence TMPFACTOR of step B1, i.e. category_t1 = TMPFACTOR_TOPICt1, category_t2 = TMPFACTOR_TOPICt2, …, category_tc = TMPFACTOR_TOPICtc, where category_t1, category_t2, …, category_tc are category labels, different variables may denote the same category label, and singlepersontopic is defined in step B2; count the number of occurrences of each category label, denoted categorysnum_1, categorysnum_2, …, categorysnum_CATEGORYNUM, and find the category label category with the most occurrences; update the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: if the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2, execute step B9; otherwise increment i by 1, i.e. i = i + 1, and execute steps B6 to B7;
Step B9: let the loop variable be j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of category labels of step A1; set the match count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: select from the person-category label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2 the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSON_kmeans1, PERSON_kmeans2, …, PERSON_kmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: select from the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2 the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSON_lda1, PERSON_lda2, …, PERSON_ldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the person set ldacurcategoryperson of step B2 and the person set kmeanscurcategoryperson of step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSON_union1, PERSON_union2, …, PERSON_unionc};
Step B13: count the number of persons in the coinciding person set unionperson of step B2, assign it to the number of coinciding persons unionpersonnum of step B2, and add it to the match count SUM of step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: if the loop variable j is greater than the total number of category labels CATEGORYNUM of step A1, execute step B15; otherwise increment j by 1, i.e. j = j + 1, and execute steps B10 to B13;
Step B15: return the match count SUM of step B2.
2. The network behavior habit clustering method based on K-means and LDA bidirectional verification according to claim 1, characterized in that the clustering performs cluster analysis on people's internet access records with both the K-means clustering algorithm and the LDA model, then verifies the two clustering results against each other, and uses the simulated annealing algorithm to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
3. The network behavior habit clustering method based on K-means and LDA bidirectional verification according to claim 1, characterized in that steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-category label sequence TMPFACTOR of step B1 with the topic-person set LDATOPICPERSON of the LDA model of step A2, yielding the person-category label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2; steps B9 to B14 count, for each category, the persons who appear in that category in both the K-means clustering result and the LDA model result, add up these counts over all categories, and return the total as the cost of the current sequence; steps A14 to A18 compare eb with EA, and compare the random number random of step A15 with e^((eb-EA)/T), where eb is the current match count of step A2 and EA is the global maximum match count of step A2; when eb > EA and random < e^((eb-EA)/T), the values of the global best topic-category label sequence FACTOR and of the global maximum match count EA are updated; the cost values eb and EA are obtained through step B above; the final result returns the global maximum match count EA and the global best topic-category label sequence FACTOR.
4. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1,
characterized in that, in step A3, the number of iterations ITERKMEANS of the K-means clustering algorithm is 300 and the total number of classification labels CATEGORYNUM is 3;
in step A4, the Dirichlet parameter ALPHA of the topic distribution is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01,
and the number of iterations ITERLDA of the LDA document topic extraction model is 2000; and the total number of topics TOPICNUM
of the LDA document topic extraction model of step A1 is 20.
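For readers who want to reproduce the setup, the values in claim 4 can be collected in one place. The sketch below is an assumption-laden illustration: the class `ClusteringParams` and the two helper functions are hypothetical, and mapping the values onto scikit-learn's `KMeans` and `LatentDirichletAllocation` keyword names is one possible choice that the patent does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringParams:
    """Values stated in claim 4; field names mirror the claim's identifiers."""
    iter_kmeans: int = 300   # ITERKMEANS
    category_num: int = 3    # CATEGORYNUM
    alpha: float = 0.1       # ALPHA, Dirichlet prior of the topic distribution
    eta: float = 0.01        # ETA, Dirichlet prior of the keyword distribution
    iter_lda: int = 2000     # ITERLDA
    topic_num: int = 20      # TOPICNUM


def kmeans_kwargs(p):
    # Assumed mapping onto sklearn.cluster.KMeans keyword arguments.
    return {"n_clusters": p.category_num, "max_iter": p.iter_kmeans}


def lda_kwargs(p):
    # Assumed mapping onto sklearn.decomposition.LatentDirichletAllocation.
    return {
        "n_components": p.topic_num,
        "doc_topic_prior": p.alpha,
        "topic_word_prior": p.eta,
        "max_iter": p.iter_lda,
    }
```

With scikit-learn installed, `KMeans(**kmeans_kwargs(ClusteringParams()))` would build the clusterer. Note that scikit-learn's LDA uses variational inference, so an iteration count chosen for a different LDA implementation may not transfer directly.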
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610565749.XA CN106202480B (en) | 2016-07-19 | 2016-07-19 | A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106202480A (en) | 2016-12-07 |
CN106202480B (en) | 2019-06-11 |
Family
ID=57493136
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610565749.XA Active CN106202480B (en) | 2016-07-19 | 2016-07-19 | A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202480B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984551A (en) * | 2017-05-31 | 2018-12-11 | 广州智慧城市发展研究院 | A kind of recommended method and system based on the multi-class soft cluster of joint |
CN107305614B (en) * | 2017-08-12 | 2020-05-26 | 西安电子科技大学 | Method for processing big data based on MLDM algorithm meeting secondary aggregation |
CN108460630B (en) * | 2018-02-12 | 2021-11-02 | 广州虎牙信息科技有限公司 | Method and device for carrying out classification analysis based on user data |
CN110276503B (en) * | 2018-03-14 | 2023-04-21 | 吉旗物联科技(上海)有限公司 | Method for automatically identifying cold chain vehicle task |
CN108897815B (en) * | 2018-06-20 | 2021-07-16 | 淮阴工学院 | Multi-label text classification method based on similarity model and FastText |
CN112800419A (en) * | 2019-11-13 | 2021-05-14 | 北京数安鑫云信息技术有限公司 | Method, apparatus, medium and device for identifying IP group |
CN112214515A (en) * | 2020-10-16 | 2021-01-12 | 平安国际智慧城市科技股份有限公司 | Data automatic matching method and device, electronic equipment and storage medium |
CN112883154B (en) * | 2021-01-28 | 2022-02-01 | 平安科技(深圳)有限公司 | Text topic mining method and device, computer equipment and storage medium |
CN113204641B (en) * | 2021-04-12 | 2022-09-02 | 武汉大学 | Annealing attention rumor identification method and device based on user characteristics |
CN113312450B (en) * | 2021-05-28 | 2022-05-31 | 北京航空航天大学 | Method for preventing text stream sequence conversion attack |
CN114742869B (en) * | 2022-06-15 | 2022-08-16 | 西安交通大学医学院第一附属医院 | Brain neurosurgery registration method based on pattern recognition and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194012A (en) * | 2011-06-17 | 2011-09-21 | 清华大学 | Microblog topic detecting method and system |
CN102609719A (en) * | 2012-01-19 | 2012-07-25 | 北京工业大学 | Method for identifying place image on the basis of improved probabilistic topic model |
CN103632166A (en) * | 2013-12-04 | 2014-03-12 | 西安电子科技大学 | Aurora image classification method based on latent theme combining with saliency information |
CN103678500A (en) * | 2013-11-18 | 2014-03-26 | 南京邮电大学 | Data mining improved type K mean value clustering method based on linear discriminant analysis |
CN103793501A (en) * | 2014-01-20 | 2014-05-14 | 惠州学院 | Theme community discovery method based on social network |
CN104462286A (en) * | 2014-11-27 | 2015-03-25 | 重庆邮电大学 | Microblog topic finding method based on modified LDA |
CN104750856A (en) * | 2015-04-16 | 2015-07-01 | 天天艾米(北京)网络科技有限公司 | System and method for multi-dimensional synergic recommendation |
CN105303199A (en) * | 2015-12-08 | 2016-02-03 | 南京信息工程大学 | Data fragment type identification method based on content characteristics and K-means |
CN105677769A (en) * | 2015-12-29 | 2016-06-15 | 广州神马移动信息科技有限公司 | Keyword recommending method and system based on latent Dirichlet allocation (LDA) model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090290508A1 (en) * | 2008-05-22 | 2009-11-26 | At&T Labs, Inc. | Method for optimizing network "Point of Presence" locations |
- 2016-07-19: CN201610565749.XA, patent CN106202480B (en), Active
Non-Patent Citations (3)
Title |
---|
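"Tropical wood species recognition system based on multi-feature extractors and classifiers"; Marzuki Khalid et al.; 2011 2nd International Conference on Instrumentation Control and Automation; 20120119; full text |
"Image classification algorithm based on the latent Dirichlet allocation model" (in Chinese); Yang Sai et al.; Computer Engineering; 20120731; Vol. 38, No. 14; full text |
"Research on a microblog recommendation model based on latent Dirichlet allocation" (in Chinese); Tang Xiaobo et al.; Information Science; 20150228; Vol. 33, No. 2; full text |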
Also Published As
Publication number | Publication date |
---|---|
CN106202480A (en) | 2016-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106202480B (en) | A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification | |
CN104951548B (en) | A kind of computational methods and system of negative public sentiment index | |
Zhang et al. | A multi-objective evolutionary approach for mining frequent and high utility itemsets | |
CN105022754A (en) | Social network based object classification method and apparatus | |
CN110750645A (en) | Cross-domain false comment identification method based on countermeasure training | |
Sharma et al. | Trend analysis in machine learning research using text mining | |
Guo et al. | Multi-label classification methods for green computing and application for mobile medical recommendations | |
Gan et al. | R-RNN: Extracting user recent behavior sequence for click-through rate prediction | |
Yu et al. | Data cleaning for personal credit scoring by utilizing social media data: An empirical study | |
Liu et al. | A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge | |
Naghavipour et al. | Hybrid metaheuristics for QoS-aware service composition: a systematic mapping study | |
CN106919997A (en) | A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA | |
Yu et al. | Spectrum-enhanced pairwise learning to rank | |
Zhang et al. | Multi-dimension topic mining based on hierarchical semantic graph model | |
Sharma et al. | A study of tree based machine learning techniques for restaurant reviews | |
Niu et al. | Deep adversarial autoencoder recommendation algorithm based on group influence | |
Xiao | A Survey of Document Clustering Techniques & Comparison of LDA and moVMF | |
Dehghan et al. | An improvement in the quality of expert finding in community question answering networks | |
CN106649380A (en) | Hot spot recommendation method and system based on tag | |
Niham et al. | Utilization of Big Data in Libraries by Using Data Mining | |
Zhang | A short introduction to data mining and its applications | |
Ahn et al. | Using genetic algorithms to optimize nearest neighbors for data mining | |
Xin et al. | When factorization meets heterogeneous latent topics: an interpretable cross-site recommendation framework | |
Singh | Sentiment analysis of online mobile reviews | |
Osial et al. | Smartphone recommendation system using web data integration techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||