CN106202480B - Network behavior habit clustering method based on K-means and LDA bi-directional verification - Google Patents

Network behavior habit clustering method based on K-means and LDA bi-directional verification

Info

Publication number
CN106202480B
Authority
CN
China
Prior art keywords
person
classification label
topic
LDA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610565749.XA
Other languages
Chinese (zh)
Other versions
CN106202480A (en)
Inventor
朱全银
辛诚
李翔
许康
潘舒新
孙青怡
周泓
严云洋
胡荣林
冯万利
王留洋
王海云
袁媛
唐海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201610565749.XA priority Critical patent/CN106202480B/en
Publication of CN106202480A publication Critical patent/CN106202480A/en
Application granted granted Critical
Publication of CN106202480B publication Critical patent/CN106202480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/313 - Selection or weighting of terms for indexing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network behavior habit clustering method based on bi-directional verification of K-means and LDA. The method uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm. First, K-means clustering is performed on the person-label-frequency set and the LDA document topic model is generated from the browsing-record-person-keyword set, and the intermediate results are stored. The annealing algorithm then carries out bi-directional verification between K-means and LDA and computes the globally best topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to person classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Description

Network behavior habit clustering method based on K-means and LDA bi-directional verification
Technical field
The invention belongs to the fields of cluster analysis and optimization algorithms, and in particular relates to a network behavior habit clustering method based on bi-directional verification of K-means and Latent Dirichlet Allocation (LDA), which is used to optimize clustering results and thereby improve clustering accuracy, increasing the use value of persons' internet-record information.
Background art
Clustering methods for network behavior habit data play an important role in studying people's online habits. With the continuing spread of the internet, more and more people choose to obtain the information they are interested in through the network. The volume of content that people browse online is huge; analyzing these data manually is not only inefficient but also inaccurate. Cluster analysis, combined with bi-directional verification against a second clustering method, can improve both the efficiency and the accuracy of the analysis. Common clustering algorithms include K-means clustering and the LDA document topic extraction model; common optimization algorithms include simulated annealing and genetic algorithms.
Related papers on clustering algorithms and optimization algorithms include: Pang Feng. The principle of simulated annealing and its application in optimization problems. Master's thesis, Jilin University, 2006; Li Xiangping, Zhang Hongyang. The principle and improvement of simulated annealing. Software Guide, 2008(4): 47-48; Yang Mengduo, Li Fanchang, Zhang Li. Ten years of progress in Lie-group machine learning. Journal of Computers, 2015(7): 1337-1356; Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, Vol.3: 993-1022; Yuan J, Gao F, Ho Q, et al. LightLDA: Big Topic Models on Modest Computer Clusters. Proceedings of the International Conference on World Wide Web, ACM, 2015. The existing research foundation of Zhu Quanyin et al. includes: Li Xiang, Zhu Quanyin. Collaborative filtering recommendation based on joint clustering and rating-matrix sharing. Computer Science and Exploration, 2014, Vol.8(6): 751-759; Suqun Cao, Quanyin Zhu, Zhiwei Hou. Customer Segmentation Based on a Novel Hierarchical Clustering Algorithm. 2009, p:1-5; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Suqun Cao, Zhiwei Hou, Liuyang Wang, Quanyin Zhu. Kernelized Fuzzy Fisher Criterion based Clustering Algorithm. DCABES 2010, p:87-91; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093; Suqun Cao, Gelan Yang, Quanyin Zhu, Haihei Zhai. A novel feature extraction method for mechanical part recognition. Applied Mechanics and Materials, 2011, p:116-121; Pei Zhou, Quanyin Zhu. Multi-factor Matching Method for Basic Information of Science and Technology Experts Based on Web Mining. 2012, p:718-720; Jianping Deng, Fengwen Cao, Quanyin Zhu, Yu Zhang. The Web Data Extracting and Application for Shop Online Based on Commodities Classified. Communications in Computer and Information Science, Vol.234(4): 120-128; Hui Zong, Quanyin Zhu, Ming Sun, Yahong Zhang. The case study for human resource management research based on web mining and semantic analysis. Applied Mechanics and Materials, Vol.488, 2014, p:1336-1339.
Patents applied for, published and granted by Zhu Quanyin et al. include: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and an adaptive sliding window. Chinese patent ZL 2011 1 0423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price forecasting method based on dichotomous data backfilling and disturbance factors. Chinese patent ZL 2011 1 0422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Chen Ting, Cao Suqun. A data preprocessing method for neural-network-based multi-variety commodity price forecasting. Chinese patent ZL 2012 1 0325368.6, 2016.06.08; Zhu Quanyin, Pan Lu, Liu Wenru, Li Xiang, Zhou Hong, Hu Ronglin, Ding Jin, Jin Ying, Shao Wujie, Tang Haibo. An incremental-learning multi-level binary classification method for science and technology news. Chinese patent publication No. CN 105205163A, 2015.12.30; Zhu Quanyin, Yan Yunyang, Huang Taoyi, Ming, Zhang Yuyang, Xin Cheng. An implementation method for campus personalized palm services and user behavior habit analysis. Chinese patent publication No. CN 104731971A, 2015.06.24; Zhu Quanyin, Shen Enqiang, Qian Yaping, et al. A multi-weight adaptive student learning behavior analysis method based on K-means clustering. Chinese patent application No. 201610222553.0, 2016.04.13; Zhu Quanyin, Shao Wujie, Tang Haibo, Zhou Hong, Li Xiang, Hu Ronglin, Jin Ying, Cao Suqun, Pan Shuxin. A multi-level multi-class classification method for science and technology news headlines. Chinese patent publication No. CN 105205163A, 2016.07.13; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN 105654267A, 2016.06.08.
LDA document topic extraction model:
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a word layer, a topic layer and a document layer. "Generation model" means that each word of an article is regarded as being obtained by the process "select a topic with a certain probability, then select a word from that topic with a certain probability". Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. However, the bag-of-words approach does not consider the order between words, which simplifies the complexity of the problem and also provides an opportunity for improving the model. Each document represents a probability distribution over a number of topics, and each topic represents a probability distribution over many words.
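To make the bag-of-words pipeline above concrete, the following is a minimal sketch in Python, assuming scikit-learn's LatentDirichletAllocation as a stand-in for the "LDA modeling tool" the method later calls; the toy person documents, the topic count of 5 and the iteration count are illustrative choices, not values from the patent (the patent's own parameter values are given later in the description).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One "document" per person: the concatenated keywords of that person's
# browsing records (bag of words). The texts below are made up.
person_docs = [
    "news sports football match score football",
    "music video streaming pop concert video",
    "news politics election vote policy news",
]

counts = CountVectorizer().fit_transform(person_docs)   # word-frequency vectors
lda = LatentDirichletAllocation(
    n_components=5,          # number of topics (the patent's TOPICNUM is 20)
    doc_topic_prior=0.1,     # Dirichlet prior of the topic distribution (ALPHA)
    topic_word_prior=0.01,   # Dirichlet prior of the keyword distribution (ETA)
    max_iter=50,             # stands in for ITERLDA
    random_state=0,
)
doc_topic = lda.fit_transform(counts)       # per-person topic distribution
dominant_topic = doc_topic.argmax(axis=1)   # one topic per person, LDATOPICPERSON-style
print(dominant_topic)
```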
K-means clustering algorithm:
K-means clustering originated as a vector quantization method in signal processing and is now popular in the field of data mining. The purpose of k-means clustering is to partition n points (each of which may be one observation or one instance of a sample) into k clusters, so that each point belongs to the cluster whose mean (i.e. cluster centre) is nearest to it; this serves as the clustering criterion. The problem reduces to partitioning the data space into Voronoi cells. It is computationally difficult (NP-hard), but efficient heuristic algorithms exist. In practice, relatively efficient heuristic algorithms are used, which converge quickly to a local optimum. These algorithms are generally similar to the expectation-maximization (EM) algorithm for Gaussian mixture distributions in that both refine the solution iteratively and both model the data with cluster centres; however, k-means tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.
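The following is a minimal k-means sketch, assuming scikit-learn's KMeans in place of the "K-means clustering tool" referred to later; the feature matrix is a made-up stand-in for person-label-frequency data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: persons; columns: weights (FREQ) of web-content attributes (LABEL).
X = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
    [0.0, 6.0, 2.0],
    [1.0, 5.0, 3.0],
])

kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # person -> classification label (KMEANSPERSONCATEGORY)
print(labels)
```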
Simulated annealing:
The earliest idea of simulated annealing (SA) was proposed by N. Metropolis et al. in 1953. In 1983, S. Kirkpatrick et al. successfully introduced the annealing idea into the field of combinatorial optimization. It is a stochastic optimization algorithm based on a Monte-Carlo iterative solution strategy, motivated by the similarity between the annealing of solid matter in physics and general combinatorial optimization problems. Simulated annealing starts from a high initial temperature and, as the temperature parameter keeps decreasing, randomly searches the solution space for the global optimum of the objective function with a probabilistic jump characteristic; that is, it can probabilistically jump out of local optima and eventually tend toward the global optimum. Simulated annealing is a general-purpose optimization algorithm that in theory has probabilistic global optimization performance, and it is widely applied in engineering, for example in VLSI design, production scheduling, control engineering, machine learning, neural networks and signal processing. By giving the search process a time-varying jump probability that eventually goes to zero, simulated annealing is a serially structured optimization algorithm that can effectively avoid getting trapped in local minima and eventually tends toward the global optimum.
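For reference, a textbook simulated-annealing loop might look like the sketch below; this illustrates the generic algorithm described above rather than the patent's specific step-A procedure, and the cost, neighbour and cooling settings are placeholders.

```python
import math
import random

def simulated_annealing(initial, cost, neighbour, t=1.0, cool=0.95, t_min=1e-3):
    """Textbook maximising SA: always accept improvements, accept worse
    moves with probability exp(delta / t), and cool geometrically."""
    current, current_cost = initial, cost(initial)
    while t > t_min:
        candidate = neighbour(current)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta > 0 or random.random() < math.exp(delta / t):
            current, current_cost = candidate, candidate_cost
        t *= cool
    return current, current_cost

# Toy usage: maximise -(x - 3)^2 over the integers with +/-1 moves.
best, value = simulated_annealing(
    0, lambda x: -(x - 3) ** 2, lambda x: x + random.choice([-1, 1]))
```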
Heuristic search:
Two basic goals of computer science are to find algorithms with provably good running efficiency and algorithms that obtain optimal or near-optimal solutions. A heuristic algorithm attempts to provide one or both of these at once. For example, it can often find a very good solution, but there is no proof that it will never produce a worse one; it usually finds an answer in a reasonable time, but there is no guarantee that it always will. In some special cases a heuristic algorithm may produce a very poor answer or run very inefficiently, yet the data that cause those special cases may never occur in the real world. Heuristic algorithms are therefore commonly used to solve real-world problems and can usually obtain a good answer within a reasonable time for many practical problems. Common heuristic algorithms include ant colony algorithms, genetic algorithms and simulated annealing.
Summary of the invention
In order to help administrators understand and improve the online habits of the persons they manage, reduce the adverse effects of internet use, and mine general connections between the managed persons' internet data and their interests, the present invention comprehensively analyzes the managed persons' internet records and, using the LDA-model-based document clustering algorithm and the K-means clustering algorithm, designs and implements a network behavior habit clustering method based on K-means and LDA bi-directional verification, providing a system model with good reference value for the analysis and management of managed persons' internet behavior.
To facilitate understanding of the present invention, its theoretical basis and the difference between this theory and the traditional theory are described as follows:
In traditional clustering methods, cluster analysis is usually carried out on the original data in a single way, and the result is then verified by manual analysis. On the basis of conventional methods, the present invention creatively uses two clustering methods, verifies the accuracy of the clustering algorithms through a custom verification method, and uses simulated annealing to improve the efficiency of optimizing the clustering result.
The technical solution of the invention is: using the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm, first perform K-means clustering on the person-label-frequency set and generate the LDA document topic extraction model from the browsing-record-person-keyword set, and store the intermediate results; then use the annealing algorithm to carry out bi-directional verification between K-means and LDA, compute the globally best topic-classification-label sequence, and on this basis optimize the result of network behavior habit clustering. The solution comprises simulated annealing main flow steps A and cost function flow steps B:
The simulated annealing main flow comprises steps A1 to A26:
Step A1: Let the person-label-frequency set be PERSONLABELFREQ = {(PERSON_p1, LABEL_p1, FREQ_p1), (PERSON_p2, LABEL_p2, FREQ_p2), ..., (PERSON_pa, LABEL_pa, FREQ_pa)}, where PERSON_p1, PERSON_p2, ..., PERSON_pa are person unique identifiers, LABEL_p1, LABEL_p2, ..., LABEL_pa are attributes of the web content a person browsed (one person identifier may correspond to multiple attributes), and FREQ_p1, FREQ_p2, ..., FREQ_pa are the weights of those attributes. Let the browsing-record-person-keyword set be RECORDIDPERSONKEYWORD = {(RECORDID_r1, PERSON_r1, KEYWORD_r1), (RECORDID_r2, PERSON_r2, KEYWORD_r2), ..., (RECORDID_ra, PERSON_ra, KEYWORD_ra)}, where RECORDID_r1, RECORDID_r2, ..., RECORDID_ra are record unique identifiers, each formed from a person identifier and a browsing date, PERSON_r1, PERSON_r2, ..., PERSON_ra are person unique identifiers, and KEYWORD_r1, KEYWORD_r2, ..., KEYWORD_ra are keywords contained in the browsed content. Let ALPHA be the Dirichlet parameter of the topic distribution, ETA the Dirichlet parameter of the keyword distribution, ITERLDA the number of iterations of the LDA document topic extraction model, ITERKMEANS the number of iterations of the K-means clustering algorithm, CATEGORYNUM the total number of classification labels, TOPICNUM the total number of topics of the LDA document topic extraction model, T the simulated annealing temperature, STEP the simulated annealing change step, and COOL the simulated annealing cooling parameter;
Step A2: Let the result set of the K-means clustering algorithm be the person-classification-label set KMEANSPERSONCATEGORY, where classification labels are represented by natural integers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-person set of the LDA model be LDATOPICPERSON, where topics are represented by natural integers; let FACTOR be the globally best topic-classification-label sequence, EA the global maximum matching count, vecb the current topic-classification-label sequence, eb the current matching count, index the current subscript into the topic-classification-label sequence, and curstep the current simulated annealing step;
Step A3: Call the K-means clustering tool, passing in the classification label total CATEGORYNUM, the K-means iteration count ITERKMEANS and the person-label-frequency set PERSONLABELFREQ of step A1; the result is the person-classification-label set of step A2, i.e. KMEANSPERSONCATEGORY = {(PERSON_1, CATEGORY_c1), (PERSON_2, CATEGORY_c2), ..., (PERSON_a, CATEGORY_ca)};
Step A4: Call the LDA modeling tool, passing in the topic total TOPICNUM of the LDA document topic extraction model, the Dirichlet parameter ALPHA of the topic distribution, the Dirichlet parameter ETA of the keyword distribution, the LDA iteration count ITERLDA and the browsing-record-person-keyword set RECORDIDPERSONKEYWORD; the results are the topic-keyword set LDATOPICWORD = {(TOPIC_t1, KEYWORD_1), (TOPIC_t2, KEYWORD_2), ..., (TOPIC_tb, KEYWORD_b)} and the topic-person set LDATOPICPERSON = {(TOPIC_t1, PERSON_p1), (TOPIC_t2, PERSON_p2), ..., (TOPIC_tc, PERSON_pc)} of step A2;
Step A5: Initialize the globally best topic-classification-label sequence FACTOR with random numbers between 0 and CATEGORYNUM-1; the sequence length is the topic total TOPICNUM of the LDA document topic extraction model and each element lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the classification label total; initialize the global maximum matching count EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, ..., FACTOR_TOPICNUM}, EA = 0;
Step A6: While the simulated annealing temperature T of step A1 is greater than 0.1, execute steps A7 to A25; otherwise execute step A26;
Step A7: Assign a random number between 0 and TOPICNUM-1 to the current subscript index of step A2, where TOPICNUM is the topic total of the LDA document topic extraction model of step A1;
Step A8: Assign a random number between -1 × STEP and STEP to the current simulated annealing step curstep of step A2, where STEP is the simulated annealing change step of step A1;
Step A9: Set the current topic-classification-label sequence vecb of step A2 equal to the globally best topic-classification-label sequence FACTOR of step A2, i.e. vecb = FACTOR;
Step A10: Change the value at position index of the current topic-classification-label sequence vecb of step A2 by adding curstep to vecb_index, where index is the current subscript of step A2 and curstep is the current simulated annealing step of step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: If the value at position index of the current topic-classification-label sequence vecb of step A2 is less than 0, i.e. vecb_index < 0, execute step A12; otherwise execute step A13;
Step A12: Set the value at position index of the current topic-classification-label sequence vecb of step A2 to 0, i.e. vecb_index = 0; go to step A15;
Step A13: If the value at position index of the current topic-classification-label sequence vecb of step A2 is greater than CATEGORYNUM-1, where CATEGORYNUM is the classification label total of step A1, i.e. vecb_index > CATEGORYNUM-1, execute step A14; otherwise execute step A15;
Step A14: Set the value at position index of the current topic-classification-label sequence vecb of step A2 to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the classification label total of step A1;
Step A15: Obtain the globally best topic-classification-label sequence FACTOR of step A2;
Step A16: Execute step B;
Step A17: Assign the result of step B to the global maximum matching count EA of step A2;
Step A18: Obtain the current topic-classification-label sequence vecb of step A2;
Step A19: Execute step B;
Step A20: Assign the result of step B to the current matching count eb of step A2;
Step A21: If the current matching count eb of step A2 is greater than the global maximum matching count EA of step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: Generate a random number random with value between 0 and 1;
Step A23: If the random number random of step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2, execute step A24; otherwise execute step A25;
Step A24: Set the globally best topic-classification-label sequence FACTOR of step A2 to the current topic-classification-label sequence vecb of step A2, and set the global maximum matching count EA of step A2 to the current matching count eb of step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: Lower the simulated annealing temperature T of step A1 using the simulated annealing cooling parameter COOL of step A1, i.e. T = T × COOL; execute step A6;
Step A26: Return the globally best topic-classification-label sequence of step A2, i.e. FACTOR = {FACTOR_1, FACTOR_2, ..., FACTOR_TOPICNUM}, and the global maximum matching count EA of step A2;
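The sketch below restates steps A5 to A26 as Python, assuming a cost(sequence) function that implements the step-B flow (for instance the step-B sketch given after steps B below, with its data arguments fixed, e.g. via functools.partial); the temperature, step and cooling values are placeholders rather than values fixed by the method, and an integer perturbation step is assumed.

```python
import math
import random

TOPICNUM, CATEGORYNUM = 20, 3        # values quoted later in the description
T0, STEP, COOL = 100.0, 2, 0.98      # illustrative annealing settings

def anneal(cost):
    # A5: random initial topic -> classification-label sequence, EA = 0
    factor = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]
    ea, t = 0, T0
    while t > 0.1:                                        # A6
        index = random.randint(0, TOPICNUM - 1)           # A7
        curstep = random.randint(-STEP, STEP)             # A8 (integer step assumed)
        vecb = list(factor)                               # A9
        vecb[index] += curstep                            # A10
        vecb[index] = max(0, min(CATEGORYNUM - 1, vecb[index]))  # A11-A14 clamp
        ea = cost(factor)                                 # A15-A17
        eb = cost(vecb)                                   # A18-A20
        # A21-A24: the steps accept the candidate only when eb > ea and a
        # uniform random number is below exp((eb - ea) / t); that factor is
        # >= 1 whenever eb > ea, so in effect improvements are always kept
        if eb > ea and random.random() < math.exp(min((eb - ea) / t, 50.0)):
            factor, ea = vecb, eb
        t *= COOL                                         # A25
    return factor, ea                                     # A26
```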
The cost function flow comprises steps B1 to B15:
Step B1: Obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: Let the person-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the person unique identifier set be LDAPERSON, the total number of persons be LDAPERSONNUM, the set of all classification labels be CATEGORY, the matching count be SUM, the topic set of a single person be singlepersontopic, the person set of the current class in the LDA document topic extraction model be ldacurcategoryperson, the person set of the current class in the K-means clustering algorithm be kmeanscurcategoryperson, the set of overlapping persons be unionperson, and the number of overlapping persons be unionpersonnum;
Step B3: From the topic-person set LDATOPICPERSON of the LDA model of step A2, extract the person unique identifier set LDAPERSON of step B2 and remove duplicates, i.e. LDAPERSON = Π_2(LDATOPICPERSON) = {PERSON_p1, PERSON_p2, ..., PERSON_pd};
Step B4: From the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2, extract the set CATEGORY of all classification labels of step B2 and remove duplicates, i.e. CATEGORY = Π_2(KMEANSPERSONCATEGORY) = {CATEGORY_c1, CATEGORY_c2, ..., CATEGORY_cd};
Step B5: Set a loop variable i, i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2;
Step B6: From the topic-person set LDATOPICPERSON of the LDA model of step A2, extract the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPIC_t1, TOPIC_t2, ..., TOPIC_tc}, where singlepersontopic comes from step B2;
Step B7: For each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR of step B1, i.e. category_t1 = TMPFACTOR_TOPICt1, category_t2 = TMPFACTOR_TOPICt2, ..., category_tc = TMPFACTOR_TOPICtc, where category_t1, category_t2, ..., category_tc are classification labels and different variables may represent the same label, and singlepersontopic comes from step B2; count the number of occurrences of each classification label, denoted categorysnum_1, categorysnum_2, ..., categorysnum_CATEGORYNUM, and find the classification label category with the largest number of occurrences; update the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: If the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2, execute step B9; otherwise add 1 to i, i.e. i = i + 1, and execute steps B6 to B7;
Step B9: Set a loop variable j, j < CATEGORYNUM, where CATEGORYNUM is the classification label total of step A1, and set the matching count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: From the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2, extract the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSON_kmeans1, PERSON_kmeans2, ..., PERSON_kmeansc}, where kmeanscurcategoryperson comes from step B2;
Step B11: From the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, extract the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSON_lda1, PERSON_lda2, ..., PERSON_ldac}, where ldacurcategoryperson comes from step B2;
Step B12: Compute the intersection unionperson of the person set ldacurcategoryperson of step B2 and the person set kmeanscurcategoryperson of step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSON_union1, PERSON_union2, ..., PERSON_unionc};
Step B13: Count the elements of the overlapping person set unionperson of step B2, assign the count to the number of overlapping persons unionpersonnum of step B2, and add it to the matching count SUM of step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: If the loop variable j is greater than the classification label total CATEGORYNUM of step A1, execute step B15; otherwise add 1 to j, i.e. j = j + 1, and execute steps B10 to B13;
Step B15: Return the matching count SUM of step B2.
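A compact Python rendering of the step-B cost function is given below; the argument formats (lists of (topic, person) and (person, label) pairs) are illustrative stand-ins for the sets LDATOPICPERSON and KMEANSPERSONCATEGORY named in steps A2 and B2.

```python
from collections import Counter, defaultdict

def cost(tmpfactor, lda_topic_person, kmeans_person_category, categorynum):
    """Matching count SUM between the K-means labels and the labels induced
    on the LDA result by the topic -> label sequence tmpfactor (steps B1-B15)."""
    # B5-B8: give every person the label most of its LDA topics map to
    topics_by_person = defaultdict(list)              # singlepersontopic per person
    for topic, person in lda_topic_person:            # LDATOPICPERSON pairs
        topics_by_person[person].append(topic)
    lda_person_category = {                           # LDAPERSONCATEGORY
        person: Counter(tmpfactor[t] for t in topics).most_common(1)[0][0]
        for person, topics in topics_by_person.items()
    }
    # B9-B14: for each label, count persons placed there by both methods
    total = 0                                         # SUM
    for j in range(categorynum):
        kmeans_members = {p for p, c in kmeans_person_category if c == j}
        lda_members = {p for p, c in lda_person_category.items() if c == j}
        total += len(kmeans_members & lda_members)    # unionpersonnum
    return total                                      # B15

# Example: cost([0, 2, 1], [(0, "P1"), (1, "P2")], [("P1", 0), ("P2", 1)], 3) == 1
```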
Here, clustering means that persons' internet records are cluster-analyzed with both the K-means clustering algorithm and the LDA model, the two clustering results are mutually verified, and simulated annealing is used to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Steps A3 to A4 provide the initial data needed by simulated annealing; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR of step B1 with the topic-person set LDATOPICPERSON of the LDA model of step A2, producing the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2; steps B9 to B14 compare, for each class, the persons that appear in that class both in the K-means clustering result and in the LDA model result, accumulate the number of such persons over all classes, and finally return it as the cost of the current sequence; steps A21 to A24 compare eb with EA and compare the random number random of step A22 with e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2; when eb > EA and random < e^((eb-EA)/T), the globally best topic-classification-label sequence FACTOR and the global maximum matching count EA are updated, the cost values eb and EA being obtained through step B above. The final result returns the global maximum matching count EA and the globally best topic-classification-label sequence FACTOR.
In steps A3 and A4, the K-means iteration count ITERKMEANS is 300, the classification label total CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the LDA iteration count ITERLDA is 2000, and the topic total TOPICNUM of the LDA document topic extraction model of step A1 is 20.
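Expressed against the scikit-learn stand-ins used in the earlier sketches (a library choice assumed here, not stated in the patent), these values correspond to the following keyword arguments.

```python
# The parameter values above, as keyword arguments for the scikit-learn
# stand-ins used in the earlier sketches (the library choice is an assumption).
KMEANS_PARAMS = dict(n_clusters=3, max_iter=300)   # CATEGORYNUM, ITERKMEANS
LDA_PARAMS = dict(
    n_components=20,         # TOPICNUM
    doc_topic_prior=0.1,     # ALPHA
    topic_word_prior=0.01,   # ETA
    max_iter=2000,           # ITERLDA
)
```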
The invention uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm: it first performs K-means clustering on the person-label-frequency set and generates the LDA document topic extraction model from the browsing-record-person-keyword set, stores the intermediate results, and then uses the annealing algorithm to carry out bi-directional verification between K-means and LDA and to compute the globally best topic-classification-label sequence, on the basis of which the network behavior habit clustering result is optimized. The K-means and LDA bi-directional verification improves sensitivity to person classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Brief description of the drawings
Figure 1 shows the simulated annealing main flow.
Figure 2 shows the cost function flow.
Specific embodiment
The technical solution of the present invention is described in detail below with reference to the accompanying drawings:
As shown in Figure 1, the simulated annealing main flow comprises steps A1 to A26, carried out exactly as set out in steps A1 to A26 of the Summary of the Invention above.
As shown in Figure 2, the cost function flow comprises steps B1 to B15, carried out exactly as set out in steps B1 to B15 of the Summary of the Invention above.
The roles of the individual steps and the concrete parameter values used in this embodiment are likewise as described in the Summary of the Invention above.
The present invention can be combined with a computer system to perform person network behavior habit clustering automatically.

Claims (4)

1. A network behavior habit clustering method based on K-means and Latent Dirichlet Allocation (LDA) bi-directional verification, characterized in that it uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm: first, K-means clustering is performed on the person-label-frequency set and the LDA document topic extraction model is generated from the browsing-record-person-keyword set, and the intermediate results are stored; the annealing algorithm then carries out bi-directional verification between K-means and LDA and computes the globally best topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized; the method comprises simulated annealing main flow steps A and cost function flow steps B:
Simulated annealing main flow step A1 to step A26:
Step A1: personnel-label-frequency set is set as PERSONLABELFREQ={ (PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa), wherein PERSONp1, PERSONp2, …, PERSONpaRepresent personnel's unique identification, LABELp1, LABELp2, …, LABELpaGeneration Table personnel surf the web the integrity attribute of content, and personnel's unique identification can correspond to multiple attributes, FREQp1, FREQp2, …, FREQpaThe personnel of representative surf the web content integrity attribute weight, record-personnel-keyword if personnel surf the web Integrate as RECORDIDPERSONKEYWORD={ (RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra), wherein RECORDIDr1, RECORDIDr2, …, RECORDIDraPersonnel record's unique identification is represented, is made of personnel's unique identification and online date, PERSONr1, PERSONr2, …, PERSONraRepresent personnel's unique identification, KEYWORDr1, KEYWORDr2, …, KEYWORDraThe personnel of representative surf the web the keyword that content includes, if the Di Li Cray parameter of theme distribution is ALPHA, if closing The Di Li Cray parameter of keyword distribution is ETA, if it is ITERLDA that LDA document subject matter, which extracts model the number of iterations, if K-means Clustering algorithm the number of iterations is ITERKMEANS, if tag along sort sum is CATEGORYNUM, if LDA document subject matter extracts mould The theme sum of type is TOPICNUM, if simulated annealing temperature is T, if it is STEP that simulated annealing, which changes step-length, if Simulated annealing cooling parameter is COOL;
Step A2: the result set of K-means clustering algorithm is set as personnel-tag along sort collection, that is, be expressed as KMEANSPERSONCATEGORY, wherein tag along sort is indicated by natural integer;If theme-keyword set of LDA model is LDATOPICWORD, wherein theme is indicated by natural integer;If theme-personnel of LDA model integrate as LDATOPICPERSON, Wherein, theme is indicated by natural integer;If global best theme-tag along sort sequence is FACTOR, if global maximum matching number For EA, if current topic-tag along sort sequence is vecb, if current matching number is eb, if theme-tag along sort sequence is when front lower It is designated as index, if simulated annealing current step is curstep;
Step A3: calling K-means clustering algorithm tool, is passed to tag along sort sum CATEGORYNUM, the step A1 of step A1 K-means clustering algorithm the number of iterations ITERKMEANS and step A1 personnel-label-frequency set PERSONLABELFREQ, obtaining K-means clustering algorithm result set is personnel-tag along sort collection, that is, is obtained KMEANSPERSONCATEGORY={(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca), wherein KMEANSPERSONCATEGORY comes from step A2;
Step A4: calling LDA modeling tool, is passed to theme sum TOPICNUM, theme distribution that LDA document subject matter extracts model Di Li Cray parameter ALPHA, keyword distribution Di Li Cray parameter ETA, LDA document subject matter extract model the number of iterations ITERLDA and personnel surf the web record-personnel-keyword set RECORDIDPERSONKEYWORD, obtain LDA model Theme-keyword set, that is, obtain LDATOPICWORD={ (TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb) and LDA model theme-personnel collection, that is, LDATOPICPERSON={ (TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc), wherein LDATOPICWORD and LDATOPICPERSON comes from step A2;
Step A5: with 0 to the global best theme-tag along sort sequence of random number initialization between CATEGORYNUM-1 FACTOR, sequence length are the theme sum TOPICNUM that LDA document subject matter extracts model, the range of each element in sequence 0 between CATEGORYNUM-1, wherein CATEGORYNUM is tag along sort sum, initializes global maximum matching number EA It is 0, that is, FACTOR={ FACTOR1, FACTOR2, …, FACTORTOPICNUM, EA=0;
Step A6: when the simulated annealing temperature T in step A1 is greater than 0.1, A7 is thened follow the steps to step A25;Otherwise Execute step A26;
Step A7: to the theme in step A2-tag along sort sequence when presubscript index is with random number assignment, the model of random number It is trapped among between 0 and TOPICNUM-1, wherein TOPICNUM is the theme sum of the LDA document subject matter extraction model in step A1;
Step A8: being curstep with random number assignment, the range of random number to the simulated annealing current step in step A2 Between -1 × STEP and STEP, wherein STEP is that the simulated annealing in step A1 changes step-length;
Step A9: current topic-tag along sort sequence vecb in step A2 is enabled to be equal to the best theme-point of the overall situation in step A2 Class sequence label FACTOR, that is, vecb=FACTOR;
Step A10: changing the step current topic-tag along sort sequence vecb in A2, and the numerical value on the i-th position ndex enables vecbindexIn addition curstep, wherein index is that theme-tag along sort sequence of step A2 weight works as presubscript, and curstep is Simulated annealing current step in step A2, that is, vecbindex= vecbindex+curstep;
Step A11: when numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is less than 0 When, that is, vecbindex< 0, then follow the steps A12;It is no to then follow the steps A13;
Step A12: enabling numerical value of the current topic-tag along sort sequence vecb on the i-th position ndex in step A2 be equal to 0, That is, vecbindex=0;Go to step A15;
Step A13: when numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is greater than When CATEGORYNUM-1, wherein CATEGORYNUM is the tag along sort sum in step A1, that is, vecbindex> CATEGORYNUM-1 thens follow the steps A14;It is no to then follow the steps A15;
Step A14: numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is enabled to be equal to CATEGORYNUM-1, that is, vecbindex=CATEGORYNUM-1, wherein CATEGORYNUM is the tag along sort in step A1 Sum;
Step A15: take the global best topic-classification-label sequence FACTOR from step A2;
Step A16: execute step B with FACTOR as the input sequence;
Step A17: assign the result of step B to the global maximum matching count EA in step A2;
Step A18: take the current topic-classification-label sequence vecb from step A2;
Step A19: execute step B with vecb as the input sequence;
Step A20: assign the result of step B to the current matching count eb in step A2;
Step A21: when the current matching count eb in step A2 is greater than the global maximum matching count EA in step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: generate a random number random, whose value ranges from 0 to 1;
Step A23: when the random number random from step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current matching count in step A2 and EA is the global maximum matching count in step A2, execute step A24; otherwise execute step A25;
Step A24: set the global best topic-classification-label sequence FACTOR in step A2 to the current topic-classification-label sequence vecb in step A2, and set the global maximum matching count EA in step A2 to the current matching count eb in step A2, i.e. FACTOR=vecb, EA=eb;
Step A25: reduce the simulated annealing temperature T in step A1 using the simulated annealing cooling coefficient COOL in step A1, i.e. T = T × COOL; execute step A6;
Step A26: return the global best topic-classification-label sequence from step A2, i.e. FACTOR={FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, and return the global maximum matching count EA from step A2;
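The whole annealing loop of steps A6 to A26 can be sketched as below, with the acceptance rule kept exactly as written in the claim (a candidate is adopted only when eb > EA and random < e^((eb-EA)/T)); cost() stands for the step-B procedure that follows and is passed in as a function.

```python
# Sketch of steps A6-A26; cost(sequence) is assumed to return the step-B matching count.
import math
import random

def anneal(factor, ea, category_num, step, t, cool, cost):
    while t > 0.1:                                                  # step A6
        index = random.randint(0, len(factor) - 1)                  # step A7
        curstep = random.randint(-step, step)                       # step A8
        vecb = list(factor)                                         # step A9
        vecb[index] = max(0, min(vecb[index] + curstep,
                                 category_num - 1))                 # steps A10-A14
        ea = cost(factor)                                           # steps A15-A17
        eb = cost(vecb)                                             # steps A18-A20
        if eb > ea and random.random() < math.exp((eb - ea) / t):   # steps A21-A23
            factor, ea = vecb, eb                                   # step A24
        t *= cool                                                   # step A25
    return factor, ea                                               # step A26
```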
The cost function process flow, step B, runs from step B1 to step B15:
Step B1: obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: let LDAPERSONCATEGORY be the personnel-classification-label set of the LDA document topic extraction model, LDAPERSON the set of unique personnel identifiers, LDAPERSONNUM the total number of personnel, CATEGORY the set of all classification labels, SUM the matching count, singlepersontopic the topic set of a single person, ldacurcategoryperson the personnel set of the current class in the LDA document topic extraction model, kmeanscurcategoryperson the personnel set of the current class in the K-means clustering result, unionperson the set of overlapping personnel, and unionpersonnum the number of overlapping personnel;
Step B3: filter the personnel unique-identifier set LDAPERSON of step B2 out of the LDA model topic-personnel set LDATOPICPERSON from step A2 and de-duplicate the result, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: filter the set of all classification labels CATEGORY of step B2 out of the personnel-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2 and de-duplicate the result, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
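Steps B3 and B4 are projections onto the second component of each pair followed by de-duplication; a sketch:

```python
# Sketch of steps B3-B4: the projection written as Pi_2 in the claim.
def project_unique(pairs, position=1):
    return {pair[position] for pair in pairs}

# LDAPERSON = project_unique(LDATOPICPERSON)        # persons from (topic, person) pairs
# CATEGORY  = project_unique(KMEANSPERSONCATEGORY)  # labels from (person, label) pairs
```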
Step B5: set the loop variable i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel in step B2;
Step B6: filter out of the LDA model topic-personnel set LDATOPICPERSON from step A2 the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR from step B1, i.e. categoryt1 = TMPFACTOR_TOPICt1, categoryt2 = TMPFACTOR_TOPICt2, …, categorytc = TMPFACTOR_TOPICtc, where categoryt1, categoryt2, …, categorytc denote classification labels, different variables may denote the same classification label, and singlepersontopic is defined in step B2; count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the classification label category with the largest occurrence count; update the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: when the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel in step B2, execute step B9; otherwise increment i by 1, i.e. i=i+1, and execute steps B6 to B7;
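Steps B5 to B8 give each person the classification label that occurs most often among that person's topics once they are mapped through TMPFACTOR; a sketch, assuming topics are integer indices into TMPFACTOR:

```python
# Sketch of steps B5-B8: per-person majority label via the topic-to-label sequence.
from collections import Counter, defaultdict

def lda_person_category(lda_topic_person, tmpfactor):
    person_topics = defaultdict(list)
    for topic, person in lda_topic_person:               # group topics by person (step B6)
        person_topics[person].append(topic)
    assignment = set()
    for person, topics in person_topics.items():
        counts = Counter(tmpfactor[t] for t in topics)   # label occurrence counts (step B7)
        category = counts.most_common(1)[0][0]           # most frequent label
        assignment.add((person, category))
    return assignment                                    # LDAPERSONCATEGORY
```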
Step B9: set the loop variable j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of classification labels in step A1, and set the matching count SUM in step B2 to 0, i.e. SUM=0;
Step B10: filter out of the personnel-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2 the personnel set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: filter out of the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2 the personnel set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the personnel set ldacurcategoryperson in step B2 and the personnel set kmeanscurcategoryperson in step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: count the number of personnel in the overlap set unionperson of step B2, assign it to the overlap count unionpersonnum in step B2, and add it to the matching count SUM in step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: when the loop variable j is greater than the total number of classification labels CATEGORYNUM in step A1, execute step B15; otherwise increment j by 1, i.e. j=j+1, and execute steps B10 to B13;
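Steps B9 to B14 count, label by label, the personnel placed in the same class by both K-means and the LDA-derived assignment, and accumulate the overlap; a sketch:

```python
# Sketch of steps B9-B15: accumulate per-label overlaps into the matching count SUM.
def matching_count(kmeans_person_category, lda_person_category_set, category_num):
    total = 0                                                               # SUM (step B9)
    for j in range(category_num):
        kmeans_members = {p for p, c in kmeans_person_category if c == j}   # step B10
        lda_members = {p for p, c in lda_person_category_set if c == j}     # step B11
        total += len(kmeans_members & lda_members)                          # steps B12-B13
    return total                                                            # step B15
```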
Step B15: return the matching count SUM from step B2.
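Putting the sketches above together, the step-B matching count is the cost that the step-A search maximizes; the upper-case names below reuse the claim's set and parameter names as if they were Python variables holding the corresponding data, which is an illustrative convenience only.

```python
# Illustrative glue: cost() evaluates one candidate sequence via the step-B sketches.
def cost(tmpfactor):
    lda_assign = lda_person_category(LDATOPICPERSON, tmpfactor)           # steps B5-B8
    return matching_count(KMEANSPERSONCATEGORY, lda_assign, CATEGORYNUM)  # steps B9-B15

# FACTOR, EA = init_state(TOPICNUM, CATEGORYNUM)                      # step A5
# FACTOR, EA = anneal(FACTOR, EA, CATEGORYNUM, STEP, T, COOL, cost)   # steps A6-A26
```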
2. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the clustering performs cluster analysis of personnel web-browsing records with the K-means clustering algorithm and the LDA model, the two clustering results are mutually verified, and the simulated annealing algorithm is used to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
3. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 modify the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the LDA model topic-personnel set LDATOPICPERSON from step A2, producing the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 compare, for each class, the number of personnel appearing in both the K-means clustering result and the LDA model result for the same class, accumulate this number over all classes, and finally return it in the procedure as the cost of the current sequence; steps A14 to A18 compare eb with EA and compare the random number random of step A15 with e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2; when eb > EA and random < e^((eb-EA)/T), the value of the global best topic-classification-label sequence FACTOR and the value of the global maximum matching count EA are updated, the cost values eb and EA being obtained through step B above; the final result returns the global maximum matching count EA and the global best topic-classification-label sequence FACTOR.
4. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, wherein the K-means clustering iteration count ITERKMEANS in step A3 is 300, the total number of classification labels CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution in step A4 is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the iteration count ITERLDA of the LDA document topic extraction model is 2000, and the total number of topics TOPICNUM of the LDA document topic extraction model in step A1 is 20.
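For reference, the parameter values recited in this claim, collected into one configuration; the dict itself is only an illustrative convenience, not part of the claimed method.

```python
# Parameter values from claim 4 (names mirror the claim).
DEFAULT_PARAMS = {
    "ITERKMEANS": 300,   # K-means iteration count (step A3)
    "CATEGORYNUM": 3,    # total number of classification labels
    "ALPHA": 0.1,        # Dirichlet parameter of the topic distribution (step A4)
    "ETA": 0.01,         # Dirichlet parameter of the keyword distribution
    "ITERLDA": 2000,     # LDA iteration count
    "TOPICNUM": 20,      # total number of topics (step A1)
}
```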
CN201610565749.XA 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification Active CN106202480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Publications (2)

Publication Number Publication Date
CN106202480A CN106202480A (en) 2016-12-07
CN106202480B true CN106202480B (en) 2019-06-11

Family

ID=57493136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610565749.XA Active CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Country Status (1)

Country Link
CN (1) CN106202480B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN107305614B (en) * 2017-08-12 2020-05-26 西安电子科技大学 Method for processing big data based on MLDM algorithm meeting secondary aggregation
CN108460630B (en) * 2018-02-12 2021-11-02 广州虎牙信息科技有限公司 Method and device for carrying out classification analysis based on user data
CN110276503B (en) * 2018-03-14 2023-04-21 吉旗物联科技(上海)有限公司 Method for automatically identifying cold chain vehicle task
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN112800419A (en) * 2019-11-13 2021-05-14 北京数安鑫云信息技术有限公司 Method, apparatus, medium and device for identifying IP group
CN112214515A (en) * 2020-10-16 2021-01-12 平安国际智慧城市科技股份有限公司 Data automatic matching method and device, electronic equipment and storage medium
CN112883154B (en) * 2021-01-28 2022-02-01 平安科技(深圳)有限公司 Text topic mining method and device, computer equipment and storage medium
CN113204641B (en) * 2021-04-12 2022-09-02 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113312450B (en) * 2021-05-28 2022-05-31 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN114742869B (en) * 2022-06-15 2022-08-16 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090290508A1 (en) * 2008-05-22 2009-11-26 At&T Labs, Inc. Method for optimizing network "Point of Presence" locations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194012A (en) * 2011-06-17 2011-09-21 清华大学 Microblog topic detecting method and system
CN102609719A (en) * 2012-01-19 2012-07-25 北京工业大学 Method for identifying place image on the basis of improved probabilistic topic model
CN103678500A (en) * 2013-11-18 2014-03-26 南京邮电大学 Data mining improved type K mean value clustering method based on linear discriminant analysis
CN103632166A (en) * 2013-12-04 2014-03-12 西安电子科技大学 Aurora image classification method based on latent theme combining with saliency information
CN103793501A (en) * 2014-01-20 2014-05-14 惠州学院 Theme community discovery method based on social network
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104750856A (en) * 2015-04-16 2015-07-01 天天艾米(北京)网络科技有限公司 System and method for multi-dimensional synergic recommendation
CN105303199A (en) * 2015-12-08 2016-02-03 南京信息工程大学 Data fragment type identification method based on content characteristics and K-means
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Tropical wood species recognition system based on multi-feature extractors and classifiers";Marzuki Khalid et al.;《2011 2nd International Conference on Instrumentation Control and Automation》;20120119;全文
"基于隐含狄利克雷分配模型的图像分类算法";杨赛 等;《计算机工程》;20120731;第38卷(第14期);全文
"基于隐含狄利克雷分配的微博推荐模型研究";唐晓波 等;《情报科学》;20150228;第33卷(第2期);全文

Also Published As

Publication number Publication date
CN106202480A (en) 2016-12-07

Similar Documents

Publication Publication Date Title
CN106202480B (en) A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Zhang et al. A multi-objective evolutionary approach for mining frequent and high utility itemsets
CN105022754A (en) Social network based object classification method and apparatus
CN110750645A (en) Cross-domain false comment identification method based on countermeasure training
Sharma et al. Trend analysis in machine learning research using text mining
Guo et al. Multi-label classification methods for green computing and application for mobile medical recommendations
Gan et al. R-RNN: Extracting user recent behavior sequence for click-through rate prediction
Yu et al. Data cleaning for personal credit scoring by utilizing social media data: An empirical study
Liu et al. A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge
Naghavipour et al. Hybrid metaheuristics for QoS-aware service composition: a systematic mapping study
CN106919997A (en) A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA
Yu et al. Spectrum-enhanced pairwise learning to rank
Zhang et al. Multi-dimension topic mining based on hierarchical semantic graph model
Sharma et al. A study of tree based machine learning techniques for restaurant reviews
Niu et al. Deep adversarial autoencoder recommendation algorithm based on group influence
Xiao A Survey of Document Clustering Techniques & Comparison of LDA and moVMF
Dehghan et al. An improvement in the quality of expert finding in community question answering networks
CN106649380A (en) Hot spot recommendation method and system based on tag
Niham et al. Utilization of Big Data in Libraries by Using Data Mining
Zhang A short introduction to data mining and its applications
Ahn et al. Using genetic algorithms to optimize nearest neighbors for data mining
Xin et al. When factorization meets heterogeneous latent topics: an interpretable cross-site recommendation framework
Singh Sentiment analysis of online mobile reviews
Osial et al. Smartphone recommendation system using web data integration techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant