CN106202480B - Network behavior habit clustering method based on K-means and LDA bi-directional verification - Google Patents

Network behavior habit clustering method based on K-means and LDA bi-directional verification

Info

Publication number
CN106202480B
Authority
CN
China
Prior art keywords
person
classification label
topic
LDA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610565749.XA
Other languages
Chinese (zh)
Other versions
CN106202480A (en)
Inventor
朱全银
辛诚
李翔
许康
潘舒新
孙青怡
周泓
严云洋
胡荣林
冯万利
王留洋
王海云
袁媛
唐海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201610565749.XA priority Critical patent/CN106202480B/en
Publication of CN106202480A publication Critical patent/CN106202480A/en
Application granted granted Critical
Publication of CN106202480B publication Critical patent/CN106202480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/313 - Selection or weighting of terms for indexing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network behavior habit clustering method based on bi-directional verification of K-means and LDA. The method uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm. First, K-means clustering is performed on the person-label-frequency set and the LDA document topic model is generated from the browsing-record-person-keyword set, and the intermediate results are stored. The annealing algorithm then carries out bi-directional verification between K-means and LDA and computes the globally best topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to person classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Description

Network behavior habit clustering method based on K-means and LDA bi-directional verification
Technical field
The invention belongs to the fields of cluster analysis and optimization algorithms, and in particular relates to a network behavior habit clustering method based on bi-directional verification of K-means and Latent Dirichlet Allocation (LDA), which is used to optimize clustering results and thereby improve clustering accuracy, increasing the use value of persons' internet-record information.
Background art
Clustering methods for network behavior habit data play an important role in studying people's online habits. With the continuing spread of the internet, more and more people choose to obtain the information they are interested in through the network. The volume of content that people browse online is huge; analyzing these data manually is not only inefficient but also inaccurate. Cluster analysis, combined with bi-directional verification against a second clustering method, can improve both the efficiency and the accuracy of the analysis. Common clustering algorithms include K-means clustering and the LDA document topic extraction model; common optimization algorithms include simulated annealing and genetic algorithms.
Related papers on clustering algorithms and optimization algorithms include: Pang Feng. The principle of simulated annealing and its application in optimization problems. Master's thesis, Jilin University, 2006; Li Xiangping, Zhang Hongyang. The principle and improvement of simulated annealing. Software Guide, 2008(4): 47-48; Yang Mengduo, Li Fanchang, Zhang Li. Ten years of progress in Lie-group machine learning. Journal of Computers, 2015(7): 1337-1356; Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, Vol.3: 993-1022; Yuan J, Gao F, Ho Q, et al. LightLDA: Big Topic Models on Modest Computer Clusters. Proceedings of the International Conference on World Wide Web, ACM, 2015. The existing research foundation of Zhu Quanyin et al. includes: Li Xiang, Zhu Quanyin. Collaborative filtering recommendation based on joint clustering and rating-matrix sharing. Computer Science and Exploration, 2014, Vol.8(6): 751-759; Suqun Cao, Quanyin Zhu, Zhiwei Hou. Customer Segmentation Based on a Novel Hierarchical Clustering Algorithm. 2009, p:1-5; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Suqun Cao, Zhiwei Hou, Liuyang Wang, Quanyin Zhu. Kernelized Fuzzy Fisher Criterion based Clustering Algorithm. DCABES 2010, p:87-91; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093; Suqun Cao, Gelan Yang, Quanyin Zhu, Haihei Zhai. A novel feature extraction method for mechanical part recognition. Applied Mechanics and Materials, 2011, p:116-121; Pei Zhou, Quanyin Zhu. Multi-factor Matching Method for Basic Information of Science and Technology Experts Based on Web Mining. 2012, p:718-720; Jianping Deng, Fengwen Cao, Quanyin Zhu, Yu Zhang. The Web Data Extracting and Application for Shop Online Based on Commodities Classified. Communications in Computer and Information Science, Vol.234(4): 120-128; Hui Zong, Quanyin Zhu, Ming Sun, Yahong Zhang. The case study for human resource management research based on web mining and semantic analysis. Applied Mechanics and Materials, Vol.488, 2014, p:1336-1339.
Patents applied for, published and granted by Zhu Quanyin et al. include: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and an adaptive sliding window. Chinese patent ZL 2011 1 0423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price forecasting method based on dichotomous data backfilling and disturbance factors. Chinese patent ZL 2011 1 0422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Chen Ting, Cao Suqun. A data preprocessing method for neural-network-based multi-variety commodity price forecasting. Chinese patent ZL 2012 1 0325368.6, 2016.06.08; Zhu Quanyin, Pan Lu, Liu Wenru, Li Xiang, Zhou Hong, Hu Ronglin, Ding Jin, Jin Ying, Shao Wujie, Tang Haibo. An incremental-learning multi-level binary classification method for science and technology news. Chinese patent publication No. CN 105205163A, 2015.12.30; Zhu Quanyin, Yan Yunyang, Huang Taoyi, Ming, Zhang Yuyang, Xin Cheng. An implementation method for campus personalized palm services and user behavior habit analysis. Chinese patent publication No. CN 104731971A, 2015.06.24; Zhu Quanyin, Shen Enqiang, Qian Yaping, et al. A multi-weight adaptive student learning behavior analysis method based on K-means clustering. Chinese patent application No. 201610222553.0, 2016.04.13; Zhu Quanyin, Shao Wujie, Tang Haibo, Zhou Hong, Li Xiang, Hu Ronglin, Jin Ying, Cao Suqun, Pan Shuxin. A multi-level multi-class classification method for science and technology news headlines. Chinese patent publication No. CN 105205163A, 2016.07.13; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN 105654267A, 2016.06.08.
LDA document topic extraction model:
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a word layer, a topic layer and a document layer. "Generation model" means that each word of an article is regarded as being obtained by the process "select a topic with a certain probability, then select a word from that topic with a certain probability". Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. However, the bag-of-words approach does not consider the order between words, which simplifies the complexity of the problem and also provides an opportunity for improving the model. Each document represents a probability distribution over a number of topics, and each topic represents a probability distribution over many words.
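To make the bag-of-words pipeline above concrete, the following is a minimal sketch in Python, assuming scikit-learn's LatentDirichletAllocation as a stand-in for the "LDA modeling tool" the method later calls; the toy person documents, the topic count of 5 and the iteration count are illustrative choices, not values from the patent (the patent's own parameter values are given later in the description).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One "document" per person: the concatenated keywords of that person's
# browsing records (bag of words). The texts below are made up.
person_docs = [
    "news sports football match score football",
    "music video streaming pop concert video",
    "news politics election vote policy news",
]

counts = CountVectorizer().fit_transform(person_docs)   # word-frequency vectors
lda = LatentDirichletAllocation(
    n_components=5,          # number of topics (the patent's TOPICNUM is 20)
    doc_topic_prior=0.1,     # Dirichlet prior of the topic distribution (ALPHA)
    topic_word_prior=0.01,   # Dirichlet prior of the keyword distribution (ETA)
    max_iter=50,             # stands in for ITERLDA
    random_state=0,
)
doc_topic = lda.fit_transform(counts)       # per-person topic distribution
dominant_topic = doc_topic.argmax(axis=1)   # one topic per person, LDATOPICPERSON-style
print(dominant_topic)
```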
K-means clustering algorithm:
K-means clustering originated as a vector quantization method in signal processing and is now popular in the field of data mining. The purpose of k-means clustering is to partition n points (each of which may be one observation or one instance of a sample) into k clusters, so that each point belongs to the cluster whose mean (i.e. cluster centre) is nearest to it; this serves as the clustering criterion. The problem reduces to partitioning the data space into Voronoi cells. It is computationally difficult (NP-hard), but efficient heuristic algorithms exist. In practice, relatively efficient heuristic algorithms are used, which converge quickly to a local optimum. These algorithms are generally similar to the expectation-maximization (EM) algorithm for Gaussian mixture distributions in that both refine the solution iteratively and both model the data with cluster centres; however, k-means tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.
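The following is a minimal k-means sketch, assuming scikit-learn's KMeans in place of the "K-means clustering tool" referred to later; the feature matrix is a made-up stand-in for person-label-frequency data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: persons; columns: weights (FREQ) of web-content attributes (LABEL).
X = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
    [0.0, 6.0, 2.0],
    [1.0, 5.0, 3.0],
])

kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # person -> classification label (KMEANSPERSONCATEGORY)
print(labels)
```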
Simulated annealing:
The earliest idea of simulated annealing (SA) was proposed by N. Metropolis et al. in 1953. In 1983, S. Kirkpatrick et al. successfully introduced the annealing idea into the field of combinatorial optimization. It is a stochastic optimization algorithm based on a Monte-Carlo iterative solution strategy, motivated by the similarity between the annealing of solid matter in physics and general combinatorial optimization problems. Simulated annealing starts from a high initial temperature and, as the temperature parameter keeps decreasing, randomly searches the solution space for the global optimum of the objective function with a probabilistic jump characteristic; that is, it can probabilistically jump out of local optima and eventually tend toward the global optimum. Simulated annealing is a general-purpose optimization algorithm that in theory has probabilistic global optimization performance, and it is widely applied in engineering, for example in VLSI design, production scheduling, control engineering, machine learning, neural networks and signal processing. By giving the search process a time-varying jump probability that eventually goes to zero, simulated annealing is a serially structured optimization algorithm that can effectively avoid getting trapped in local minima and eventually tends toward the global optimum.
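For reference, a textbook simulated-annealing loop might look like the sketch below; this illustrates the generic algorithm described above rather than the patent's specific step-A procedure, and the cost, neighbour and cooling settings are placeholders.

```python
import math
import random

def simulated_annealing(initial, cost, neighbour, t=1.0, cool=0.95, t_min=1e-3):
    """Textbook maximising SA: always accept improvements, accept worse
    moves with probability exp(delta / t), and cool geometrically."""
    current, current_cost = initial, cost(initial)
    while t > t_min:
        candidate = neighbour(current)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta > 0 or random.random() < math.exp(delta / t):
            current, current_cost = candidate, candidate_cost
        t *= cool
    return current, current_cost

# Toy usage: maximise -(x - 3)^2 over the integers with +/-1 moves.
best, value = simulated_annealing(
    0, lambda x: -(x - 3) ** 2, lambda x: x + random.choice([-1, 1]))
```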
Heuristic search:
Two basic goals of computer science are to find algorithms with provably good running efficiency and algorithms that obtain optimal or near-optimal solutions. A heuristic algorithm attempts to provide one or both of these at once. For example, it can often find a very good solution, but there is no proof that it will never produce a worse one; it usually finds an answer in a reasonable time, but there is no guarantee that it always will. In some special cases a heuristic algorithm may produce a very poor answer or run very inefficiently, yet the data that cause those special cases may never occur in the real world. Heuristic algorithms are therefore commonly used to solve real-world problems and can usually obtain a good answer within a reasonable time for many practical problems. Common heuristic algorithms include ant colony algorithms, genetic algorithms and simulated annealing.
Summary of the invention
In order to help administrators understand and improve the online habits of the persons they manage, reduce the adverse effects of internet use, and mine general connections between the managed persons' internet data and their interests, the present invention comprehensively analyzes the managed persons' internet records and, using the LDA-model-based document clustering algorithm and the K-means clustering algorithm, designs and implements a network behavior habit clustering method based on K-means and LDA bi-directional verification, providing a system model with good reference value for the analysis and management of managed persons' internet behavior.
To facilitate understanding of the present invention, its theoretical basis and the difference between this theory and the traditional theory are described as follows:
In traditional clustering methods, cluster analysis is usually carried out on the original data in a single way, and the result is then verified by manual analysis. On the basis of conventional methods, the present invention creatively uses two clustering methods, verifies the accuracy of the clustering algorithms through a custom verification method, and uses simulated annealing to improve the efficiency of optimizing the clustering result.
The technical solution of the invention is: using the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm, first perform K-means clustering on the person-label-frequency set and generate the LDA document topic extraction model from the browsing-record-person-keyword set, and store the intermediate results; then use the annealing algorithm to carry out bi-directional verification between K-means and LDA, compute the globally best topic-classification-label sequence, and on this basis optimize the result of network behavior habit clustering. The solution comprises simulated annealing main flow steps A and cost function flow steps B:
The simulated annealing main flow comprises steps A1 to A26:
Step A1: Let the person-label-frequency set be PERSONLABELFREQ = {(PERSON_p1, LABEL_p1, FREQ_p1), (PERSON_p2, LABEL_p2, FREQ_p2), ..., (PERSON_pa, LABEL_pa, FREQ_pa)}, where PERSON_p1, PERSON_p2, ..., PERSON_pa are person unique identifiers, LABEL_p1, LABEL_p2, ..., LABEL_pa are attributes of the web content a person browsed (one person identifier may correspond to multiple attributes), and FREQ_p1, FREQ_p2, ..., FREQ_pa are the weights of those attributes. Let the browsing-record-person-keyword set be RECORDIDPERSONKEYWORD = {(RECORDID_r1, PERSON_r1, KEYWORD_r1), (RECORDID_r2, PERSON_r2, KEYWORD_r2), ..., (RECORDID_ra, PERSON_ra, KEYWORD_ra)}, where RECORDID_r1, RECORDID_r2, ..., RECORDID_ra are record unique identifiers, each formed from a person identifier and a browsing date, PERSON_r1, PERSON_r2, ..., PERSON_ra are person unique identifiers, and KEYWORD_r1, KEYWORD_r2, ..., KEYWORD_ra are keywords contained in the browsed content. Let ALPHA be the Dirichlet parameter of the topic distribution, ETA the Dirichlet parameter of the keyword distribution, ITERLDA the number of iterations of the LDA document topic extraction model, ITERKMEANS the number of iterations of the K-means clustering algorithm, CATEGORYNUM the total number of classification labels, TOPICNUM the total number of topics of the LDA document topic extraction model, T the simulated annealing temperature, STEP the simulated annealing change step, and COOL the simulated annealing cooling parameter;
Step A2: Let the result set of the K-means clustering algorithm be the person-classification-label set KMEANSPERSONCATEGORY, where classification labels are represented by natural integers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-person set of the LDA model be LDATOPICPERSON, where topics are represented by natural integers; let FACTOR be the globally best topic-classification-label sequence, EA the global maximum matching count, vecb the current topic-classification-label sequence, eb the current matching count, index the current subscript into the topic-classification-label sequence, and curstep the current simulated annealing step;
Step A3: Call the K-means clustering tool, passing in the classification label total CATEGORYNUM, the K-means iteration count ITERKMEANS and the person-label-frequency set PERSONLABELFREQ of step A1; the result is the person-classification-label set of step A2, i.e. KMEANSPERSONCATEGORY = {(PERSON_1, CATEGORY_c1), (PERSON_2, CATEGORY_c2), ..., (PERSON_a, CATEGORY_ca)};
Step A4: Call the LDA modeling tool, passing in the topic total TOPICNUM of the LDA document topic extraction model, the Dirichlet parameter ALPHA of the topic distribution, the Dirichlet parameter ETA of the keyword distribution, the LDA iteration count ITERLDA and the browsing-record-person-keyword set RECORDIDPERSONKEYWORD; the results are the topic-keyword set LDATOPICWORD = {(TOPIC_t1, KEYWORD_1), (TOPIC_t2, KEYWORD_2), ..., (TOPIC_tb, KEYWORD_b)} and the topic-person set LDATOPICPERSON = {(TOPIC_t1, PERSON_p1), (TOPIC_t2, PERSON_p2), ..., (TOPIC_tc, PERSON_pc)} of step A2;
Step A5: Initialize the globally best topic-classification-label sequence FACTOR with random numbers between 0 and CATEGORYNUM-1; the sequence length is the topic total TOPICNUM of the LDA document topic extraction model and each element lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the classification label total; initialize the global maximum matching count EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, ..., FACTOR_TOPICNUM}, EA = 0;
Step A6: While the simulated annealing temperature T of step A1 is greater than 0.1, execute steps A7 to A25; otherwise execute step A26;
Step A7: Assign a random number between 0 and TOPICNUM-1 to the current subscript index of step A2, where TOPICNUM is the topic total of the LDA document topic extraction model of step A1;
Step A8: Assign a random number between -1 × STEP and STEP to the current simulated annealing step curstep of step A2, where STEP is the simulated annealing change step of step A1;
Step A9: Set the current topic-classification-label sequence vecb of step A2 equal to the globally best topic-classification-label sequence FACTOR of step A2, i.e. vecb = FACTOR;
Step A10: Change the value at position index of the current topic-classification-label sequence vecb of step A2 by adding curstep to vecb_index, where index is the current subscript of step A2 and curstep is the current simulated annealing step of step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: If the value at position index of the current topic-classification-label sequence vecb of step A2 is less than 0, i.e. vecb_index < 0, execute step A12; otherwise execute step A13;
Step A12: Set the value at position index of the current topic-classification-label sequence vecb of step A2 to 0, i.e. vecb_index = 0; go to step A15;
Step A13: If the value at position index of the current topic-classification-label sequence vecb of step A2 is greater than CATEGORYNUM-1, where CATEGORYNUM is the classification label total of step A1, i.e. vecb_index > CATEGORYNUM-1, execute step A14; otherwise execute step A15;
Step A14: Set the value at position index of the current topic-classification-label sequence vecb of step A2 to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the classification label total of step A1;
Step A15: Obtain the globally best topic-classification-label sequence FACTOR of step A2;
Step A16: Execute step B;
Step A17: Assign the result of step B to the global maximum matching count EA of step A2;
Step A18: Obtain the current topic-classification-label sequence vecb of step A2;
Step A19: Execute step B;
Step A20: Assign the result of step B to the current matching count eb of step A2;
Step A21: If the current matching count eb of step A2 is greater than the global maximum matching count EA of step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: Generate a random number random with value between 0 and 1;
Step A23: If the random number random of step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2, execute step A24; otherwise execute step A25;
Step A24: Set the globally best topic-classification-label sequence FACTOR of step A2 to the current topic-classification-label sequence vecb of step A2, and set the global maximum matching count EA of step A2 to the current matching count eb of step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: Lower the simulated annealing temperature T of step A1 using the simulated annealing cooling parameter COOL of step A1, i.e. T = T × COOL; execute step A6;
Step A26: Return the globally best topic-classification-label sequence of step A2, i.e. FACTOR = {FACTOR_1, FACTOR_2, ..., FACTOR_TOPICNUM}, and the global maximum matching count EA of step A2;
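The sketch below restates steps A5 to A26 as Python, assuming a cost(sequence) function that implements the step-B flow (for instance the step-B sketch given after steps B below, with its data arguments fixed, e.g. via functools.partial); the temperature, step and cooling values are placeholders rather than values fixed by the method, and an integer perturbation step is assumed.

```python
import math
import random

TOPICNUM, CATEGORYNUM = 20, 3        # values quoted later in the description
T0, STEP, COOL = 100.0, 2, 0.98      # illustrative annealing settings

def anneal(cost):
    # A5: random initial topic -> classification-label sequence, EA = 0
    factor = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]
    ea, t = 0, T0
    while t > 0.1:                                        # A6
        index = random.randint(0, TOPICNUM - 1)           # A7
        curstep = random.randint(-STEP, STEP)             # A8 (integer step assumed)
        vecb = list(factor)                               # A9
        vecb[index] += curstep                            # A10
        vecb[index] = max(0, min(CATEGORYNUM - 1, vecb[index]))  # A11-A14 clamp
        ea = cost(factor)                                 # A15-A17
        eb = cost(vecb)                                   # A18-A20
        # A21-A24: the steps accept the candidate only when eb > ea and a
        # uniform random number is below exp((eb - ea) / t); that factor is
        # >= 1 whenever eb > ea, so in effect improvements are always kept
        if eb > ea and random.random() < math.exp(min((eb - ea) / t, 50.0)):
            factor, ea = vecb, eb
        t *= COOL                                         # A25
    return factor, ea                                     # A26
```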
The cost function flow comprises steps B1 to B15:
Step B1: Obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: Let the person-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the person unique identifier set be LDAPERSON, the total number of persons be LDAPERSONNUM, the set of all classification labels be CATEGORY, the matching count be SUM, the topic set of a single person be singlepersontopic, the person set of the current class in the LDA document topic extraction model be ldacurcategoryperson, the person set of the current class in the K-means clustering algorithm be kmeanscurcategoryperson, the set of overlapping persons be unionperson, and the number of overlapping persons be unionpersonnum;
Step B3: From the topic-person set LDATOPICPERSON of the LDA model of step A2, extract the person unique identifier set LDAPERSON of step B2 and remove duplicates, i.e. LDAPERSON = Π_2(LDATOPICPERSON) = {PERSON_p1, PERSON_p2, ..., PERSON_pd};
Step B4: From the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2, extract the set CATEGORY of all classification labels of step B2 and remove duplicates, i.e. CATEGORY = Π_2(KMEANSPERSONCATEGORY) = {CATEGORY_c1, CATEGORY_c2, ..., CATEGORY_cd};
Step B5: Set a loop variable i, i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2;
Step B6: From the topic-person set LDATOPICPERSON of the LDA model of step A2, extract the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPIC_t1, TOPIC_t2, ..., TOPIC_tc}, where singlepersontopic comes from step B2;
Step B7: For each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR of step B1, i.e. category_t1 = TMPFACTOR_TOPICt1, category_t2 = TMPFACTOR_TOPICt2, ..., category_tc = TMPFACTOR_TOPICtc, where category_t1, category_t2, ..., category_tc are classification labels and different variables may represent the same label, and singlepersontopic comes from step B2; count the number of occurrences of each classification label, denoted categorysnum_1, categorysnum_2, ..., categorysnum_CATEGORYNUM, and find the classification label category with the largest number of occurrences; update the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: If the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons of step B2, execute step B9; otherwise add 1 to i, i.e. i = i + 1, and execute steps B6 to B7;
Step B9: Set a loop variable j, j < CATEGORYNUM, where CATEGORYNUM is the classification label total of step A1, and set the matching count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: From the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm of step A2, extract the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSON_kmeans1, PERSON_kmeans2, ..., PERSON_kmeansc}, where kmeanscurcategoryperson comes from step B2;
Step B11: From the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2, extract the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSON_lda1, PERSON_lda2, ..., PERSON_ldac}, where ldacurcategoryperson comes from step B2;
Step B12: Compute the intersection unionperson of the person set ldacurcategoryperson of step B2 and the person set kmeanscurcategoryperson of step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSON_union1, PERSON_union2, ..., PERSON_unionc};
Step B13: Count the elements of the overlapping person set unionperson of step B2, assign the count to the number of overlapping persons unionpersonnum of step B2, and add it to the matching count SUM of step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: If the loop variable j is greater than the classification label total CATEGORYNUM of step A1, execute step B15; otherwise add 1 to j, i.e. j = j + 1, and execute steps B10 to B13;
Step B15: Return the matching count SUM of step B2.
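A compact Python rendering of the step-B cost function is given below; the argument formats (lists of (topic, person) and (person, label) pairs) are illustrative stand-ins for the sets LDATOPICPERSON and KMEANSPERSONCATEGORY named in steps A2 and B2.

```python
from collections import Counter, defaultdict

def cost(tmpfactor, lda_topic_person, kmeans_person_category, categorynum):
    """Matching count SUM between the K-means labels and the labels induced
    on the LDA result by the topic -> label sequence tmpfactor (steps B1-B15)."""
    # B5-B8: give every person the label most of its LDA topics map to
    topics_by_person = defaultdict(list)              # singlepersontopic per person
    for topic, person in lda_topic_person:            # LDATOPICPERSON pairs
        topics_by_person[person].append(topic)
    lda_person_category = {                           # LDAPERSONCATEGORY
        person: Counter(tmpfactor[t] for t in topics).most_common(1)[0][0]
        for person, topics in topics_by_person.items()
    }
    # B9-B14: for each label, count persons placed there by both methods
    total = 0                                         # SUM
    for j in range(categorynum):
        kmeans_members = {p for p, c in kmeans_person_category if c == j}
        lda_members = {p for p, c in lda_person_category.items() if c == j}
        total += len(kmeans_members & lda_members)    # unionpersonnum
    return total                                      # B15

# Example: cost([0, 2, 1], [(0, "P1"), (1, "P2")], [("P1", 0), ("P2", 1)], 3) == 1
```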
Here, clustering means that persons' internet records are cluster-analyzed with both the K-means clustering algorithm and the LDA model, the two clustering results are mutually verified, and simulated annealing is used to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Steps A3 to A4 provide the initial data needed by simulated annealing; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR of step B1 with the topic-person set LDATOPICPERSON of the LDA model of step A2, producing the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model of step B2; steps B9 to B14 compare, for each class, the persons that appear in that class both in the K-means clustering result and in the LDA model result, accumulate the number of such persons over all classes, and finally return it as the cost of the current sequence; steps A21 to A24 compare eb with EA and compare the random number random of step A22 with e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2; when eb > EA and random < e^((eb-EA)/T), the globally best topic-classification-label sequence FACTOR and the global maximum matching count EA are updated, the cost values eb and EA being obtained through step B above. The final result returns the global maximum matching count EA and the globally best topic-classification-label sequence FACTOR.
In steps A3 and A4, the K-means iteration count ITERKMEANS is 300, the classification label total CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the LDA iteration count ITERLDA is 2000, and the topic total TOPICNUM of the LDA document topic extraction model of step A1 is 20.
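Expressed against the scikit-learn stand-ins used in the earlier sketches (a library choice assumed here, not stated in the patent), these values correspond to the following keyword arguments.

```python
# The parameter values above, as keyword arguments for the scikit-learn
# stand-ins used in the earlier sketches (the library choice is an assumption).
KMEANS_PARAMS = dict(n_clusters=3, max_iter=300)   # CATEGORYNUM, ITERKMEANS
LDA_PARAMS = dict(
    n_components=20,         # TOPICNUM
    doc_topic_prior=0.1,     # ALPHA
    topic_word_prior=0.01,   # ETA
    max_iter=2000,           # ITERLDA
)
```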
The invention uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm: it first performs K-means clustering on the person-label-frequency set and generates the LDA document topic extraction model from the browsing-record-person-keyword set, stores the intermediate results, and then uses the annealing algorithm to carry out bi-directional verification between K-means and LDA and to compute the globally best topic-classification-label sequence, on the basis of which the network behavior habit clustering result is optimized. The K-means and LDA bi-directional verification improves sensitivity to person classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Brief description of the drawings
Figure 1 shows the simulated annealing main flow.
Figure 2 shows the cost function flow.
Specific embodiment
The technical solution of the present invention is described in detail below with reference to the accompanying drawings:
As shown in Figure 1, the simulated annealing main flow comprises steps A1 to A26, carried out exactly as set out in steps A1 to A26 of the Summary of the Invention above.
As shown in Figure 2, the cost function flow comprises steps B1 to B15, carried out exactly as set out in steps B1 to B15 of the Summary of the Invention above.
The roles of the individual steps and the concrete parameter values used in this embodiment are likewise as described in the Summary of the Invention above.
The present invention can be combined with a computer system to perform person network behavior habit clustering automatically.

Claims (4)

1. A network behavior habit clustering method based on K-means and Latent Dirichlet Allocation (LDA) bi-directional verification, characterized in that it uses the webpage attributes, keywords and frequencies in persons' internet records, combined with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm: first, K-means clustering is performed on the person-label-frequency set and the LDA document topic extraction model is generated from the browsing-record-person-keyword set, and the intermediate results are stored; the annealing algorithm then carries out bi-directional verification between K-means and LDA and computes the globally best topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized; the method comprises simulated annealing main flow steps A and cost function flow steps B:
Simulated annealing main flow step A1 to step A26:
Step A1: personnel-label-frequency set is set as PERSONLABELFREQ={ (PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa), wherein PERSONp1, PERSONp2, …, PERSONpaRepresent personnel's unique identification, LABELp1, LABELp2, …, LABELpaGeneration Table personnel surf the web the integrity attribute of content, and personnel's unique identification can correspond to multiple attributes, FREQp1, FREQp2, …, FREQpaThe personnel of representative surf the web content integrity attribute weight, record-personnel-keyword if personnel surf the web Integrate as RECORDIDPERSONKEYWORD={ (RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra), wherein RECORDIDr1, RECORDIDr2, …, RECORDIDraPersonnel record's unique identification is represented, is made of personnel's unique identification and online date, PERSONr1, PERSONr2, …, PERSONraRepresent personnel's unique identification, KEYWORDr1, KEYWORDr2, …, KEYWORDraThe personnel of representative surf the web the keyword that content includes, if the Di Li Cray parameter of theme distribution is ALPHA, if closing The Di Li Cray parameter of keyword distribution is ETA, if it is ITERLDA that LDA document subject matter, which extracts model the number of iterations, if K-means Clustering algorithm the number of iterations is ITERKMEANS, if tag along sort sum is CATEGORYNUM, if LDA document subject matter extracts mould The theme sum of type is TOPICNUM, if simulated annealing temperature is T, if it is STEP that simulated annealing, which changes step-length, if Simulated annealing cooling parameter is COOL;
Step A2: the result set of K-means clustering algorithm is set as personnel-tag along sort collection, that is, be expressed as KMEANSPERSONCATEGORY, wherein tag along sort is indicated by natural integer;If theme-keyword set of LDA model is LDATOPICWORD, wherein theme is indicated by natural integer;If theme-personnel of LDA model integrate as LDATOPICPERSON, Wherein, theme is indicated by natural integer;If global best theme-tag along sort sequence is FACTOR, if global maximum matching number For EA, if current topic-tag along sort sequence is vecb, if current matching number is eb, if theme-tag along sort sequence is when front lower It is designated as index, if simulated annealing current step is curstep;
Step A3: calling K-means clustering algorithm tool, is passed to tag along sort sum CATEGORYNUM, the step A1 of step A1 K-means clustering algorithm the number of iterations ITERKMEANS and step A1 personnel-label-frequency set PERSONLABELFREQ, obtaining K-means clustering algorithm result set is personnel-tag along sort collection, that is, is obtained KMEANSPERSONCATEGORY={(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca), wherein KMEANSPERSONCATEGORY comes from step A2;
Step A4: calling LDA modeling tool, is passed to theme sum TOPICNUM, theme distribution that LDA document subject matter extracts model Di Li Cray parameter ALPHA, keyword distribution Di Li Cray parameter ETA, LDA document subject matter extract model the number of iterations ITERLDA and personnel surf the web record-personnel-keyword set RECORDIDPERSONKEYWORD, obtain LDA model Theme-keyword set, that is, obtain LDATOPICWORD={ (TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb) and LDA model theme-personnel collection, that is, LDATOPICPERSON={ (TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc), wherein LDATOPICWORD and LDATOPICPERSON comes from step A2;
Step A5: with 0 to the global best theme-tag along sort sequence of random number initialization between CATEGORYNUM-1 FACTOR, sequence length are the theme sum TOPICNUM that LDA document subject matter extracts model, the range of each element in sequence 0 between CATEGORYNUM-1, wherein CATEGORYNUM is tag along sort sum, initializes global maximum matching number EA It is 0, that is, FACTOR={ FACTOR1, FACTOR2, …, FACTORTOPICNUM, EA=0;
Step A6: when the simulated annealing temperature T in step A1 is greater than 0.1, A7 is thened follow the steps to step A25;Otherwise Execute step A26;
Step A7: to the theme in step A2-tag along sort sequence when presubscript index is with random number assignment, the model of random number It is trapped among between 0 and TOPICNUM-1, wherein TOPICNUM is the theme sum of the LDA document subject matter extraction model in step A1;
Step A8: being curstep with random number assignment, the range of random number to the simulated annealing current step in step A2 Between -1 × STEP and STEP, wherein STEP is that the simulated annealing in step A1 changes step-length;
Step A9: current topic-tag along sort sequence vecb in step A2 is enabled to be equal to the best theme-point of the overall situation in step A2 Class sequence label FACTOR, that is, vecb=FACTOR;
Step A10: changing the step current topic-tag along sort sequence vecb in A2, and the numerical value on the i-th position ndex enables vecbindexIn addition curstep, wherein index is that theme-tag along sort sequence of step A2 weight works as presubscript, and curstep is Simulated annealing current step in step A2, that is, vecbindex= vecbindex+curstep;
Step A11: when numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is less than 0 When, that is, vecbindex< 0, then follow the steps A12;It is no to then follow the steps A13;
Step A12: enabling numerical value of the current topic-tag along sort sequence vecb on the i-th position ndex in step A2 be equal to 0, That is, vecbindex=0;Go to step A15;
Step A13: when numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is greater than When CATEGORYNUM-1, wherein CATEGORYNUM is the tag along sort sum in step A1, that is, vecbindex> CATEGORYNUM-1 thens follow the steps A14;It is no to then follow the steps A15;
Step A14: numerical value of the current topic in the step A2-tag along sort sequence vecb on the i-th position ndex is enabled to be equal to CATEGORYNUM-1, that is, vecbindex=CATEGORYNUM-1, wherein CATEGORYNUM is the tag along sort in step A1 Sum;
Step A15: take the global best topic-classification-label sequence FACTOR from step A2;
Step A16: execute step B with FACTOR as the input sequence;
Step A17: assign the result of step B to the global maximum matching count EA in step A2;
Step A18: take the current topic-classification-label sequence vecb from step A2;
Step A19: execute step B with vecb as the input sequence;
Step A20: assign the result of step B to the current matching count eb in step A2;
Step A21: when the current matching count eb in step A2 is greater than the global maximum matching count EA in step A2, i.e. eb > EA, execute step A22; otherwise execute step A25;
Step A22: generate a random number random, whose value ranges from 0 to 1;
Step A23: when the random number random from step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current matching count in step A2 and EA is the global maximum matching count in step A2, execute step A24; otherwise execute step A25;
Step A24: set the global best topic-classification-label sequence FACTOR in step A2 to the current topic-classification-label sequence vecb in step A2, and set the global maximum matching count EA in step A2 to the current matching count eb in step A2, i.e. FACTOR=vecb, EA=eb;
Step A25: reduce the simulated annealing temperature T in step A1 using the simulated annealing cooling coefficient COOL in step A1, i.e. T = T × COOL; execute step A6;
Step A26: return the global best topic-classification-label sequence from step A2, i.e. FACTOR={FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, and return the global maximum matching count EA from step A2;
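The whole annealing loop of steps A6 to A26 can be sketched as below, with the acceptance rule kept exactly as written in the claim (a candidate is adopted only when eb > EA and random < e^((eb-EA)/T)); cost() stands for the step-B procedure that follows and is passed in as a function.

```python
# Sketch of steps A6-A26; cost(sequence) is assumed to return the step-B matching count.
import math
import random

def anneal(factor, ea, category_num, step, t, cool, cost):
    while t > 0.1:                                                  # step A6
        index = random.randint(0, len(factor) - 1)                  # step A7
        curstep = random.randint(-step, step)                       # step A8
        vecb = list(factor)                                         # step A9
        vecb[index] = max(0, min(vecb[index] + curstep,
                                 category_num - 1))                 # steps A10-A14
        ea = cost(factor)                                           # steps A15-A17
        eb = cost(vecb)                                             # steps A18-A20
        if eb > ea and random.random() < math.exp((eb - ea) / t):   # steps A21-A23
            factor, ea = vecb, eb                                   # step A24
        t *= cool                                                   # step A25
    return factor, ea                                               # step A26
```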
The cost function process flow, step B, runs from step B1 to step B15:
Step B1: obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: let LDAPERSONCATEGORY be the personnel-classification-label set of the LDA document topic extraction model, LDAPERSON the set of unique personnel identifiers, LDAPERSONNUM the total number of personnel, CATEGORY the set of all classification labels, SUM the matching count, singlepersontopic the topic set of a single person, ldacurcategoryperson the personnel set of the current class in the LDA document topic extraction model, kmeanscurcategoryperson the personnel set of the current class in the K-means clustering result, unionperson the set of overlapping personnel, and unionpersonnum the number of overlapping personnel;
Step B3: filter the personnel unique-identifier set LDAPERSON of step B2 out of the LDA model topic-personnel set LDATOPICPERSON from step A2 and de-duplicate the result, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: filter the set of all classification labels CATEGORY of step B2 out of the personnel-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2 and de-duplicate the result, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
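Steps B3 and B4 are projections onto the second component of each pair followed by de-duplication; a sketch:

```python
# Sketch of steps B3-B4: the projection written as Pi_2 in the claim.
def project_unique(pairs, position=1):
    return {pair[position] for pair in pairs}

# LDAPERSON = project_unique(LDATOPICPERSON)        # persons from (topic, person) pairs
# CATEGORY  = project_unique(KMEANSPERSONCATEGORY)  # labels from (person, label) pairs
```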
Step B5: set the loop variable i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel in step B2;
Step B6: filter out of the LDA model topic-personnel set LDATOPICPERSON from step A2 the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR from step B1, i.e. categoryt1 = TMPFACTOR_TOPICt1, categoryt2 = TMPFACTOR_TOPICt2, …, categorytc = TMPFACTOR_TOPICtc, where categoryt1, categoryt2, …, categorytc denote classification labels, different variables may denote the same classification label, and singlepersontopic is defined in step B2; count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the classification label category with the largest occurrence count; update the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: when the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel in step B2, execute step B9; otherwise increment i by 1, i.e. i=i+1, and execute steps B6 to B7;
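Steps B5 to B8 give each person the classification label that occurs most often among that person's topics once they are mapped through TMPFACTOR; a sketch, assuming topics are integer indices into TMPFACTOR:

```python
# Sketch of steps B5-B8: per-person majority label via the topic-to-label sequence.
from collections import Counter, defaultdict

def lda_person_category(lda_topic_person, tmpfactor):
    person_topics = defaultdict(list)
    for topic, person in lda_topic_person:               # group topics by person (step B6)
        person_topics[person].append(topic)
    assignment = set()
    for person, topics in person_topics.items():
        counts = Counter(tmpfactor[t] for t in topics)   # label occurrence counts (step B7)
        category = counts.most_common(1)[0][0]           # most frequent label
        assignment.add((person, category))
    return assignment                                    # LDAPERSONCATEGORY
```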
Step B9: set the loop variable j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of classification labels in step A1, and set the matching count SUM in step B2 to 0, i.e. SUM=0;
Step B10: filter out of the personnel-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm from step A2 the personnel set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: filter out of the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2 the personnel set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the personnel set ldacurcategoryperson in step B2 and the personnel set kmeanscurcategoryperson in step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: count the number of personnel in the overlap set unionperson of step B2, assign it to the overlap count unionpersonnum in step B2, and add it to the matching count SUM in step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: when the loop variable j is greater than the total number of classification labels CATEGORYNUM in step A1, execute step B15; otherwise increment j by 1, i.e. j=j+1, and execute steps B10 to B13;
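Steps B9 to B14 count, label by label, the personnel placed in the same class by both K-means and the LDA-derived assignment, and accumulate the overlap; a sketch:

```python
# Sketch of steps B9-B15: accumulate per-label overlaps into the matching count SUM.
def matching_count(kmeans_person_category, lda_person_category_set, category_num):
    total = 0                                                               # SUM (step B9)
    for j in range(category_num):
        kmeans_members = {p for p, c in kmeans_person_category if c == j}   # step B10
        lda_members = {p for p, c in lda_person_category_set if c == j}     # step B11
        total += len(kmeans_members & lda_members)                          # steps B12-B13
    return total                                                            # step B15
```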
Step B15: return the matching count SUM from step B2.
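Putting the sketches above together, the step-B matching count is the cost that the step-A search maximizes; the upper-case names below reuse the claim's set and parameter names as if they were Python variables holding the corresponding data, which is an illustrative convenience only.

```python
# Illustrative glue: cost() evaluates one candidate sequence via the step-B sketches.
def cost(tmpfactor):
    lda_assign = lda_person_category(LDATOPICPERSON, tmpfactor)           # steps B5-B8
    return matching_count(KMEANSPERSONCATEGORY, lda_assign, CATEGORYNUM)  # steps B9-B15

# FACTOR, EA = init_state(TOPICNUM, CATEGORYNUM)                      # step A5
# FACTOR, EA = anneal(FACTOR, EA, CATEGORYNUM, STEP, T, COOL, cost)   # steps A6-A26
```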
2. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the clustering performs cluster analysis of personnel web-browsing records with the K-means clustering algorithm and the LDA model, the two clustering results are mutually verified, and the simulated annealing algorithm is used to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
3. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 modify the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the LDA model topic-personnel set LDATOPICPERSON from step A2, producing the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 compare, for each class, the number of personnel appearing in both the K-means clustering result and the LDA model result for the same class, accumulate this number over all classes, and finally return it in the procedure as the cost of the current sequence; steps A14 to A18 compare eb with EA and compare the random number random of step A15 with e^((eb-EA)/T), where eb is the current matching count of step A2 and EA is the global maximum matching count of step A2; when eb > EA and random < e^((eb-EA)/T), the value of the global best topic-classification-label sequence FACTOR and the value of the global maximum matching count EA are updated, the cost values eb and EA being obtained through step B above; the final result returns the global maximum matching count EA and the global best topic-classification-label sequence FACTOR.
4. The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, wherein the K-means clustering iteration count ITERKMEANS in step A3 is 300, the total number of classification labels CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution in step A4 is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the iteration count ITERLDA of the LDA document topic extraction model is 2000, and the total number of topics TOPICNUM of the LDA document topic extraction model in step A1 is 20.
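For reference, the parameter values recited in this claim, collected into one configuration; the dict itself is only an illustrative convenience, not part of the claimed method.

```python
# Parameter values from claim 4 (names mirror the claim).
DEFAULT_PARAMS = {
    "ITERKMEANS": 300,   # K-means iteration count (step A3)
    "CATEGORYNUM": 3,    # total number of classification labels
    "ALPHA": 0.1,        # Dirichlet parameter of the topic distribution (step A4)
    "ETA": 0.01,         # Dirichlet parameter of the keyword distribution
    "ITERLDA": 2000,     # LDA iteration count
    "TOPICNUM": 20,      # total number of topics (step A1)
}
```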
CN201610565749.XA 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification Active CN106202480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Publications (2)

Publication Number Publication Date
CN106202480A CN106202480A (en) 2016-12-07
CN106202480B true CN106202480B (en) 2019-06-11

Family

ID=57493136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610565749.XA Active CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Country Status (1)

Country Link
CN (1) CN106202480B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN107305614B (en) * 2017-08-12 2020-05-26 西安电子科技大学 Method for processing big data based on MLDM algorithm meeting secondary aggregation
CN108460630B (en) * 2018-02-12 2021-11-02 广州虎牙信息科技有限公司 Method and device for carrying out classification analysis based on user data
CN110276503B (en) * 2018-03-14 2023-04-21 吉旗物联科技(上海)有限公司 Method for automatically identifying cold chain vehicle task
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN112800419A (en) * 2019-11-13 2021-05-14 北京数安鑫云信息技术有限公司 Method, apparatus, medium and device for identifying IP group
CN112214515A (en) * 2020-10-16 2021-01-12 平安国际智慧城市科技股份有限公司 Data automatic matching method and device, electronic equipment and storage medium
CN112883154B (en) * 2021-01-28 2022-02-01 平安科技(深圳)有限公司 Text topic mining method and device, computer equipment and storage medium
CN113204641B (en) * 2021-04-12 2022-09-02 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113312450B (en) * 2021-05-28 2022-05-31 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN114742869B (en) * 2022-06-15 2022-08-16 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090290508A1 (en) * 2008-05-22 2009-11-26 At&T Labs, Inc. Method for optimizing network "Point of Presence" locations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194012A (en) * 2011-06-17 2011-09-21 清华大学 Microblog topic detecting method and system
CN102609719A (en) * 2012-01-19 2012-07-25 北京工业大学 Method for identifying place image on the basis of improved probabilistic topic model
CN103678500A (en) * 2013-11-18 2014-03-26 南京邮电大学 Data mining improved type K mean value clustering method based on linear discriminant analysis
CN103632166A (en) * 2013-12-04 2014-03-12 西安电子科技大学 Aurora image classification method based on latent theme combining with saliency information
CN103793501A (en) * 2014-01-20 2014-05-14 惠州学院 Theme community discovery method based on social network
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104750856A (en) * 2015-04-16 2015-07-01 天天艾米(北京)网络科技有限公司 System and method for multi-dimensional synergic recommendation
CN105303199A (en) * 2015-12-08 2016-02-03 南京信息工程大学 Data fragment type identification method based on content characteristics and K-means
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Tropical wood species recognition system based on multi-feature extractors and classifiers";Marzuki Khalid et al.;《2011 2nd International Conference on Instrumentation Control and Automation》;20120119;全文
"基于隐含狄利克雷分配模型的图像分类算法";杨赛 等;《计算机工程》;20120731;第38卷(第14期);全文
"基于隐含狄利克雷分配的微博推荐模型研究";唐晓波 等;《情报科学》;20150228;第33卷(第2期);全文

Also Published As

Publication number Publication date
CN106202480A (en) 2016-12-07

Similar Documents

Publication Publication Date Title
CN106202480B (en) A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Zhang et al. A multi-objective evolutionary approach for mining frequent and high utility itemsets
CN105022754A (en) Social network based object classification method and apparatus
CN110750645A (en) Cross-domain false comment identification method based on countermeasure training
Sharma et al. Trend analysis in machine learning research using text mining
Guo et al. Multi-label classification methods for green computing and application for mobile medical recommendations
Gan et al. R-RNN: Extracting user recent behavior sequence for click-through rate prediction
Yu et al. Data cleaning for personal credit scoring by utilizing social media data: An empirical study
Liu et al. A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge
Naghavipour et al. Hybrid metaheuristics for QoS-aware service composition: a systematic mapping study
CN106919997A (en) A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA
Yu et al. Spectrum-enhanced pairwise learning to rank
Zhang et al. Multi-dimension topic mining based on hierarchical semantic graph model
Sharma et al. A study of tree based machine learning techniques for restaurant reviews
Niu et al. Deep adversarial autoencoder recommendation algorithm based on group influence
Xiao A Survey of Document Clustering Techniques & Comparison of LDA and moVMF
Dehghan et al. An improvement in the quality of expert finding in community question answering networks
CN106649380A (en) Hot spot recommendation method and system based on tag
Niham et al. Utilization of Big Data in Libraries by Using Data Mining
Zhang A short introduction to data mining and its applications
Ahn et al. Using genetic algorithms to optimize nearest neighbors for data mining
Xin et al. When factorization meets heterogeneous latent topics: an interpretable cross-site recommendation framework
Singh Sentiment analysis of online mobile reviews
Osial et al. Smartphone recommendation system using web data integration techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant