CN106202480A - Network behavior habit clustering method based on K-means and LDA bi-directional verification - Google Patents

Network behavior habit clustering method based on K-means and LDA bi-directional verification

Info

Publication number
CN106202480A
CN106202480A (application CN201610565749.XA; granted as CN106202480B)
Authority
CN
China
Prior art keywords
personnel
classification label
topic
LDA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610565749.XA
Other languages
Chinese (zh)
Other versions
CN106202480B (en)
Inventor
朱全银
辛诚
李翔
许康
潘舒新
孙青怡
周泓
严云洋
胡荣林
冯万利
王留洋
王海云
袁媛
唐海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201610565749.XA priority Critical patent/CN106202480B/en
Publication of CN106202480A publication Critical patent/CN106202480A/en
Application granted granted Critical
Publication of CN106202480B publication Critical patent/CN106202480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network behavior habit clustering method based on bi-directional verification of K-means and LDA. The invention uses the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and a simulated annealing algorithm. First, K-means clustering is applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set, and the intermediate results of the computation are stored. The annealing algorithm is then used to verify K-means and LDA against each other and to compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Description

Network behavior habit clustering method based on K-means and LDA bi-directional verification
Technical field
The invention belongs to the field of cluster analysis and optimization algorithms, and in particular relates to a network behavior habit clustering method based on K-means and LDA bi-directional verification, which is used to optimize clustering results, thereby improving clustering accuracy and increasing the usage value of personnel browsing-record information.
Background technology
Mastering clustering methods for network behavior habit data is of great importance for research on online habits. With the continuing spread of the Internet, more and more people choose to obtain information of interest through the network. The amount of content browsed by personnel online is huge; relying on manual analysis of these data is not only inefficient but also of limited accuracy. Cluster analysis, combined with bi-directional verification against a second clustering method, can improve both the efficiency and the accuracy of the analysis. Common clustering algorithms include K-means clustering and the LDA document topic extraction model; common optimization algorithms include simulated annealing and genetic algorithms.
Related work on clustering and optimization algorithms includes: Pang Feng. The principle of simulated annealing and the application of the algorithm to optimization problems. Master's thesis, Jilin University, 2006; Li Xiangping, Zhang Hongyang. Principle and improvement of simulated annealing. Software Guide, 2008(4): 47-48; Yang Mengduo, Li Fanchang, Zhang Li. Ten years of progress in Lie-group machine learning. Chinese Journal of Computers, 2015(7): 1337-1356; Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, Vol.3: 993-1022; Yuan J, Gao F, Ho Q, et al. LightLDA: Big Topic Models on Modest Computer Clusters. Proceedings of the International Conference on World Wide Web, ACM, 2015. The existing research of Zhu Quanyin et al. includes: Li Xiang, Zhu Quanyin. Collaborative filtering recommendation combining clustering and rating-matrix sharing. Computer Science and Exploration, 2014, Vol.8(6): 751-759; Suqun Cao, Quanyin Zhu, Zhiwei Hou. Customer Segmentation Based on a Novel Hierarchical Clustering Algorithm. 2009, p:1-5; Quanyin Zhu, Sunqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Suqun Cao, Zhiwei Hou, Liuyang Wang, Quanyin Zhu. Kernelized Fuzzy Fisher Criterion based Clustering Algorithm. DCABES 2010, p:87-91; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093; Suqun Cao, Gelan Yang, Quanyin Zhu, Haihei Zhai. A novel feature extraction method for mechanical part recognition. Applied Mechanics and Materials, 2011, p:116-121; Pei Zhou, Quanyin Zhu. Multi-factor Matching Method for Basic Information of Science and Technology Experts Based on Web Mining. 2012, p:718-720; Jianping Deng, Fengwen Cao, Quanyin Zhu, Yu Zhang. The Web Data Extracting and Application for Shop Online Based on Commodities Classified. Communications in Computer and Information Science, Vol.234(4): 120-128; Hui Zong, Quanyin Zhu, Ming Sun, Yahong Zhang. The case study for human resource management research based on web mining and semantic analysis. Applied Mechanics and Materials, Vol.488, 2014, p:1336-1339. Patents applied for, published or granted by Zhu Quanyin et al. include: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and adaptive windowing. Chinese patent ZL 2011 1 0423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price forecasting method based on dichotomy backfilling and disturbance factors. Chinese patent ZL 2011 1 0422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Chen Ting, Cao Suqun. A data preprocessing method for multi-variety commodity price forecasting based on neural networks. Chinese patent ZL 2012 1 0325368.6, 2016.06.08; Zhu Quanyin, Pan Lu, Liu Wenru, Li Xiang, Zhou Hong, Hu Ronglin, Ding Jin, Jin Ying, Shao Wujie, Tang Haibo. An incremental-learning multi-level binary classification method for science and technology news. Chinese patent publication CN 105205163A, 2015.12.30; Zhu Quanyin, Yan Yunyang, Huang Taoyi, Zhang Yuyang, Xin Cheng, et al. An implementation method for campus personalized palm services and user behavior habit analysis. Chinese patent publication CN 104731971A, 2015.06.24; Zhu Quanyin, Shen Enqiang, Qian Yaping, Zhou Hong, et al. A K-means-based multi-weight adaptive analysis method for students' learning behavior. Chinese patent application 201610222553.0, 2016.04.13; Zhu Quanyin, Shao Wujie, Tang Haibo, Zhou Hong, Li Xiang, Hu Ronglin, Jin Ying, Cao Suqun, Pan Shuxin. A multi-level multi-classification method for science and technology news headlines. Chinese patent publication CN 105205163A, 2016.07.13; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. An intelligent recommendation method for cold-chain logistics stowage based on spectral clustering. Chinese patent publication CN 105654267A, 2016.06.08.
LDA document topic extraction model:
LDA (Latent Dirichlet Allocation) is a document topic generation model, also known as a three-layer Bayesian probabilistic model, comprising a word, topic and document structure. It is a generative model: every word of every article is assumed to be produced by the process "choose a topic with some probability, then choose a word from that topic with some probability". Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in large document collections or corpora. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text into numerical information that is easy to model. The bag-of-words approach ignores the order between words, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words.
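The following minimal sketch is illustrative only and is not the patent's implementation; it assumes scikit-learn and a few invented keyword documents, and maps the parameters named later in this description (ALPHA, ETA, ITERLDA, TOPICNUM) onto the corresponding scikit-learn arguments:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical browsing records: one "document" of keywords per record.
    docs = ["sports football match score",
            "exam revision math homework",
            "sports basketball team training"]

    X = CountVectorizer().fit_transform(docs)              # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2,        # TOPICNUM
                                    doc_topic_prior=0.1,   # ALPHA
                                    topic_word_prior=0.01, # ETA
                                    max_iter=50,           # ITERLDA
                                    random_state=0)
    doc_topic = lda.fit_transform(X)   # per-document topic distribution
    topic_word = lda.components_       # per-topic word weights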
K-means clustering algorithm:
K-means originated as a vector quantization method in signal processing and is now popular as a clustering method in data mining. The purpose of k-means clustering is to partition n points (observations or samples) into k clusters such that each point belongs to the cluster whose mean (the cluster center) is nearest to it; this serves as the clustering criterion. The problem amounts to partitioning the data space into Voronoi cells. It is computationally hard (NP-hard), but efficient heuristic algorithms exist. The commonly used heuristics converge quickly to a locally optimal solution. They are broadly similar to the expectation-maximization (EM) algorithm for Gaussian mixture models in that both refine cluster centers iteratively and use cluster centers to model the data; however, k-means tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.
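A correspondingly minimal K-means call, again an illustrative assumption using scikit-learn rather than the patent's own tooling, with an invented person-by-label frequency matrix and the CATEGORYNUM and ITERKMEANS parameters defined later:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical person x label-frequency matrix (one row per person).
    freq = np.array([[5.0, 0.0, 1.0],
                     [0.0, 4.0, 2.0],
                     [6.0, 1.0, 0.0],
                     [0.0, 5.0, 1.0],
                     [1.0, 0.0, 7.0],
                     [0.0, 1.0, 6.0]])

    km = KMeans(n_clusters=3,     # CATEGORYNUM
                max_iter=300,     # ITERKMEANS
                n_init=10,
                random_state=0)
    labels = km.fit_predict(freq)   # one classification label (0, 1 or 2) per person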
Simulated annealing:
The idea of simulated annealing (SA) was first proposed by N. Metropolis et al. in 1953; in 1983, S. Kirkpatrick et al. successfully introduced the annealing idea into the field of combinatorial optimization. It is a stochastic optimization algorithm based on a Monte Carlo iterative strategy, whose starting point is the similarity between the annealing process of solid matter in physics and general combinatorial optimization problems. Starting from a relatively high initial temperature and lowering the temperature parameter continuously, simulated annealing searches the solution space randomly for the global optimum of the objective function, using a probabilistic jump property that allows it to escape local optima and eventually tend towards the global optimum. It is a general-purpose optimization algorithm with a theoretical guarantee of probabilistic global optimization, and it has been widely applied in engineering, for example in VLSI design, production scheduling, control engineering, machine learning, neural networks and signal processing. In essence, simulated annealing is a serial optimization procedure that gives the search process a time-varying, eventually vanishing probability of jumping, so that it can effectively avoid being trapped in local minima and finally tend towards the global optimum.
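A generic skeleton of the acceptance rule described above might look as follows; this is a textbook-style sketch, not the patent's specific main flow (which is given in the Summary of the invention below):

    import math
    import random

    def anneal(initial, cost, neighbour, T=1.0, cool=0.98, T_min=0.1):
        # Minimize cost() starting from 'initial'; neighbour() proposes a nearby solution.
        state, state_cost = initial, cost(initial)
        while T > T_min:
            candidate = neighbour(state)
            c = cost(candidate)
            # Always accept improvements; accept worse candidates with probability
            # exp(-(c - state_cost) / T), which shrinks as the temperature falls.
            if c < state_cost or random.random() < math.exp(-(c - state_cost) / T):
                state, state_cost = candidate, c
            T *= cool
        return state, state_cost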
Heuristic search:
Two basic goals of computer science are to find algorithms with provably good running time and with provably optimal or near-optimal solutions. A heuristic algorithm tries to deliver one or both of these goals at once: it can often find a very good solution, but there is no proof that it will never return a worse one; it can usually produce an answer in reasonable time, but there is no guarantee that it will always do so. In some special cases a heuristic may return a very poor answer or run very inefficiently, but the data that trigger such cases may hardly ever occur in the real world, so heuristic algorithms are commonly used to solve practical problems and usually obtain good answers within reasonable time. Common heuristic algorithms include ant colony optimization, genetic algorithms and simulated annealing.
Summary of the invention
In order to help management personnel understand and improve the online habits of the managed personnel, reduce the harmful effects of the Internet, and mine the general connections between the Internet data and the interests of the managed personnel, the browsing records of the managed personnel are comprehensively analyzed; using a document clustering algorithm based on the LDA model together with the K-means clustering algorithm, a network behavior habit clustering method based on K-means and LDA bi-directional verification is designed and implemented, providing a system model of good reference value for the analysis and management of the Internet behavior of the managed personnel.
To make the theoretical basis of the present patent easier to understand, the differences between the theory of the invention and the traditional theory are described as follows:
In traditional clustering methods, a single clustering analysis is usually applied to the initial data and the result is then verified manually. On the basis of the traditional approach, the invention creatively adopts two clustering methods, verifies the accuracy of the clustering algorithms through a self-defined verification method, and uses simulated annealing to improve the efficiency of optimizing the clustering result.
The technical solution of the invention is as follows: using the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm, K-means clustering is first applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set; the intermediate results of the computation are stored; the annealing algorithm is then used to verify K-means and LDA against each other and compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The method comprises the simulated annealing main flow, steps A, and the cost function processing flow, steps B:
Simulated annealing main flow, steps A1 to A26 (an illustrative code sketch follows this step list):
Step A1: Let the personnel-label-frequency set be PERSONLABELFREQ = {(PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa)}, where PERSONp1, PERSONp2, …, PERSONpa are unique personnel identifiers, LABELp1, LABELp2, …, LABELpa are attributes of the web content browsed by the personnel (one personnel identifier may correspond to several attributes), and FREQp1, FREQp2, …, FREQpa are the weights of those attributes; let the browsing record-personnel-keyword set be RECORDIDPERSONKEYWORD = {(RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra)}, where RECORDIDr1, RECORDIDr2, …, RECORDIDra are unique record identifiers composed of the personnel identifier and the browsing date, PERSONr1, PERSONr2, …, PERSONra are unique personnel identifiers, and KEYWORDr1, KEYWORDr2, …, KEYWORDra are keywords contained in the browsed web content; let ALPHA be the Dirichlet prior of the topic distribution, ETA the Dirichlet prior of the keyword distribution, ITERLDA the number of iterations of the LDA document topic extraction model, ITERKMEANS the number of iterations of the K-means clustering algorithm, CATEGORYNUM the total number of classification labels, TOPICNUM the total number of topics of the LDA document topic extraction model, T the simulated annealing temperature, STEP the simulated annealing change step size, and COOL the simulated annealing cooling parameter;
Step A2: Let the result set of the K-means clustering algorithm be the personnel-classification-label set KMEANSPERSONCATEGORY, where classification labels are represented by natural numbers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-personnel set of the LDA model be LDATOPICPERSON, where topics are represented by natural numbers; let FACTOR be the globally optimal topic-classification-label sequence, EA the global maximum matching number, vecb the current topic-classification-label sequence, eb the current matching number, index the current subscript into the topic-classification-label sequence, and curstep the current simulated annealing step;
Step A3: Call the K-means clustering tool, passing in the classification-label total CATEGORYNUM, the K-means iteration count ITERKMEANS and the personnel-label-frequency set PERSONLABELFREQ from step A1, and obtain the K-means result as the personnel-classification-label set of step A2, i.e. KMEANSPERSONCATEGORY = {(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca)};
Step A4: Call the LDA modeling tool, passing in the topic total TOPICNUM, the topic-distribution Dirichlet prior ALPHA, the keyword-distribution Dirichlet prior ETA, the LDA iteration count ITERLDA and the browsing record-personnel-keyword set RECORDIDPERSONKEYWORD, and obtain the topic-keyword set of the LDA model, LDATOPICWORD = {(TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb)}, and the topic-personnel set of the LDA model, LDATOPICPERSON = {(TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc)}, both defined in step A2;
Step A5: Initialize the globally optimal topic-classification-label sequence FACTOR with random integers between 0 and CATEGORYNUM-1; its length is the topic total TOPICNUM and each element lies between 0 and CATEGORYNUM-1; initialize the global maximum matching number EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: While the simulated annealing temperature T from step A1 is greater than 0.1, perform steps A7 to A25; otherwise perform step A26;
Step A7: Assign the current subscript index a random integer between 0 and TOPICNUM-1, where TOPICNUM is the topic total from step A1;
Step A8: Assign the current step curstep a random value between -1 × STEP and STEP, where STEP is the change step size from step A1;
Step A9: Set the current topic-classification-label sequence vecb equal to the globally optimal sequence FACTOR from step A2, i.e. vecb = FACTOR;
Step A10: Change the value of vecb at position index by adding curstep, i.e. vecb[index] = vecb[index] + curstep;
Step A11: If vecb[index] < 0, perform step A12; otherwise perform step A13;
Step A12: Set vecb[index] = 0 and go to step A15;
Step A13: If vecb[index] > CATEGORYNUM-1, where CATEGORYNUM is the classification-label total from step A1, perform step A14; otherwise perform step A15;
Step A14: Set vecb[index] = CATEGORYNUM-1;
Step A15: Take the globally optimal topic-classification-label sequence FACTOR from step A2;
Step A16: Perform step B with FACTOR as input;
Step A17: Assign the result of step B to the global maximum matching number EA from step A2;
Step A18: Take the current topic-classification-label sequence vecb from step A2;
Step A19: Perform step B with vecb as input;
Step A20: Assign the result of step B to the current matching number eb from step A2;
Step A21: If the current matching number eb is greater than the global maximum matching number EA, i.e. eb > EA, perform step A22; otherwise perform step A25;
Step A22: Generate a random number random with value between 0 and 1;
Step A23: If random < e^((eb-EA)/T), where eb is the current matching number and EA is the global maximum matching number, perform step A24; otherwise perform step A25;
Step A24: Set the globally optimal topic-classification-label sequence FACTOR to the current sequence vecb and the global maximum matching number EA to the current matching number eb, i.e. FACTOR = vecb, EA = eb;
Step A25: Reduce the simulated annealing temperature T using the cooling parameter COOL from step A1, i.e. T = T × COOL, and return to step A6;
Step A26: Return the globally optimal topic-classification-label sequence FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM} and the global maximum matching number EA;
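For illustration, the main flow above (steps A5 to A26) can be condensed into the following sketch. It assumes that steps A3 and A4 have already produced the K-means and LDA results and that cost() implements the step-B flow below; the integer step size and the overflow guard on the exponent are assumptions of the sketch rather than part of the patented method:

    import math
    import random

    def anneal_topic_labels(TOPICNUM, CATEGORYNUM, cost, T=100.0, STEP=2, COOL=0.9):
        # Step A5: random initial global topic -> classification-label sequence.
        FACTOR = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]
        EA = 0                                          # global maximum matching number
        while T > 0.1:                                  # step A6
            index = random.randint(0, TOPICNUM - 1)     # step A7
            curstep = random.randint(-STEP, STEP)       # step A8 (integer step assumed)
            vecb = list(FACTOR)                         # step A9
            # Steps A10 to A14: perturb one position and clamp to [0, CATEGORYNUM-1].
            vecb[index] = min(max(vecb[index] + curstep, 0), CATEGORYNUM - 1)
            EA = cost(FACTOR)                           # steps A15 to A17
            eb = cost(vecb)                             # steps A18 to A20
            # Steps A21 to A24: keep the candidate when eb > EA and
            # random < e^((eb-EA)/T); the exponent is capped to avoid overflow.
            if eb > EA and random.random() < math.exp(min((eb - EA) / T, 60.0)):
                FACTOR, EA = vecb, eb
            T *= COOL                                   # step A25
        return FACTOR, EA                               # step A26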
Cost function processing flow, steps B1 to B15 (an illustrative code sketch follows this step list):
Step B1: Obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: Let the personnel-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the set of unique personnel identifiers be LDAPERSON, the total number of personnel be LDAPERSONNUM, the set of all classification labels be CATEGORY, the matching number be SUM, the topic set of a single person be singlepersontopic, the personnel set of the current classification label in the LDA document topic extraction model be ldacurcategoryperson, the personnel set of the current classification label in the K-means clustering result be kmeanscurcategoryperson, the set of overlapping personnel be unionperson, and the number of overlapping personnel be unionpersonnum;
Step B3: From the topic-personnel set LDATOPICPERSON of the LDA model in step A2, filter out the set of unique personnel identifiers of step B2 and remove duplicates, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: From the K-means personnel-classification-label set KMEANSPERSONCATEGORY of step A2, filter out the set of all classification labels of step B2 and remove duplicates, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
Step B5: Let i be a loop variable with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel from step B2;
Step B6: From the topic-personnel set LDATOPICPERSON of step A2, filter out the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is from step B2;
Step B7: For each topic in the topic set singlepersontopic of LDAPERSON_i, look up the corresponding classification label in the sequence TMPFACTOR from step B1, where the subscript of TMPFACTOR denotes the topic and the value at that subscript denotes the classification label of that topic, i.e. categoryt1 = TMPFACTOR[TOPICt1], categoryt2 = TMPFACTOR[TOPICt2], …, categorytc = TMPFACTOR[TOPICtc] (different variables may denote the same classification label); count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the most frequently occurring classification label category; update the personnel-classification-label set of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: If the loop variable i exceeds LDAPERSONNUM, perform step B9; otherwise set i = i + 1 and repeat steps B6 to B7;
Step B9: Let j be a loop variable with j < CATEGORYNUM, where CATEGORYNUM is the classification-label total from step A1, and initialize the matching number SUM of step B2 to 0, i.e. SUM = 0;
Step B10: From the K-means personnel-classification-label set KMEANSPERSONCATEGORY of step A2, filter out the personnel set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc};
Step B11: From the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, filter out the personnel set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac};
Step B12: Compute the intersection unionperson of the personnel sets ldacurcategoryperson and kmeanscurcategoryperson, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: Count the number of elements of the overlapping personnel set unionperson, assign it to unionpersonnum, and add it to the matching number SUM, i.e. SUM = SUM + unionpersonnum;
Step B14: If the loop variable j exceeds the classification-label total CATEGORYNUM from step A1, perform step B15; otherwise set j = j + 1 and repeat steps B10 to B13;
Step B15: Return the matching number SUM.
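A sketch of this cost function (steps B1 to B15) follows; the plain Python data structures used here for the K-means and LDA results are illustrative assumptions, not the patent's storage format:

    from collections import Counter

    def matching_number(tmpfactor, lda_topic_person, kmeans_person_category, CATEGORYNUM):
        # tmpfactor: topic -> classification-label sequence (list indexed by topic).
        # lda_topic_person: list of (topic, person) pairs from the LDA model.
        # kmeans_person_category: dict person -> classification label from K-means.

        # Steps B3 and B5 to B8: give each person the label implied most often by its topics.
        person_topics = {}
        for topic, person in lda_topic_person:
            person_topics.setdefault(person, []).append(topic)
        lda_person_category = {
            person: Counter(tmpfactor[t] for t in topics).most_common(1)[0][0]
            for person, topics in person_topics.items()
        }

        # Steps B9 to B14: for every label, count persons placed there by both methods.
        SUM = 0
        for j in range(CATEGORYNUM):
            kmeans_set = {p for p, c in kmeans_person_category.items() if c == j}
            lda_set = {p for p, c in lda_person_category.items() if c == j}
            SUM += len(kmeans_set & lda_set)            # steps B12 and B13
        return SUM                                      # step B15

In the main-flow sketch above, cost could be bound to this function with the clustering results fixed, for example cost = lambda seq: matching_number(seq, lda_topic_person, kmeans_person_category, CATEGORYNUM).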
Here, clustering means applying both the K-means clustering algorithm and the LDA model to the personnel browsing records for cluster analysis, then verifying the two clustering results against each other, and using simulated annealing to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Steps A3 and A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A14 change the value at a random position of the current solution sequence and clamp it to the valid range; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the topic-personnel set LDATOPICPERSON of the LDA model from step A2, producing the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 count, for each classification label, the personnel that appear in both the K-means result and the LDA result, accumulate these counts over all labels, and finally return the total as the cost of the current sequence; steps A21 to A24 compare eb with EA and compare the random number random with e^((eb-EA)/T), where eb is the current matching number from step A2 and EA is the global maximum matching number from step A2; when eb > EA and random < e^((eb-EA)/T), the globally optimal topic-classification-label sequence FACTOR and the global maximum matching number EA are updated, the values of eb and EA being obtained from step B; the final result returns the global maximum matching number EA and the globally optimal topic-classification-label sequence FACTOR.
In the present embodiment, the K-means iteration count ITERKMEANS in step A3 is 300, the classification-label total CATEGORYNUM is 3, the topic-distribution Dirichlet prior ALPHA in step A4 is 0.1, the keyword-distribution Dirichlet prior ETA is 0.01, the LDA iteration count ITERLDA is 2000, and the topic total TOPICNUM of the LDA document topic extraction model in step A1 is 20.
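For convenience, these embodiment parameters can be collected in one place (a plain restatement of the values above, not additional settings):

    # Embodiment parameter values restated from the description above.
    PARAMS = dict(
        ITERKMEANS=300,   # K-means iteration count (step A3)
        CATEGORYNUM=3,    # total number of classification labels
        ALPHA=0.1,        # Dirichlet prior of the topic distribution (step A4)
        ETA=0.01,         # Dirichlet prior of the keyword distribution
        ITERLDA=2000,     # LDA iteration count
        TOPICNUM=20,      # total number of LDA topics (step A1)
    )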
The invention creatively uses the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm: K-means clustering is first applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set, the intermediate results are stored, and the annealing algorithm is then used to verify K-means and LDA against each other and compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Brief description of the drawings
Figure 1 shows the main flow of the simulated annealing algorithm.
Figure 2 shows the processing flow of the cost function.
Detailed description of the invention
The technical solution of the invention is described in detail below with reference to the accompanying drawings:
As shown in Figure 1, simulated annealing main flow steps A1 to A26 are carried out as set out in the Summary of the invention above; as shown in Figure 2, cost function processing flow steps B1 to B15 are carried out as set out in the Summary of the invention above, with the parameter values listed at the end of the Summary (ITERKMEANS = 300, CATEGORYNUM = 3, ALPHA = 0.1, ETA = 0.01, ITERLDA = 2000, TOPICNUM = 20).
To better illustrate the effectiveness of the method, the web browsing records of 5153 students were first preprocessed into the data format: student record number, student number, keyword 1, keyword 2, …, keyword n, giving 11167 records in total. Using the LDA document topic extraction model, each student's records were abstracted into a document and the initial number of topics was set to 20. The keywords in the students' browsing records were quantified through a third-party classification corpus and finally grouped into 3 categories by the K-means algorithm. Manual analysis of the concrete meaning of the LDA topics and of the separate categories, compared against the K-means clustering result, was able to determine the concrete category of 3149 students, or 61.11% of the total; the method of this experiment was able to determine the concrete category of 3610 students, or 70.06% of the total, an improvement of 8.95% over manual analysis.
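As a quick arithmetic check of the reported percentages (using only the figures stated above):

    # Checking the reported accuracy figures against the stated counts.
    total = 5153
    print(round(3149 / total * 100, 2))   # manual analysis:  61.11
    print(round(3610 / total * 100, 2))   # proposed method:  70.06
    print(round(70.06 - 61.11, 2))        # improvement:       8.95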
The invention can be combined with a computer system so as to complete the clustering of personnel network behavior habits automatically.
As summarized above, the invention creatively uses the web-page attributes, keywords and frequencies in personnel browsing records together with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm; the bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Claims (4)

1. network behavior based on a K-means and LDA bi-directional verification custom clustering method, it is characterised in that utilize personnel Webpage attribute, key word and frequency in internet records, calculates in conjunction with K-means algorithm, LDA document subject matter extraction model and annealing Method, first personnel-label-frequency set, personnel are browsed record-personnel-keyword set carry out K-means algorithm cluster and LDA document subject matter extraction model generates, and storage calculates intermediate object program, uses annealing algorithm to be carried out by K-means and LDA afterwards double To checking, calculate the optimal theme-tag along sort sequence of the overall situation, optimize the result of network behavior custom cluster on this basis;Its In, comprise simulated annealing main flow step A and cost function process flow steps B:
Simulated annealing main flow step A1 is to step A26:
Step A1: set personnel-label-frequency set as PERSONLABELFREQ={ (PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa), wherein, PERSONp1, PERSONp2, …, PERSONpaThe personnel that represent uniquely identify, LABELp1, LABELp2, …, LABELpaGeneration Table personnel surf the web the integrity attribute of content, personnel uniquely identify can corresponding multiple attributes, FREQp1,FREQp2, …, FREQpaThe personnel of representative surf the web the weight of integrity attribute of content, if personnel surf the web, record-personnel-keyword set is RECORDIDPERSONKEYWORD={(RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra), wherein, RECORDIDr1, RECORDIDr2, …, RECORDIDraRepresent personnel record uniquely to identify, personnel uniquely identify and the date of surfing the Net forms, PERSONr1, PERSONr2, …, PERSONraThe personnel that represent uniquely identify, KEYWORDr1,KEYWORDr2, …, KEYWORDraRepresentative Member surfs the web the key word that content comprises, if the Di Li Cray parameter of theme distribution is ALPHA, if the Di Li of key word distribution Cray parameter is ETA, if LDA document subject matter extraction model iterations is ITERLDA, if K-means clustering algorithm iteration time Number is ITERKMEANS, if tag along sort sum is CATEGORYNUM, if the theme sum of LDA document subject matter extraction model is TOPICNUM, if simulated annealing temperature is T, if it is STEP that simulated annealing changes step-length, if simulated annealing is cold But parameter is COOL;
Step A2: set the result set of K-means clustering algorithm as personnel-tag along sort collection, i.e. be expressed as KMEANSPERSONCATEGORY, wherein, tag along sort is represented by natural integer;If the theme-keyword set of LDA model is LDATOPICWORD, wherein, theme is represented by natural integer;If the theme-personnel of LDA model integrate as LDATOPICPERSON, Wherein, theme is represented by natural integer;If the optimal theme-tag along sort sequence of the overall situation is FACTOR, if overall situation maximum matching number For EA, if current topic-tag along sort sequence is vecb, if current matching number is eb, if theme-tag along sort sequence is when front lower It is designated as index, if simulated annealing current step is curstep;
Step A3: call K-means clustering algorithm instrument, the tag along sort sum CATEGORYNUM in incoming step A1, step K-means clustering algorithm iterations ITERKMEANS in A1 and the personnel-label-frequency set in step A1 PERSONLABELFREQ, obtaining K-means clustering algorithm result set is personnel-tag along sort collection, i.e. obtain KMEANSPERSONCATEGORY={(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca), wherein, KMEANSPERSONCATEGORY is from step A2;
Step A4: call LDA modeling tool, the theme sum TOPICNUM of incoming LDA document subject matter extraction model, theme distribution Di Li Cray parameter ALPHA, key word distribution Di Li Cray parameter ETA, LDA document subject matter extraction model iterations ITERLDA and personnel surf the web record-personnel-keyword set RECORDIDPERSONKEYWORD, obtain LDA model Theme-keyword set, i.e. obtain DATOPICWORD={ (TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb) and theme-personnel's collection of LDA model, i.e. LDATOPICPERSON={ (TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc), wherein, LDATOPICWORD and LDATOPICPERSON is from step A2;
Step A5: initialize the global optimal topic-classification-label sequence FACTOR with random integers between 0 and CATEGORYNUM-1; the length of the sequence equals the total number of topics TOPICNUM of the LDA document topic extraction model, and every element of the sequence lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels; initialize the global maximum match count EA to 0, i.e. FACTOR = {FACTOR1, FACTOR2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: while the simulated annealing temperature T from step A1 is greater than 0.1, perform steps A7 to A25; otherwise perform step A26;
Step A7: assign a random number to the current subscript index of the topic-classification-label sequence from step A2; the random number ranges between 0 and TOPICNUM-1, where TOPICNUM is the total number of topics of the LDA document topic extraction model in step A1;
Step A8: assign a random number to the current simulated annealing step size curstep from step A2; the random number ranges between -STEP and STEP, where STEP is the simulated annealing step change from step A1;
Step A9: set the current topic-classification-label sequence vecb from step A2 equal to the global optimal topic-classification-label sequence FACTOR from step A2, i.e. vecb = FACTOR;
Step A10: change the value of the current topic-classification-label sequence vecb from step A2 at position index by adding curstep, where index is the current subscript from step A2 and curstep is the current simulated annealing step size from step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: when the value of vecb from step A2 at position index is less than 0, i.e. vecb_index < 0, perform step A12; otherwise perform step A13;
Step A12: set the value of vecb from step A2 at position index to 0, i.e. vecb_index = 0; go to step A15;
Step A13: when the value of vecb from step A2 at position index is greater than CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels from step A1, i.e. vecb_index > CATEGORYNUM-1, perform step A14; otherwise perform step A15;
Step A14: set the value of vecb from step A2 at position index to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels from step A1;
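As an illustrative aside, steps A7 to A14 amount to the following perturbation of the current solution; the function name and its arguments are assumptions that mirror the symbols above.

```python
# A sketch of steps A7-A14: pick a random topic position, shift its label by a
# random step, then clamp the result into the valid range [0, CATEGORYNUM-1].
import random

def perturb(FACTOR, CATEGORYNUM, STEP):
    vecb = list(FACTOR)                          # step A9: copy the current global best
    index = random.randint(0, len(vecb) - 1)     # step A7: random position in [0, TOPICNUM-1]
    curstep = random.randint(-STEP, STEP)        # step A8: random step in [-STEP, STEP]
    vecb[index] += curstep                       # step A10
    vecb[index] = max(0, min(vecb[index], CATEGORYNUM - 1))  # steps A11-A14: clamp
    return vecb
```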
Step A15: obtain the global optimal topic-classification-label sequence FACTOR from step A2;
Step A16: perform step B;
Step A17: assign the result of step B to the global maximum match count EA from step A2;
Step A18: obtain the current topic-classification-label sequence vecb from step A2;
Step A19: perform step B;
Step A20: assign the result of step B to the current match count eb from step A2;
Step A21: when the current match count eb from step A2 is greater than the global maximum match count EA from step A2, i.e. eb > EA, perform step A22; otherwise perform step A25;
Step A22: generate a random number random whose value lies between 0 and 1;
Step A23: when the random number random from step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current match count from step A2 and EA is the global maximum match count from step A2, perform step A24; otherwise perform step A25;
Step A24: set the global optimal topic-classification-label sequence FACTOR from step A2 to the current topic-classification-label sequence vecb from step A2, and set the global maximum match count EA from step A2 to the current match count eb from step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: lower the simulated annealing temperature T from step A1 using the simulated annealing cooling parameter COOL from step A1, i.e. T = T × COOL, then perform step A6;
Step A26: return the global optimal topic-classification-label sequence from step A2, i.e. FACTOR = {FACTOR1, FACTOR2, …, FACTOR_TOPICNUM}, and return the global maximum match count EA from step A2;
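Taken together, steps A5 to A26 form the simulated-annealing loop sketched below; this is a minimal sketch, not part of the claims. Here cost() stands for the matching procedure of step B, the default values of T, COOL and STEP are placeholders for the parameters fixed in step A1, and the acceptance rule follows the text as written (a candidate must both exceed the current best and pass the e^((eb-EA)/T) test).

```python
# A self-contained sketch of the annealing flow in steps A5-A26.
import math
import random

def anneal(cost, TOPICNUM, CATEGORYNUM, STEP=1, T=10000.0, COOL=0.98):
    # cost: callable mapping a topic-to-label sequence to the match count of step B
    FACTOR = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]  # step A5
    EA = 0
    while T > 0.1:                                          # step A6
        # Steps A7-A14: perturb one random position of the current best and clamp it
        vecb = list(FACTOR)
        index = random.randint(0, TOPICNUM - 1)
        vecb[index] = max(0, min(vecb[index] + random.randint(-STEP, STEP), CATEGORYNUM - 1))
        EA = cost(FACTOR)                                   # steps A15-A17
        eb = cost(vecb)                                     # steps A18-A20
        # Steps A21-A24: accept only when the candidate matches more persons and
        # also passes the annealing test, exactly as stated in the text.
        # The min(..., 700) only guards math.exp against float overflow; it does
        # not change the comparison, since exp(700) already dwarfs any random in [0, 1).
        if eb > EA and random.random() < math.exp(min((eb - EA) / T, 700)):
            FACTOR, EA = vecb, eb
        T *= COOL                                           # step A25: cool down
    return FACTOR, EA                                       # step A26
```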
The cost function process flow of step B comprises steps B1 to B15:
Step B1: obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: let the person-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY; let the unique person identifier set be LDAPERSON; let the total number of persons be LDAPERSONNUM; let the set of all classification labels be CATEGORY; let the match count be SUM; let the topic set corresponding to a single person be singlepersontopic; let the person set of the current class in the LDA document topic extraction model be ldacurcategoryperson; let the person set of the current class in the K-means clustering algorithm be kmeanscurcategoryperson; let the set of overlapping persons be unionperson; and let the number of overlapping persons be unionpersonnum;
Step B3: from the topic-person set LDATOPICPERSON of the LDA model in step A2, extract the unique person identifier set LDAPERSON of step B2 and deduplicate the result, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: from the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm in step A2, extract the set of all classification labels CATEGORY of step B2 and deduplicate the result, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
Step B5: let the loop variable be i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons in step B2;
Step B6: from the topic-person set LDATOPICPERSON of the LDA model in step A2, extract the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR from step B1, where a subscript of TMPFACTOR denotes a topic and the value at that subscript denotes the classification label of that topic, i.e. categoryt1 = TMPFACTOR_TOPICt1, categoryt2 = TMPFACTOR_TOPICt2, …, categorytc = TMPFACTOR_TOPICtc, where categoryt1, categoryt2, …, categorytc denote classification labels, different variables possibly denoting the same classification label, and singlepersontopic is defined in step B2; count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the classification label category with the highest occurrence count; update the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
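As an illustrative aside, the per-person majority vote of steps B6 and B7 can be sketched as follows; the function name and its arguments are assumptions mirroring the symbols above.

```python
# A sketch of steps B6-B7: map one person's topics through TMPFACTOR and keep
# the classification label that occurs most often for that person.
from collections import Counter

def person_label(singlepersontopic, TMPFACTOR):
    labels = [TMPFACTOR[topic] for topic in singlepersontopic]
    category, _ = Counter(labels).most_common(1)[0]
    return category
```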
Step B8: when the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons in step B2, perform step B9; otherwise increase i by 1, i.e. i = i + 1, and perform steps B6 to B7;
Step B9: let the loop variable be j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of classification labels in step A1, and set the match count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: from the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm in step A2, extract the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: from the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, extract the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the person set ldacurcategoryperson in step B2 and the person set kmeanscurcategoryperson in step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: count the number of persons in the overlapping person set unionperson of step B2, assign it to the number of overlapping persons unionpersonnum in step B2, and add it to the match count SUM in step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: when the loop variable j is greater than the total number of classification labels CATEGORYNUM in step A1, perform step B15; otherwise increase j by 1, i.e. j = j + 1, and perform steps B10 to B13;
Step B15: return the match count SUM of step B2.
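For illustration only, the whole cost flow B1 to B15 can be sketched as a single function; the argument names lda_topic_person and kmeans_person_category are assumptions standing in for LDATOPICPERSON and KMEANSPERSONCATEGORY.

```python
# A sketch of steps B1-B15: derive a person-to-label assignment from the LDA
# topics via TMPFACTOR, then count, label by label, how many persons the
# K-means assignment and the LDA-derived assignment place in the same class.
from collections import Counter

def cost(TMPFACTOR, lda_topic_person, kmeans_person_category, CATEGORYNUM):
    # lda_topic_person: (topic, person) pairs (LDATOPICPERSON)
    # kmeans_person_category: (person, label) pairs (KMEANSPERSONCATEGORY)
    kmeans_person_category = list(kmeans_person_category)
    person_topics = {}
    for topic, person in lda_topic_person:                     # steps B3, B5-B6
        person_topics.setdefault(person, []).append(topic)
    lda_person_category = {                                    # steps B7-B8
        person: Counter(TMPFACTOR[t] for t in topics).most_common(1)[0][0]
        for person, topics in person_topics.items()
    }
    SUM = 0                                                    # step B9
    for j in range(CATEGORYNUM):                               # steps B9-B14
        kmeans_set = {p for p, c in kmeans_person_category if c == j}    # step B10
        lda_set = {p for p, c in lda_person_category.items() if c == j}  # step B11
        SUM += len(kmeans_set & lda_set)                       # steps B12-B13
    return SUM                                                 # step B15
```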
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the clustering uses the K-means clustering algorithm and the LDA model to perform cluster analysis on persons' internet records, then verifies the two clustering results against each other and uses simulated annealing to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the topic-person set LDATOPICPERSON of the LDA model from step A2, thereby deriving the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 compare, class by class, the number of persons that appear simultaneously in the K-means clustering result and in the LDA model result, accumulate the number of such persons over all classes, and finally return this number within the flow as the cost of the current sequence; step A14 judges the relative size of eb and EA, and steps A15 to A18 judge the relative size of the random number random and e^((eb-EA)/T), where eb is the current match count in step A2 and EA is the global maximum match count in step A2; when eb > EA and random < e^((eb-EA)/T), the value of the global optimal topic-classification-label sequence FACTOR and the value of the global maximum match count EA are updated, the cost values eb and EA being obtained from step B described above; the final result returns the global maximum match count EA and the global optimal topic-classification-label sequence FACTOR.
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the K-means clustering algorithm iteration count ITERKMEANS in step A3 is 300, the total number of classification labels CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution in step A4 is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the LDA document topic extraction model iteration count ITERLDA is 2000, and the total number of topics TOPICNUM of the LDA document topic extraction model in step A1 is 20.
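For reference, the hyperparameters fixed by the last claim above can be grouped as below; the dictionary itself is only an illustrative way of collecting the claimed values and is not part of the claims.

```python
# Hyperparameter values taken from the claim above.
PARAMS = {
    "ITERKMEANS": 300,   # K-means iterations (step A3)
    "CATEGORYNUM": 3,    # number of classification labels
    "ALPHA": 0.1,        # Dirichlet prior of the topic distribution (step A4)
    "ETA": 0.01,         # Dirichlet prior of the keyword distribution
    "ITERLDA": 2000,     # LDA iterations
    "TOPICNUM": 20,      # number of LDA topics (step A1)
}
```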
CN201610565749.XA 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification Active CN106202480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Publications (2)

Publication Number Publication Date
CN106202480A true CN106202480A (en) 2016-12-07
CN106202480B CN106202480B (en) 2019-06-11

Family

ID=57493136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610565749.XA Active CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Country Status (1)

Country Link
CN (1) CN106202480B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090290508A1 (en) * 2008-05-22 2009-11-26 At&T Labs, Inc. Method for optimizing network "Point of Presence" locations
CN102194012A (en) * 2011-06-17 2011-09-21 清华大学 Microblog topic detecting method and system
CN102609719A (en) * 2012-01-19 2012-07-25 北京工业大学 Method for identifying place image on the basis of improved probabilistic topic model
CN103632166A (en) * 2013-12-04 2014-03-12 西安电子科技大学 Aurora image classification method based on latent theme combining with saliency information
CN103678500A (en) * 2013-11-18 2014-03-26 南京邮电大学 Data mining improved type K mean value clustering method based on linear discriminant analysis
CN103793501A (en) * 2014-01-20 2014-05-14 惠州学院 Theme community discovery method based on social network
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104750856A (en) * 2015-04-16 2015-07-01 天天艾米(北京)网络科技有限公司 System and method for multi-dimensional synergic recommendation
CN105303199A (en) * 2015-12-08 2016-02-03 南京信息工程大学 Data fragment type identification method based on content characteristics and K-means
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MARZUKI KHALID ET AL.: "Tropical wood species recognition system based on multi-feature extractors and classifiers", 2011 2nd International Conference on Instrumentation Control and Automation *
唐晓波 et al.: "Research on a Microblog Recommendation Model Based on Latent Dirichlet Allocation", 《情报科学》 (Information Science) *
杨赛 et al.: "An Image Classification Algorithm Based on the Latent Dirichlet Allocation Model", 《计算机工程》 (Computer Engineering) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN107305614B (en) * 2017-08-12 2020-05-26 西安电子科技大学 Method for processing big data based on MLDM algorithm meeting secondary aggregation
CN107305614A (en) * 2017-08-12 2017-10-31 西安电子科技大学 A kind of method based on the MLDM algorithm process big datas for meeting Second Aggregation
CN108460630A (en) * 2018-02-12 2018-08-28 广州虎牙信息科技有限公司 The method and apparatus for carrying out classification analysis based on user data
CN110276503A (en) * 2018-03-14 2019-09-24 吉旗物联科技(上海)有限公司 A kind of method of automatic identification cold chain vehicle task
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN108897815A (en) * 2018-06-20 2018-11-27 淮阴工学院 A kind of multi-tag file classification method based on similarity model and FastText
CN112800419A (en) * 2019-11-13 2021-05-14 北京数安鑫云信息技术有限公司 Method, apparatus, medium and device for identifying IP group
CN112214515A (en) * 2020-10-16 2021-01-12 平安国际智慧城市科技股份有限公司 Data automatic matching method and device, electronic equipment and storage medium
CN112883154A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text topic mining method and device, computer equipment and storage medium
CN113204641A (en) * 2021-04-12 2021-08-03 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113204641B (en) * 2021-04-12 2022-09-02 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113312450A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN113312450B (en) * 2021-05-28 2022-05-31 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN114742869A (en) * 2022-06-15 2022-07-12 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment
CN114742869B (en) * 2022-06-15 2022-08-16 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment

Also Published As

Publication number Publication date
CN106202480B (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN106202480B (en) A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification
Zhang et al. A multi-objective evolutionary approach for mining frequent and high utility itemsets
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN108090607A (en) A kind of social media user&#39;s ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN104933622A (en) Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme
CN101770520A (en) User interest modeling method based on user browsing behavior
CN103870474A (en) News topic organizing method and device
CN103034687B (en) A kind of relating module recognition methodss based on 2 class heterogeneous networks
CN111724039B (en) Recommendation method for recommending customer service personnel to power users
Chen et al. Research on location fusion of spatial geological disaster based on fuzzy SVM
CN111191099B (en) User activity type identification method based on social media
CN103353880A (en) Data mining method adopting dissimilarity degree clustering and association
CN105205163B (en) A kind of multi-level two sorting technique of the incremental learning of science and technology news
CN104077723A (en) Social network recommending system and social network recommending method
Liu et al. A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge
CN106776859A (en) Mobile solution App commending systems based on user preference
Sharma et al. Trend analysis in machine learning research using text mining
CN103778206A (en) Method for providing network service resources
Naghavipour et al. Hybrid metaheuristics for QoS-aware service composition: A systematic mapping study
Yu et al. Spectrum-enhanced pairwise learning to rank
CN106919997A (en) A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA
Kim et al. Collaborative filtering with a user-item matrix reduction technique
Guan et al. Customer load forecasting method based on the industry electricity consumption behavior portrait
Liu et al. Wheel hub customization with an interactive artificial immune algorithm
CN106202498A (en) A kind of network behavior custom quantization method based on classification corpus key word word frequency record association

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant