CN106202480A - Network behavior habit clustering method based on K-means and LDA bi-directional verification - Google Patents

Network behavior habit clustering method based on K-means and LDA bi-directional verification

Info

Publication number
CN106202480A
CN106202480A (application CN201610565749.XA; granted as CN106202480B)
Authority
CN
China
Prior art keywords
personnel
classification label
topic
LDA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610565749.XA
Other languages
Chinese (zh)
Other versions
CN106202480B (en)
Inventor
朱全银
辛诚
李翔
许康
潘舒新
孙青怡
周泓
严云洋
胡荣林
冯万利
王留洋
王海云
袁媛
唐海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201610565749.XA priority Critical patent/CN106202480B/en
Publication of CN106202480A publication Critical patent/CN106202480A/en
Application granted granted Critical
Publication of CN106202480B publication Critical patent/CN106202480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network behavior habit clustering method based on bi-directional verification of K-means and LDA. The invention uses the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and a simulated annealing algorithm. First, K-means clustering is applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set, and the intermediate results of the computation are stored. The annealing algorithm is then used to verify K-means and LDA against each other and to compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Description

Network behavior habit clustering method based on K-means and LDA bi-directional verification
Technical field
The invention belongs to the field of cluster analysis and optimization algorithms, and in particular relates to a network behavior habit clustering method based on K-means and LDA bi-directional verification, which is used to optimize clustering results, thereby improving clustering accuracy and increasing the usage value of personnel browsing-record information.
Background technology
Mastering clustering methods for network behavior habit data is of great importance for research on online habits. With the continuing spread of the Internet, more and more people choose to obtain information of interest through the network. The amount of content browsed by personnel online is huge; relying on manual analysis of these data is not only inefficient but also of limited accuracy. Cluster analysis, combined with bi-directional verification against a second clustering method, can improve both the efficiency and the accuracy of the analysis. Common clustering algorithms include K-means clustering and the LDA document topic extraction model; common optimization algorithms include simulated annealing and genetic algorithms.
Related work on clustering and optimization algorithms includes: Pang Feng. The principle of simulated annealing and the application of the algorithm to optimization problems. Master's thesis, Jilin University, 2006; Li Xiangping, Zhang Hongyang. Principle and improvement of simulated annealing. Software Guide, 2008(4): 47-48; Yang Mengduo, Li Fanchang, Zhang Li. Ten years of progress in Lie-group machine learning. Chinese Journal of Computers, 2015(7): 1337-1356; Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, Vol.3: 993-1022; Yuan J, Gao F, Ho Q, et al. LightLDA: Big Topic Models on Modest Computer Clusters. Proceedings of the International Conference on World Wide Web, ACM, 2015. The existing research of Zhu Quanyin et al. includes: Li Xiang, Zhu Quanyin. Collaborative filtering recommendation combining clustering and rating-matrix sharing. Computer Science and Exploration, 2014, Vol.8(6): 751-759; Suqun Cao, Quanyin Zhu, Zhiwei Hou. Customer Segmentation Based on a Novel Hierarchical Clustering Algorithm. 2009, p:1-5; Quanyin Zhu, Sunqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Suqun Cao, Zhiwei Hou, Liuyang Wang, Quanyin Zhu. Kernelized Fuzzy Fisher Criterion based Clustering Algorithm. DCABES 2010, p:87-91; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093; Suqun Cao, Gelan Yang, Quanyin Zhu, Haihei Zhai. A novel feature extraction method for mechanical part recognition. Applied Mechanics and Materials, 2011, p:116-121; Pei Zhou, Quanyin Zhu. Multi-factor Matching Method for Basic Information of Science and Technology Experts Based on Web Mining. 2012, p:718-720; Jianping Deng, Fengwen Cao, Quanyin Zhu, Yu Zhang. The Web Data Extracting and Application for Shop Online Based on Commodities Classified. Communications in Computer and Information Science, Vol.234(4): 120-128; Hui Zong, Quanyin Zhu, Ming Sun, Yahong Zhang. The case study for human resource management research based on web mining and semantic analysis. Applied Mechanics and Materials, Vol.488, 2014, p:1336-1339. Patents applied for, published or granted by Zhu Quanyin et al. include: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and adaptive windowing. Chinese patent ZL 2011 1 0423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price forecasting method based on dichotomy backfilling and disturbance factors. Chinese patent ZL 2011 1 0422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Chen Ting, Cao Suqun. A data preprocessing method for multi-variety commodity price forecasting based on neural networks. Chinese patent ZL 2012 1 0325368.6, 2016.06.08; Zhu Quanyin, Pan Lu, Liu Wenru, Li Xiang, Zhou Hong, Hu Ronglin, Ding Jin, Jin Ying, Shao Wujie, Tang Haibo. An incremental-learning multi-level binary classification method for science and technology news. Chinese patent publication CN 105205163A, 2015.12.30; Zhu Quanyin, Yan Yunyang, Huang Taoyi, Zhang Yuyang, Xin Cheng, et al. An implementation method for campus personalized palm services and user behavior habit analysis. Chinese patent publication CN 104731971A, 2015.06.24; Zhu Quanyin, Shen Enqiang, Qian Yaping, Zhou Hong, et al. A K-means-based multi-weight adaptive analysis method for students' learning behavior. Chinese patent application 201610222553.0, 2016.04.13; Zhu Quanyin, Shao Wujie, Tang Haibo, Zhou Hong, Li Xiang, Hu Ronglin, Jin Ying, Cao Suqun, Pan Shuxin. A multi-level multi-classification method for science and technology news headlines. Chinese patent publication CN 105205163A, 2016.07.13; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. An intelligent recommendation method for cold-chain logistics stowage based on spectral clustering. Chinese patent publication CN 105654267A, 2016.06.08.
LDA document topic extraction model:
LDA (Latent Dirichlet Allocation) is a document topic generation model, also known as a three-layer Bayesian probabilistic model, comprising a word, topic and document structure. It is a generative model: every word of every article is assumed to be produced by the process "choose a topic with some probability, then choose a word from that topic with some probability". Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in large document collections or corpora. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text into numerical information that is easy to model. The bag-of-words approach ignores the order between words, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words.
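The following minimal sketch is illustrative only and is not the patent's implementation; it assumes scikit-learn and a few invented keyword documents, and maps the parameters named later in this description (ALPHA, ETA, ITERLDA, TOPICNUM) onto the corresponding scikit-learn arguments:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical browsing records: one "document" of keywords per record.
    docs = ["sports football match score",
            "exam revision math homework",
            "sports basketball team training"]

    X = CountVectorizer().fit_transform(docs)              # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2,        # TOPICNUM
                                    doc_topic_prior=0.1,   # ALPHA
                                    topic_word_prior=0.01, # ETA
                                    max_iter=50,           # ITERLDA
                                    random_state=0)
    doc_topic = lda.fit_transform(X)   # per-document topic distribution
    topic_word = lda.components_       # per-topic word weights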
K-means clustering algorithm:
K-means originated as a vector quantization method in signal processing and is now popular as a clustering method in data mining. The purpose of k-means clustering is to partition n points (observations or samples) into k clusters such that each point belongs to the cluster whose mean (the cluster center) is nearest to it; this serves as the clustering criterion. The problem amounts to partitioning the data space into Voronoi cells. It is computationally hard (NP-hard), but efficient heuristic algorithms exist. The commonly used heuristics converge quickly to a locally optimal solution. They are broadly similar to the expectation-maximization (EM) algorithm for Gaussian mixture models in that both refine cluster centers iteratively and use cluster centers to model the data; however, k-means tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.
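A correspondingly minimal K-means call, again an illustrative assumption using scikit-learn rather than the patent's own tooling, with an invented person-by-label frequency matrix and the CATEGORYNUM and ITERKMEANS parameters defined later:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical person x label-frequency matrix (one row per person).
    freq = np.array([[5.0, 0.0, 1.0],
                     [0.0, 4.0, 2.0],
                     [6.0, 1.0, 0.0],
                     [0.0, 5.0, 1.0],
                     [1.0, 0.0, 7.0],
                     [0.0, 1.0, 6.0]])

    km = KMeans(n_clusters=3,     # CATEGORYNUM
                max_iter=300,     # ITERKMEANS
                n_init=10,
                random_state=0)
    labels = km.fit_predict(freq)   # one classification label (0, 1 or 2) per person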
Simulated annealing:
The idea of simulated annealing (SA) was first proposed by N. Metropolis et al. in 1953; in 1983, S. Kirkpatrick et al. successfully introduced the annealing idea into the field of combinatorial optimization. It is a stochastic optimization algorithm based on a Monte Carlo iterative strategy, whose starting point is the similarity between the annealing process of solid matter in physics and general combinatorial optimization problems. Starting from a relatively high initial temperature and lowering the temperature parameter continuously, simulated annealing searches the solution space randomly for the global optimum of the objective function, using a probabilistic jump property that allows it to escape local optima and eventually tend towards the global optimum. It is a general-purpose optimization algorithm with a theoretical guarantee of probabilistic global optimization, and it has been widely applied in engineering, for example in VLSI design, production scheduling, control engineering, machine learning, neural networks and signal processing. In essence, simulated annealing is a serial optimization procedure that gives the search process a time-varying, eventually vanishing probability of jumping, so that it can effectively avoid being trapped in local minima and finally tend towards the global optimum.
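A generic skeleton of the acceptance rule described above might look as follows; this is a textbook-style sketch, not the patent's specific main flow (which is given in the Summary of the invention below):

    import math
    import random

    def anneal(initial, cost, neighbour, T=1.0, cool=0.98, T_min=0.1):
        # Minimize cost() starting from 'initial'; neighbour() proposes a nearby solution.
        state, state_cost = initial, cost(initial)
        while T > T_min:
            candidate = neighbour(state)
            c = cost(candidate)
            # Always accept improvements; accept worse candidates with probability
            # exp(-(c - state_cost) / T), which shrinks as the temperature falls.
            if c < state_cost or random.random() < math.exp(-(c - state_cost) / T):
                state, state_cost = candidate, c
            T *= cool
        return state, state_cost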
Heuristic search:
Two basic goals of computer science are to find algorithms with provably good running time and with provably optimal or near-optimal solutions. A heuristic algorithm tries to deliver one or both of these goals at once: it can often find a very good solution, but there is no proof that it will never return a worse one; it can usually produce an answer in reasonable time, but there is no guarantee that it will always do so. In some special cases a heuristic may return a very poor answer or run very inefficiently, but the data that trigger such cases may hardly ever occur in the real world, so heuristic algorithms are commonly used to solve practical problems and usually obtain good answers within reasonable time. Common heuristic algorithms include ant colony optimization, genetic algorithms and simulated annealing.
Summary of the invention
In order to help management personnel understand and improve the online habits of the managed personnel, reduce the harmful effects of the Internet, and mine the general connections between the Internet data and the interests of the managed personnel, the browsing records of the managed personnel are comprehensively analyzed; using a document clustering algorithm based on the LDA model together with the K-means clustering algorithm, a network behavior habit clustering method based on K-means and LDA bi-directional verification is designed and implemented, providing a system model of good reference value for the analysis and management of the Internet behavior of the managed personnel.
To make the theoretical basis of the present patent easier to understand, the differences between the theory of the invention and the traditional theory are described as follows:
In traditional clustering methods, a single clustering analysis is usually applied to the initial data and the result is then verified manually. On the basis of the traditional approach, the invention creatively adopts two clustering methods, verifies the accuracy of the clustering algorithms through a self-defined verification method, and uses simulated annealing to improve the efficiency of optimizing the clustering result.
The technical solution of the invention is as follows: using the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and an annealing algorithm, K-means clustering is first applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set; the intermediate results of the computation are stored; the annealing algorithm is then used to verify K-means and LDA against each other and compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The method comprises the simulated annealing main flow, steps A, and the cost function processing flow, steps B:
Simulated annealing main flow, steps A1 to A26 (an illustrative code sketch follows this step list):
Step A1: Let the personnel-label-frequency set be PERSONLABELFREQ = {(PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa)}, where PERSONp1, PERSONp2, …, PERSONpa are unique personnel identifiers, LABELp1, LABELp2, …, LABELpa are attributes of the web content browsed by the personnel (one personnel identifier may correspond to several attributes), and FREQp1, FREQp2, …, FREQpa are the weights of those attributes; let the browsing record-personnel-keyword set be RECORDIDPERSONKEYWORD = {(RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra)}, where RECORDIDr1, RECORDIDr2, …, RECORDIDra are unique record identifiers composed of the personnel identifier and the browsing date, PERSONr1, PERSONr2, …, PERSONra are unique personnel identifiers, and KEYWORDr1, KEYWORDr2, …, KEYWORDra are keywords contained in the browsed web content; let ALPHA be the Dirichlet prior of the topic distribution, ETA the Dirichlet prior of the keyword distribution, ITERLDA the number of iterations of the LDA document topic extraction model, ITERKMEANS the number of iterations of the K-means clustering algorithm, CATEGORYNUM the total number of classification labels, TOPICNUM the total number of topics of the LDA document topic extraction model, T the simulated annealing temperature, STEP the simulated annealing change step size, and COOL the simulated annealing cooling parameter;
Step A2: Let the result set of the K-means clustering algorithm be the personnel-classification-label set KMEANSPERSONCATEGORY, where classification labels are represented by natural numbers; let the topic-keyword set of the LDA model be LDATOPICWORD and the topic-personnel set of the LDA model be LDATOPICPERSON, where topics are represented by natural numbers; let FACTOR be the globally optimal topic-classification-label sequence, EA the global maximum matching number, vecb the current topic-classification-label sequence, eb the current matching number, index the current subscript into the topic-classification-label sequence, and curstep the current simulated annealing step;
Step A3: Call the K-means clustering tool, passing in the classification-label total CATEGORYNUM, the K-means iteration count ITERKMEANS and the personnel-label-frequency set PERSONLABELFREQ from step A1, and obtain the K-means result as the personnel-classification-label set of step A2, i.e. KMEANSPERSONCATEGORY = {(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca)};
Step A4: Call the LDA modeling tool, passing in the topic total TOPICNUM, the topic-distribution Dirichlet prior ALPHA, the keyword-distribution Dirichlet prior ETA, the LDA iteration count ITERLDA and the browsing record-personnel-keyword set RECORDIDPERSONKEYWORD, and obtain the topic-keyword set of the LDA model, LDATOPICWORD = {(TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb)}, and the topic-personnel set of the LDA model, LDATOPICPERSON = {(TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc)}, both defined in step A2;
Step A5: Initialize the globally optimal topic-classification-label sequence FACTOR with random integers between 0 and CATEGORYNUM-1; its length is the topic total TOPICNUM and each element lies between 0 and CATEGORYNUM-1; initialize the global maximum matching number EA to 0, i.e. FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: While the simulated annealing temperature T from step A1 is greater than 0.1, perform steps A7 to A25; otherwise perform step A26;
Step A7: Assign the current subscript index a random integer between 0 and TOPICNUM-1, where TOPICNUM is the topic total from step A1;
Step A8: Assign the current step curstep a random value between -1 × STEP and STEP, where STEP is the change step size from step A1;
Step A9: Set the current topic-classification-label sequence vecb equal to the globally optimal sequence FACTOR from step A2, i.e. vecb = FACTOR;
Step A10: Change the value of vecb at position index by adding curstep, i.e. vecb[index] = vecb[index] + curstep;
Step A11: If vecb[index] < 0, perform step A12; otherwise perform step A13;
Step A12: Set vecb[index] = 0 and go to step A15;
Step A13: If vecb[index] > CATEGORYNUM-1, where CATEGORYNUM is the classification-label total from step A1, perform step A14; otherwise perform step A15;
Step A14: Set vecb[index] = CATEGORYNUM-1;
Step A15: Take the globally optimal topic-classification-label sequence FACTOR from step A2;
Step A16: Perform step B with FACTOR as input;
Step A17: Assign the result of step B to the global maximum matching number EA from step A2;
Step A18: Take the current topic-classification-label sequence vecb from step A2;
Step A19: Perform step B with vecb as input;
Step A20: Assign the result of step B to the current matching number eb from step A2;
Step A21: If the current matching number eb is greater than the global maximum matching number EA, i.e. eb > EA, perform step A22; otherwise perform step A25;
Step A22: Generate a random number random with value between 0 and 1;
Step A23: If random < e^((eb-EA)/T), where eb is the current matching number and EA is the global maximum matching number, perform step A24; otherwise perform step A25;
Step A24: Set the globally optimal topic-classification-label sequence FACTOR to the current sequence vecb and the global maximum matching number EA to the current matching number eb, i.e. FACTOR = vecb, EA = eb;
Step A25: Reduce the simulated annealing temperature T using the cooling parameter COOL from step A1, i.e. T = T × COOL, and return to step A6;
Step A26: Return the globally optimal topic-classification-label sequence FACTOR = {FACTOR_1, FACTOR_2, …, FACTOR_TOPICNUM} and the global maximum matching number EA;
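For illustration, the main flow above (steps A5 to A26) can be condensed into the following sketch. It assumes that steps A3 and A4 have already produced the K-means and LDA results and that cost() implements the step-B flow below; the integer step size and the overflow guard on the exponent are assumptions of the sketch rather than part of the patented method:

    import math
    import random

    def anneal_topic_labels(TOPICNUM, CATEGORYNUM, cost, T=100.0, STEP=2, COOL=0.9):
        # Step A5: random initial global topic -> classification-label sequence.
        FACTOR = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]
        EA = 0                                          # global maximum matching number
        while T > 0.1:                                  # step A6
            index = random.randint(0, TOPICNUM - 1)     # step A7
            curstep = random.randint(-STEP, STEP)       # step A8 (integer step assumed)
            vecb = list(FACTOR)                         # step A9
            # Steps A10 to A14: perturb one position and clamp to [0, CATEGORYNUM-1].
            vecb[index] = min(max(vecb[index] + curstep, 0), CATEGORYNUM - 1)
            EA = cost(FACTOR)                           # steps A15 to A17
            eb = cost(vecb)                             # steps A18 to A20
            # Steps A21 to A24: keep the candidate when eb > EA and
            # random < e^((eb-EA)/T); the exponent is capped to avoid overflow.
            if eb > EA and random.random() < math.exp(min((eb - EA) / T, 60.0)):
                FACTOR, EA = vecb, eb
            T *= COOL                                   # step A25
        return FACTOR, EA                               # step A26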
Cost function processing flow, steps B1 to B15 (an illustrative code sketch follows this step list):
Step B1: Obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: Let the personnel-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY, the set of unique personnel identifiers be LDAPERSON, the total number of personnel be LDAPERSONNUM, the set of all classification labels be CATEGORY, the matching number be SUM, the topic set of a single person be singlepersontopic, the personnel set of the current classification label in the LDA document topic extraction model be ldacurcategoryperson, the personnel set of the current classification label in the K-means clustering result be kmeanscurcategoryperson, the set of overlapping personnel be unionperson, and the number of overlapping personnel be unionpersonnum;
Step B3: From the topic-personnel set LDATOPICPERSON of the LDA model in step A2, filter out the set of unique personnel identifiers of step B2 and remove duplicates, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: From the K-means personnel-classification-label set KMEANSPERSONCATEGORY of step A2, filter out the set of all classification labels of step B2 and remove duplicates, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
Step B5: Let i be a loop variable with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of personnel from step B2;
Step B6: From the topic-personnel set LDATOPICPERSON of step A2, filter out the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is from step B2;
Step B7: For each topic in the topic set singlepersontopic of LDAPERSON_i, look up the corresponding classification label in the sequence TMPFACTOR from step B1, where the subscript of TMPFACTOR denotes the topic and the value at that subscript denotes the classification label of that topic, i.e. categoryt1 = TMPFACTOR[TOPICt1], categoryt2 = TMPFACTOR[TOPICt2], …, categorytc = TMPFACTOR[TOPICtc] (different variables may denote the same classification label); count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the most frequently occurring classification label category; update the personnel-classification-label set of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
Step B8: If the loop variable i exceeds LDAPERSONNUM, perform step B9; otherwise set i = i + 1 and repeat steps B6 to B7;
Step B9: Let j be a loop variable with j < CATEGORYNUM, where CATEGORYNUM is the classification-label total from step A1, and initialize the matching number SUM of step B2 to 0, i.e. SUM = 0;
Step B10: From the K-means personnel-classification-label set KMEANSPERSONCATEGORY of step A2, filter out the personnel set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc};
Step B11: From the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, filter out the personnel set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac};
Step B12: Compute the intersection unionperson of the personnel sets ldacurcategoryperson and kmeanscurcategoryperson, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: Count the number of elements of the overlapping personnel set unionperson, assign it to unionpersonnum, and add it to the matching number SUM, i.e. SUM = SUM + unionpersonnum;
Step B14: If the loop variable j exceeds the classification-label total CATEGORYNUM from step A1, perform step B15; otherwise set j = j + 1 and repeat steps B10 to B13;
Step B15: Return the matching number SUM.
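A sketch of this cost function (steps B1 to B15) follows; the plain Python data structures used here for the K-means and LDA results are illustrative assumptions, not the patent's storage format:

    from collections import Counter

    def matching_number(tmpfactor, lda_topic_person, kmeans_person_category, CATEGORYNUM):
        # tmpfactor: topic -> classification-label sequence (list indexed by topic).
        # lda_topic_person: list of (topic, person) pairs from the LDA model.
        # kmeans_person_category: dict person -> classification label from K-means.

        # Steps B3 and B5 to B8: give each person the label implied most often by its topics.
        person_topics = {}
        for topic, person in lda_topic_person:
            person_topics.setdefault(person, []).append(topic)
        lda_person_category = {
            person: Counter(tmpfactor[t] for t in topics).most_common(1)[0][0]
            for person, topics in person_topics.items()
        }

        # Steps B9 to B14: for every label, count persons placed there by both methods.
        SUM = 0
        for j in range(CATEGORYNUM):
            kmeans_set = {p for p, c in kmeans_person_category.items() if c == j}
            lda_set = {p for p, c in lda_person_category.items() if c == j}
            SUM += len(kmeans_set & lda_set)            # steps B12 and B13
        return SUM                                      # step B15

In the main-flow sketch above, cost could be bound to this function with the clustering results fixed, for example cost = lambda seq: matching_number(seq, lda_topic_person, kmeans_person_category, CATEGORYNUM).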
Here, clustering means applying both the K-means clustering algorithm and the LDA model to the personnel browsing records for cluster analysis, then verifying the two clustering results against each other, and using simulated annealing to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Steps A3 and A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A14 change the value at a random position of the current solution sequence and clamp it to the valid range; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the topic-personnel set LDATOPICPERSON of the LDA model from step A2, producing the personnel-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 count, for each classification label, the personnel that appear in both the K-means result and the LDA result, accumulate these counts over all labels, and finally return the total as the cost of the current sequence; steps A21 to A24 compare eb with EA and compare the random number random with e^((eb-EA)/T), where eb is the current matching number from step A2 and EA is the global maximum matching number from step A2; when eb > EA and random < e^((eb-EA)/T), the globally optimal topic-classification-label sequence FACTOR and the global maximum matching number EA are updated, the values of eb and EA being obtained from step B; the final result returns the global maximum matching number EA and the globally optimal topic-classification-label sequence FACTOR.
In the present embodiment, the K-means iteration count ITERKMEANS in step A3 is 300, the classification-label total CATEGORYNUM is 3, the topic-distribution Dirichlet prior ALPHA in step A4 is 0.1, the keyword-distribution Dirichlet prior ETA is 0.01, the LDA iteration count ITERLDA is 2000, and the topic total TOPICNUM of the LDA document topic extraction model in step A1 is 20.
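For convenience, these embodiment parameters can be collected in one place (a plain restatement of the values above, not additional settings):

    # Embodiment parameter values restated from the description above.
    PARAMS = dict(
        ITERKMEANS=300,   # K-means iteration count (step A3)
        CATEGORYNUM=3,    # total number of classification labels
        ALPHA=0.1,        # Dirichlet prior of the topic distribution (step A4)
        ETA=0.01,         # Dirichlet prior of the keyword distribution
        ITERLDA=2000,     # LDA iteration count
        TOPICNUM=20,      # total number of LDA topics (step A1)
    )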
The invention creatively uses the web-page attributes, keywords and frequencies in personnel browsing records, in combination with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm: K-means clustering is first applied to the personnel-label-frequency set and the LDA document topic extraction model is generated from the browsing record-personnel-keyword set, the intermediate results are stored, and the annealing algorithm is then used to verify K-means and LDA against each other and compute the globally optimal topic-classification-label sequence, on the basis of which the result of network behavior habit clustering is optimized. The bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
Brief description of the drawings
Figure 1 shows the main flow of the simulated annealing algorithm.
Figure 2 shows the processing flow of the cost function.
Detailed description of the invention
The technical solution of the invention is described in detail below with reference to the accompanying drawings:
As shown in Figure 1, simulated annealing main flow steps A1 to A26 are carried out as set out in the Summary of the invention above; as shown in Figure 2, cost function processing flow steps B1 to B15 are carried out as set out in the Summary of the invention above, with the parameter values listed at the end of the Summary (ITERKMEANS = 300, CATEGORYNUM = 3, ALPHA = 0.1, ETA = 0.01, ITERLDA = 2000, TOPICNUM = 20).
To better illustrate the effectiveness of the method, the web browsing records of 5153 students were first preprocessed into the data format: student record number, student number, keyword 1, keyword 2, …, keyword n, giving 11167 records in total. Using the LDA document topic extraction model, each student's records were abstracted into a document and the initial number of topics was set to 20. The keywords in the students' browsing records were quantified through a third-party classification corpus and finally grouped into 3 categories by the K-means algorithm. Manual analysis of the concrete meaning of the LDA topics and of the separate categories, compared against the K-means clustering result, was able to determine the concrete category of 3149 students, or 61.11% of the total; the method of this experiment was able to determine the concrete category of 3610 students, or 70.06% of the total, an improvement of 8.95% over manual analysis.
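As a quick arithmetic check of the reported percentages (using only the figures stated above):

    # Checking the reported accuracy figures against the stated counts.
    total = 5153
    print(round(3149 / total * 100, 2))   # manual analysis:  61.11
    print(round(3610 / total * 100, 2))   # proposed method:  70.06
    print(round(70.06 - 61.11, 2))        # improvement:       8.95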
The invention can be combined with a computer system so as to complete the clustering of personnel network behavior habits automatically.
As summarized above, the invention creatively uses the web-page attributes, keywords and frequencies in personnel browsing records together with the K-means algorithm, the LDA document topic extraction model and the annealing algorithm; the bi-directional verification of K-means and LDA improves sensitivity to personnel classification labels, and the annealing algorithm improves the efficiency of optimizing the clustering result, thereby improving clustering accuracy.

Claims (4)

1. network behavior based on a K-means and LDA bi-directional verification custom clustering method, it is characterised in that utilize personnel Webpage attribute, key word and frequency in internet records, calculates in conjunction with K-means algorithm, LDA document subject matter extraction model and annealing Method, first personnel-label-frequency set, personnel are browsed record-personnel-keyword set carry out K-means algorithm cluster and LDA document subject matter extraction model generates, and storage calculates intermediate object program, uses annealing algorithm to be carried out by K-means and LDA afterwards double To checking, calculate the optimal theme-tag along sort sequence of the overall situation, optimize the result of network behavior custom cluster on this basis;Its In, comprise simulated annealing main flow step A and cost function process flow steps B:
Simulated annealing main flow step A1 is to step A26:
Step A1: set personnel-label-frequency set as PERSONLABELFREQ={ (PERSONp1, LABELp1, FREQp1), (PERSONp2, LABELp2, FREQp2), …, (PERSONpa, LABELpa, FREQpa), wherein, PERSONp1, PERSONp2, …, PERSONpaThe personnel that represent uniquely identify, LABELp1, LABELp2, …, LABELpaGeneration Table personnel surf the web the integrity attribute of content, personnel uniquely identify can corresponding multiple attributes, FREQp1,FREQp2, …, FREQpaThe personnel of representative surf the web the weight of integrity attribute of content, if personnel surf the web, record-personnel-keyword set is RECORDIDPERSONKEYWORD={(RECORDIDr1, PERSONr1, KEYWORDr1), (RECORDIDr2, PERSONr2, KEYWORDr2), …, (RECORDIDra, PERSONra, KEYWORDra), wherein, RECORDIDr1, RECORDIDr2, …, RECORDIDraRepresent personnel record uniquely to identify, personnel uniquely identify and the date of surfing the Net forms, PERSONr1, PERSONr2, …, PERSONraThe personnel that represent uniquely identify, KEYWORDr1,KEYWORDr2, …, KEYWORDraRepresentative Member surfs the web the key word that content comprises, if the Di Li Cray parameter of theme distribution is ALPHA, if the Di Li of key word distribution Cray parameter is ETA, if LDA document subject matter extraction model iterations is ITERLDA, if K-means clustering algorithm iteration time Number is ITERKMEANS, if tag along sort sum is CATEGORYNUM, if the theme sum of LDA document subject matter extraction model is TOPICNUM, if simulated annealing temperature is T, if it is STEP that simulated annealing changes step-length, if simulated annealing is cold But parameter is COOL;
Step A2: set the result set of K-means clustering algorithm as personnel-tag along sort collection, i.e. be expressed as KMEANSPERSONCATEGORY, wherein, tag along sort is represented by natural integer;If the theme-keyword set of LDA model is LDATOPICWORD, wherein, theme is represented by natural integer;If the theme-personnel of LDA model integrate as LDATOPICPERSON, Wherein, theme is represented by natural integer;If the optimal theme-tag along sort sequence of the overall situation is FACTOR, if overall situation maximum matching number For EA, if current topic-tag along sort sequence is vecb, if current matching number is eb, if theme-tag along sort sequence is when front lower It is designated as index, if simulated annealing current step is curstep;
Step A3: call K-means clustering algorithm instrument, the tag along sort sum CATEGORYNUM in incoming step A1, step K-means clustering algorithm iterations ITERKMEANS in A1 and the personnel-label-frequency set in step A1 PERSONLABELFREQ, obtaining K-means clustering algorithm result set is personnel-tag along sort collection, i.e. obtain KMEANSPERSONCATEGORY={(PERSON1, CATEGORYc1), (PERSON2, CATEGORYc2), …, (PERSONa, CATEGORYca), wherein, KMEANSPERSONCATEGORY is from step A2;
Step A4: call LDA modeling tool, the theme sum TOPICNUM of incoming LDA document subject matter extraction model, theme distribution Di Li Cray parameter ALPHA, key word distribution Di Li Cray parameter ETA, LDA document subject matter extraction model iterations ITERLDA and personnel surf the web record-personnel-keyword set RECORDIDPERSONKEYWORD, obtain LDA model Theme-keyword set, i.e. obtain DATOPICWORD={ (TOPICt1, KEYWORD1), (TOPICt2, KEYWORD2), …, (TOPICtb, KEYWORDb) and theme-personnel's collection of LDA model, i.e. LDATOPICPERSON={ (TOPICt1, PERSONp1), (TOPICt2, PERSONp2), …, (TOPICtc, PERSONpc), wherein, LDATOPICWORD and LDATOPICPERSON is from step A2;
Step A5: initialize the global optimal topic-classification-label sequence FACTOR with random integers between 0 and CATEGORYNUM-1; the length of the sequence equals the total number of topics TOPICNUM of the LDA document topic extraction model, and every element of the sequence lies between 0 and CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels; initialize the global maximum match count EA to 0, i.e. FACTOR = {FACTOR1, FACTOR2, …, FACTOR_TOPICNUM}, EA = 0;
Step A6: while the simulated annealing temperature T from step A1 is greater than 0.1, perform steps A7 to A25; otherwise perform step A26;
Step A7: assign a random number to the current subscript index of the topic-classification-label sequence from step A2; the random number ranges between 0 and TOPICNUM-1, where TOPICNUM is the total number of topics of the LDA document topic extraction model in step A1;
Step A8: assign a random number to the current simulated annealing step size curstep from step A2; the random number ranges between -STEP and STEP, where STEP is the simulated annealing step change from step A1;
Step A9: set the current topic-classification-label sequence vecb from step A2 equal to the global optimal topic-classification-label sequence FACTOR from step A2, i.e. vecb = FACTOR;
Step A10: change the value of the current topic-classification-label sequence vecb from step A2 at position index by adding curstep, where index is the current subscript from step A2 and curstep is the current simulated annealing step size from step A2, i.e. vecb_index = vecb_index + curstep;
Step A11: when the value of vecb from step A2 at position index is less than 0, i.e. vecb_index < 0, perform step A12; otherwise perform step A13;
Step A12: set the value of vecb from step A2 at position index to 0, i.e. vecb_index = 0; go to step A15;
Step A13: when the value of vecb from step A2 at position index is greater than CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels from step A1, i.e. vecb_index > CATEGORYNUM-1, perform step A14; otherwise perform step A15;
Step A14: set the value of vecb from step A2 at position index to CATEGORYNUM-1, i.e. vecb_index = CATEGORYNUM-1, where CATEGORYNUM is the total number of classification labels from step A1;
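As an illustrative aside, steps A7 to A14 amount to the following perturbation of the current solution; the function name and its arguments are assumptions that mirror the symbols above.

```python
# A sketch of steps A7-A14: pick a random topic position, shift its label by a
# random step, then clamp the result into the valid range [0, CATEGORYNUM-1].
import random

def perturb(FACTOR, CATEGORYNUM, STEP):
    vecb = list(FACTOR)                          # step A9: copy the current global best
    index = random.randint(0, len(vecb) - 1)     # step A7: random position in [0, TOPICNUM-1]
    curstep = random.randint(-STEP, STEP)        # step A8: random step in [-STEP, STEP]
    vecb[index] += curstep                       # step A10
    vecb[index] = max(0, min(vecb[index], CATEGORYNUM - 1))  # steps A11-A14: clamp
    return vecb
```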
Step A15: obtain the global optimal topic-classification-label sequence FACTOR from step A2;
Step A16: perform step B;
Step A17: assign the result of step B to the global maximum match count EA from step A2;
Step A18: obtain the current topic-classification-label sequence vecb from step A2;
Step A19: perform step B;
Step A20: assign the result of step B to the current match count eb from step A2;
Step A21: when the current match count eb from step A2 is greater than the global maximum match count EA from step A2, i.e. eb > EA, perform step A22; otherwise perform step A25;
Step A22: generate a random number random whose value lies between 0 and 1;
Step A23: when the random number random from step A22 is less than e^((eb-EA)/T), i.e. random < e^((eb-EA)/T), where eb is the current match count from step A2 and EA is the global maximum match count from step A2, perform step A24; otherwise perform step A25;
Step A24: set the global optimal topic-classification-label sequence FACTOR from step A2 to the current topic-classification-label sequence vecb from step A2, and set the global maximum match count EA from step A2 to the current match count eb from step A2, i.e. FACTOR = vecb, EA = eb;
Step A25: lower the simulated annealing temperature T from step A1 using the simulated annealing cooling parameter COOL from step A1, i.e. T = T × COOL, then perform step A6;
Step A26: return the global optimal topic-classification-label sequence from step A2, i.e. FACTOR = {FACTOR1, FACTOR2, …, FACTOR_TOPICNUM}, and return the global maximum match count EA from step A2;
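Taken together, steps A5 to A26 form the simulated-annealing loop sketched below; this is a minimal sketch, not part of the claims. Here cost() stands for the matching procedure of step B, the default values of T, COOL and STEP are placeholders for the parameters fixed in step A1, and the acceptance rule follows the text as written (a candidate must both exceed the current best and pass the e^((eb-EA)/T) test).

```python
# A self-contained sketch of the annealing flow in steps A5-A26.
import math
import random

def anneal(cost, TOPICNUM, CATEGORYNUM, STEP=1, T=10000.0, COOL=0.98):
    # cost: callable mapping a topic-to-label sequence to the match count of step B
    FACTOR = [random.randint(0, CATEGORYNUM - 1) for _ in range(TOPICNUM)]  # step A5
    EA = 0
    while T > 0.1:                                          # step A6
        # Steps A7-A14: perturb one random position of the current best and clamp it
        vecb = list(FACTOR)
        index = random.randint(0, TOPICNUM - 1)
        vecb[index] = max(0, min(vecb[index] + random.randint(-STEP, STEP), CATEGORYNUM - 1))
        EA = cost(FACTOR)                                   # steps A15-A17
        eb = cost(vecb)                                     # steps A18-A20
        # Steps A21-A24: accept only when the candidate matches more persons and
        # also passes the annealing test, exactly as stated in the text.
        # The min(..., 700) only guards math.exp against float overflow; it does
        # not change the comparison, since exp(700) already dwarfs any random in [0, 1).
        if eb > EA and random.random() < math.exp(min((eb - EA) / T, 700)):
            FACTOR, EA = vecb, eb
        T *= COOL                                           # step A25: cool down
    return FACTOR, EA                                       # step A26
```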
The cost function process flow of step B comprises steps B1 to B15:
Step B1: obtain the topic-classification-label sequence TMPFACTOR passed in from step A;
Step B2: let the person-classification-label set of the LDA document topic extraction model be LDAPERSONCATEGORY; let the unique person identifier set be LDAPERSON; let the total number of persons be LDAPERSONNUM; let the set of all classification labels be CATEGORY; let the match count be SUM; let the topic set corresponding to a single person be singlepersontopic; let the person set of the current class in the LDA document topic extraction model be ldacurcategoryperson; let the person set of the current class in the K-means clustering algorithm be kmeanscurcategoryperson; let the set of overlapping persons be unionperson; and let the number of overlapping persons be unionpersonnum;
Step B3: from the topic-person set LDATOPICPERSON of the LDA model in step A2, extract the unique person identifier set LDAPERSON of step B2 and deduplicate the result, i.e. LDAPERSON = Π2(LDATOPICPERSON) = {PERSONp1, PERSONp2, …, PERSONpd};
Step B4: from the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm in step A2, extract the set of all classification labels CATEGORY of step B2 and deduplicate the result, i.e. CATEGORY = Π2(KMEANSPERSONCATEGORY) = {CATEGORYc1, CATEGORYc2, …, CATEGORYcd};
Step B5: let the loop variable be i, with i < LDAPERSONNUM, where LDAPERSONNUM is the total number of persons in step B2;
Step B6: from the topic-person set LDATOPICPERSON of the LDA model in step A2, extract the topic set singlepersontopic corresponding to LDAPERSON_i, i.e. singlepersontopic = {TOPICt1, TOPICt2, …, TOPICtc}, where singlepersontopic is defined in step B2;
Step B7: for each topic in the topic set singlepersontopic corresponding to LDAPERSON_i, look up the corresponding classification label in the topic-classification-label sequence TMPFACTOR from step B1, where a subscript of TMPFACTOR denotes a topic and the value at that subscript denotes the classification label of that topic, i.e. categoryt1 = TMPFACTOR_TOPICt1, categoryt2 = TMPFACTOR_TOPICt2, …, categorytc = TMPFACTOR_TOPICtc, where categoryt1, categoryt2, …, categorytc denote classification labels, different variables possibly denoting the same classification label, and singlepersontopic is defined in step B2; count the number of occurrences of each classification label, denoted categorysnum1, categorysnum2, …, categorysnum_CATEGORYNUM, and find the classification label category with the highest occurrence count; update the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, i.e. LDAPERSONCATEGORY = LDAPERSONCATEGORY ∪ {(LDAPERSON_i, category)};
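As an illustrative aside, the per-person majority vote of steps B6 and B7 can be sketched as follows; the function name and its arguments are assumptions mirroring the symbols above.

```python
# A sketch of steps B6-B7: map one person's topics through TMPFACTOR and keep
# the classification label that occurs most often for that person.
from collections import Counter

def person_label(singlepersontopic, TMPFACTOR):
    labels = [TMPFACTOR[topic] for topic in singlepersontopic]
    category, _ = Counter(labels).most_common(1)[0]
    return category
```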
Step B8: when the loop variable i is greater than LDAPERSONNUM, where LDAPERSONNUM is the total number of persons in step B2, perform step B9; otherwise increase i by 1, i.e. i = i + 1, and perform steps B6 to B7;
Step B9: let the loop variable be j, with j < CATEGORYNUM, where CATEGORYNUM is the total number of classification labels in step A1, and set the match count SUM of step B2 to 0, i.e. SUM = 0;
Step B10: from the person-classification-label set KMEANSPERSONCATEGORY of the K-means algorithm in step A2, extract the person set kmeanscurcategoryperson corresponding to CATEGORY_j, i.e. kmeanscurcategoryperson = {PERSONkmeans1, PERSONkmeans2, …, PERSONkmeansc}, where kmeanscurcategoryperson is defined in step B2;
Step B11: from the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2, extract the person set ldacurcategoryperson corresponding to CATEGORY_j, i.e. ldacurcategoryperson = {PERSONlda1, PERSONlda2, …, PERSONldac}, where ldacurcategoryperson is defined in step B2;
Step B12: compute the intersection unionperson of the person set ldacurcategoryperson in step B2 and the person set kmeanscurcategoryperson in step B2, i.e. unionperson = ldacurcategoryperson ∩ kmeanscurcategoryperson = {PERSONunion1, PERSONunion2, …, PERSONunionc};
Step B13: count the number of persons in the overlapping person set unionperson of step B2, assign it to the number of overlapping persons unionpersonnum in step B2, and add it to the match count SUM in step B2, i.e. SUM = SUM + unionpersonnum;
Step B14: when the loop variable j is greater than the total number of classification labels CATEGORYNUM in step A1, perform step B15; otherwise increase j by 1, i.e. j = j + 1, and perform steps B10 to B13;
Step B15: return the match count SUM of step B2.
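For illustration only, the whole cost flow B1 to B15 can be sketched as a single function; the argument names lda_topic_person and kmeans_person_category are assumptions standing in for LDATOPICPERSON and KMEANSPERSONCATEGORY.

```python
# A sketch of steps B1-B15: derive a person-to-label assignment from the LDA
# topics via TMPFACTOR, then count, label by label, how many persons the
# K-means assignment and the LDA-derived assignment place in the same class.
from collections import Counter

def cost(TMPFACTOR, lda_topic_person, kmeans_person_category, CATEGORYNUM):
    # lda_topic_person: (topic, person) pairs (LDATOPICPERSON)
    # kmeans_person_category: (person, label) pairs (KMEANSPERSONCATEGORY)
    kmeans_person_category = list(kmeans_person_category)
    person_topics = {}
    for topic, person in lda_topic_person:                     # steps B3, B5-B6
        person_topics.setdefault(person, []).append(topic)
    lda_person_category = {                                    # steps B7-B8
        person: Counter(TMPFACTOR[t] for t in topics).most_common(1)[0][0]
        for person, topics in person_topics.items()
    }
    SUM = 0                                                    # step B9
    for j in range(CATEGORYNUM):                               # steps B9-B14
        kmeans_set = {p for p, c in kmeans_person_category if c == j}    # step B10
        lda_set = {p for p, c in lda_person_category.items() if c == j}  # step B11
        SUM += len(kmeans_set & lda_set)                       # steps B12-B13
    return SUM                                                 # step B15
```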
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the clustering uses the K-means clustering algorithm and the LDA model to perform cluster analysis on persons' internet records, then verifies the two clustering results against each other and uses simulated annealing to improve the efficiency of optimizing the clustering result, thereby improving clustering accuracy.
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that steps A3 to A4 provide the initial data required by the simulated annealing algorithm; steps A7 to A12 change the value at a random position of the current solution sequence within the simulated annealing algorithm; steps B5 to B8 associate the topic-classification-label sequence TMPFACTOR from step B1 with the topic-person set LDATOPICPERSON of the LDA model from step A2, thereby deriving the person-classification-label set LDAPERSONCATEGORY of the LDA document topic extraction model in step B2; steps B9 to B14 compare, class by class, the number of persons that appear simultaneously in the K-means clustering result and in the LDA model result, accumulate the number of such persons over all classes, and finally return this number within the flow as the cost of the current sequence; step A14 judges the relative size of eb and EA, and steps A15 to A18 judge the relative size of the random number random and e^((eb-EA)/T), where eb is the current match count in step A2 and EA is the global maximum match count in step A2; when eb > EA and random < e^((eb-EA)/T), the value of the global optimal topic-classification-label sequence FACTOR and the value of the global maximum match count EA are updated, the cost values eb and EA being obtained from step B described above; the final result returns the global maximum match count EA and the global optimal topic-classification-label sequence FACTOR.
The network behavior habit clustering method based on K-means and LDA bi-directional verification according to claim 1, characterized in that the K-means clustering algorithm iteration count ITERKMEANS in step A3 is 300, the total number of classification labels CATEGORYNUM is 3, the Dirichlet parameter ALPHA of the topic distribution in step A4 is 0.1, the Dirichlet parameter ETA of the keyword distribution is 0.01, the LDA document topic extraction model iteration count ITERLDA is 2000, and the total number of topics TOPICNUM of the LDA document topic extraction model in step A1 is 20.
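For reference, the hyperparameters fixed by the last claim above can be grouped as below; the dictionary itself is only an illustrative way of collecting the claimed values and is not part of the claims.

```python
# Hyperparameter values taken from the claim above.
PARAMS = {
    "ITERKMEANS": 300,   # K-means iterations (step A3)
    "CATEGORYNUM": 3,    # number of classification labels
    "ALPHA": 0.1,        # Dirichlet prior of the topic distribution (step A4)
    "ETA": 0.01,         # Dirichlet prior of the keyword distribution
    "ITERLDA": 2000,     # LDA iterations
    "TOPICNUM": 20,      # number of LDA topics (step A1)
}
```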
CN201610565749.XA 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification Active CN106202480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610565749.XA CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Publications (2)

Publication Number Publication Date
CN106202480A true CN106202480A (en) 2016-12-07
CN106202480B CN106202480B (en) 2019-06-11

Family

ID=57493136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610565749.XA Active CN106202480B (en) 2016-07-19 2016-07-19 A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification

Country Status (1)

Country Link
CN (1) CN106202480B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090290508A1 (en) * 2008-05-22 2009-11-26 At&T Labs, Inc. Method for optimizing network "Point of Presence" locations
CN102194012A (en) * 2011-06-17 2011-09-21 清华大学 Microblog topic detecting method and system
CN102609719A (en) * 2012-01-19 2012-07-25 北京工业大学 Method for identifying place image on the basis of improved probabilistic topic model
CN103632166A (en) * 2013-12-04 2014-03-12 西安电子科技大学 Aurora image classification method based on latent theme combining with saliency information
CN103678500A (en) * 2013-11-18 2014-03-26 南京邮电大学 Data mining improved type K mean value clustering method based on linear discriminant analysis
CN103793501A (en) * 2014-01-20 2014-05-14 惠州学院 Theme community discovery method based on social network
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104750856A (en) * 2015-04-16 2015-07-01 天天艾米(北京)网络科技有限公司 System and method for multi-dimensional synergic recommendation
CN105303199A (en) * 2015-12-08 2016-02-03 南京信息工程大学 Data fragment type identification method based on content characteristics and K-means
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MARZUKI KHALID ET AL.: "Tropical wood species recognition system based on multi-feature extractors and classifiers", 2011 2nd International Conference on Instrumentation Control and Automation *
唐晓波 et al.: "Research on a Microblog Recommendation Model Based on Latent Dirichlet Allocation", 《情报科学》 (Information Science) *
杨赛 et al.: "An Image Classification Algorithm Based on the Latent Dirichlet Allocation Model", 《计算机工程》 (Computer Engineering) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN107305614B (en) * 2017-08-12 2020-05-26 西安电子科技大学 Method for processing big data based on MLDM algorithm meeting secondary aggregation
CN107305614A (en) * 2017-08-12 2017-10-31 西安电子科技大学 A kind of method based on the MLDM algorithm process big datas for meeting Second Aggregation
CN108460630A (en) * 2018-02-12 2018-08-28 广州虎牙信息科技有限公司 The method and apparatus for carrying out classification analysis based on user data
CN110276503A (en) * 2018-03-14 2019-09-24 吉旗物联科技(上海)有限公司 A kind of method of automatic identification cold chain vehicle task
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN108897815A (en) * 2018-06-20 2018-11-27 淮阴工学院 A kind of multi-tag file classification method based on similarity model and FastText
CN112800419A (en) * 2019-11-13 2021-05-14 北京数安鑫云信息技术有限公司 Method, apparatus, medium and device for identifying IP group
CN112214515A (en) * 2020-10-16 2021-01-12 平安国际智慧城市科技股份有限公司 Data automatic matching method and device, electronic equipment and storage medium
CN112883154A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text topic mining method and device, computer equipment and storage medium
CN113204641A (en) * 2021-04-12 2021-08-03 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113204641B (en) * 2021-04-12 2022-09-02 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113312450A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN113312450B (en) * 2021-05-28 2022-05-31 北京航空航天大学 Method for preventing text stream sequence conversion attack
CN114742869A (en) * 2022-06-15 2022-07-12 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment
CN114742869B (en) * 2022-06-15 2022-08-16 西安交通大学医学院第一附属医院 Brain neurosurgery registration method based on pattern recognition and electronic equipment

Also Published As

Publication number Publication date
CN106202480B (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN106202480B (en) A kind of network behavior habit clustering method based on K-means and LDA bi-directional verification
Zhang et al. A multi-objective evolutionary approach for mining frequent and high utility itemsets
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN108090607A (en) A kind of social media user&#39;s ascribed characteristics of population Forecasting Methodology based on the fusion of multi-model storehouse
CN104933622A (en) Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme
CN101770520A (en) User interest modeling method based on user browsing behavior
CN103870474A (en) News topic organizing method and device
CN103034687B (en) A kind of relating module recognition methodss based on 2 class heterogeneous networks
CN111724039B (en) Recommendation method for recommending customer service personnel to power users
Chen et al. Research on location fusion of spatial geological disaster based on fuzzy SVM
CN111191099B (en) User activity type identification method based on social media
CN103353880A (en) Data mining method adopting dissimilarity degree clustering and association
CN105205163B (en) A kind of multi-level two sorting technique of the incremental learning of science and technology news
CN104077723A (en) Social network recommending system and social network recommending method
Liu et al. A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge
CN106776859A (en) Mobile solution App commending systems based on user preference
Sharma et al. Trend analysis in machine learning research using text mining
CN103778206A (en) Method for providing network service resources
Naghavipour et al. Hybrid metaheuristics for QoS-aware service composition: A systematic mapping study
Yu et al. Spectrum-enhanced pairwise learning to rank
CN106919997A (en) A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA
Kim et al. Collaborative filtering with a user-item matrix reduction technique
Guan et al. Customer load forecasting method based on the industry electricity consumption behavior portrait
Liu et al. Wheel hub customization with an interactive artificial immune algorithm
CN106202498A (en) A kind of network behavior custom quantization method based on classification corpus key word word frequency record association

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant