CN109711478A - A large-scale data group search method based on timing density clustering

Info

Publication number
CN109711478A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811642734.4A
Other languages
Chinese (zh)
Inventor
姚嘉豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201811642734.4A priority Critical patent/CN109711478A/en
Publication of CN109711478A publication Critical patent/CN109711478A/en
Pending legal-status Critical Current

Abstract

The invention discloses a large-scale data group search method based on timing density clustering. More than one hundred million online nodes are collected and preliminarily preprocessed; original clusters and a dendrogram expressing the relationships between nodes are constructed; and the groups into which nodes are merged are found from the connectivity between the representative nodes of the clusters. In each iteration of the algorithm, the contribution score of each node at that moment is calculated, and range queries are executed on nodes in descending order of score. While guaranteeing the correctness of the final group discovery, the timing-based density clustering method improves search efficiency on large-scale data networks. The solution of the invention reduces I/O and cross-domain communication, addressing the key problems of data-intensive access and diverse data query demands in high-energy physics.

Description

A large-scale data group search method based on timing density clustering
Technical field
The invention belongs to the field of information retrieval and, more particularly, relates to a large-scale data group search method based on timing density clustering.
Background
The energy consumption of high-performance computing is one of the main bottlenecks limiting large-scale supercomputer applications in China. High-energy physics jobs involve a huge amount of computation, yet there is currently no effective strategy for handling tasks that arrive in batches. A method that takes a set of nodes and groups them into groups (also called clusters) according to the similarity of their attributes is called a clustering algorithm. Current clustering methods can be divided into partition-based methods (such as the K-MEANS algorithm), hierarchy-based methods (such as the BIRCH algorithm), and density-based methods (such as the DBSCAN algorithm). Density-based clustering overcomes the shortcoming of the other methods, which can only find roughly circular groups. Density-based clustering methods (such as DBSCAN) are now widely used in practice, for example in neuroscience and astronomy. In recent years, however, social networks have kept growing, and the number of nodes in mobile applications (such as microblogs) has reached billions. Faced with large-scale complex data, existing group search methods have begun to hit a series of computational bottlenecks.
Two main factors determine the performance of existing density-based clustering methods. The first is the time needed to execute range queries on all nodes, which is proportional to the number of nodes. The second is the time needed to propagate cluster labels, which is mainly affected by the span over which the labels propagate. Improved methods have been proposed for both factors, but they work in a passive way: for example, information from earlier data is not learned, which causes a large amount of redundant computation and limits the performance of the algorithm. In addition, most existing improved methods rely on batch processing and grid mechanisms. Grid mechanisms scale poorly on large data sets, and batch processing limits interaction with the user while the algorithm is running.
In recent years, Son et al. proposed the Ti-DBC method, a new timing-based method for searching group nodes which, given the attributes of a node, can quickly rank the contribution of its neighbor nodes in time order. Being timing-based means that the user can terminate the algorithm at any iteration and still obtain a corresponding result, which makes interaction with the user possible. However, the algorithm does not consider the stored state of the network after nodes have been merged into groups. When facing large-scale complex attribute data, the state of a node after it has executed a range query is not recorded, so redundant computation occurs and greatly reduces the efficiency of the algorithm.
In summary, the problem with the prior art is that current density-based group search methods suffer from low computational efficiency and insufficient scalability when facing large-scale complex data.
Summary of the invention
To overcome these deficiencies of the prior art, the invention discloses a new large-scale data group search method based on timing density clustering. By sequencing node states over time, the invention processes nodes differently according to the state they are in at a given moment, which reduces redundant computation and markedly improves efficiency on large-scale data. In addition, when scoring the contribution of nodes to groups, the algorithm uses the scores of nodes at different moments, as iteration proceeds, to find the next node most likely to be merged into a group and executes a range query on it, which improves the convergence speed of the algorithm.
To solve the above technical problems, the technical solution of the invention is as follows:
A large-scale data group search method based on timing density clustering, comprising the following steps:
S1: according to the given nodes, define the three original states of a node and the original cluster; the original states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S2: according to the correlation between original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b); state(a, b) takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
S5: execute the range query on the selected node and update the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters.
In a preferred scheme, S1 includes the following:
A node in the initial state is selected at random and a range query is executed on it, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point.
The above process keeps executing range queries until every node in the initial state has been handled. At that point all nodes are in either the unexecuted state or the executed state, and they have been gathered into many different original clusters.
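The S1 procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the patent does not name a query radius or a distance measure, so the radius `eps`, the use of Euclidean distance, the convention that a node counts as its own neighbor, and all function and variable names are hypothetical.

```python
import math
import random

def range_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def build_original_clusters(points, eps, mu, seed=0):
    """Run range queries on randomly chosen initial-state nodes until none remain."""
    rng = random.Random(seed)
    state = {i: "initial" for i in range(len(points))}  # initial / unexecuted / executed
    role = {}          # core / boundary / noise
    clusters = {}      # executed core point -> set of nodes in its neighborhood
    noise_list = []    # the noise sequence L
    order = list(range(len(points)))
    rng.shuffle(order)  # visiting a shuffled order stands in for random selection
    for i in order:
        if state[i] != "initial":
            continue
        neighbors = range_query(points, i, eps)
        state[i] = "executed"
        if len(neighbors) < mu:
            role[i] = "noise"
            noise_list.append(i)
            continue
        role[i] = "core"
        clusters[i] = set(neighbors)
        for j in neighbors:
            if j == i:
                continue
            if role.get(j) == "noise":
                role[j] = "boundary"      # executed noise point becomes executed boundary point
            elif state[j] == "initial":
                state[j] = "unexecuted"   # unexecuted boundary point
                role[j] = "boundary"
    return state, role, clusters, noise_list
```

After the loop no node is left in the initial state, which matches the stopping condition described above. Nodes promoted from noise to boundary stay in `noise_list` so that a later review step can re-check them.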
In a preferred scheme, S2 includes the following:
The strongly connected state means that original cluster a and original cluster b are density-connected;
The weakly connected state means that the node sets of original cluster a and original cluster b intersect;
In all remaining cases, original cluster a and original cluster b are defined as not connected.
In this preferred scheme, at time t, original-cluster nodes whose state(a, b) is strongly connected are more likely than weakly connected or unconnected nodes to be merged into the same group at the next range query (i.e., at time t+1).
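The three connection states can be illustrated with a small classifier. The patent does not spell out how density-connectedness between two original clusters is tested, so this sketch assumes, purely for illustration, that two clusters are density-connected when they share a node already known to be a core point. The function name and the string labels are hypothetical.

```python
def connection_state(cluster_a, cluster_b, core_nodes):
    """Classify state(a, b) for two original clusters given as sets of node ids.

    Assumption (not from the patent): a shared known core point is taken
    as evidence that the two clusters are density-connected.
    """
    shared = cluster_a & cluster_b
    if shared & core_nodes:
        return "strong"   # strongly connected
    if shared:
        return "weak"     # node sets intersect, but no shared known core point
    return "none"         # not connected
```

For example, `connection_state({1, 2, 3}, {3, 4, 5}, core_nodes={3})` is classified as strong under this assumption, while the same clusters with `core_nodes={1}` are only weakly connected.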
In a preferred scheme, S3 includes the following:
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record their real-time states in another graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster.
In this preferred scheme, because the dendrogram is very small, the time spent propagating labels on it is greatly reduced, which improves the performance of the algorithm. Merging the clusters represented by the strongly connected nodes into merged original clusters greatly reduces the number of representative nodes from S2 in the dendrogram, which improves the efficiency of the algorithm.
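One common way to merge strongly connected components is a union-find (disjoint-set) structure. The patent does not say which data structure is used, so the following is a sketch of one possible approach rather than the patented implementation; the input layout and all names are assumptions.

```python
def merge_strong_components(clusters, states):
    """Merge original clusters whose representatives are strongly connected.

    clusters: dict mapping representative -> set of member nodes
    states:   dict mapping (a, b) -> "strong" | "weak" | "none"
    Returns a dict mapping the surviving representative -> merged member set.
    """
    parent = {rep: rep for rep in clusters}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), st in states.items():
        if st == "strong":
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    merged = {}
    for rep, members in clusters.items():
        merged.setdefault(find(rep), set()).update(members)
    return merged
```

Because only representatives are stored in `parent`, the structure stays small even when the clusters themselves contain many nodes, which is in line with the remark above that operations on the dendrogram are cheap.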
In a preferred scheme, S4 includes the following:
In the dendrogram, the importance of every unexecuted node in the merged original clusters is assessed: the importance of each unexecuted node is calculated, the nodes are sorted by importance, and range queries are executed starting from the node with the highest importance;
At time t, the statistic of a node is expressed by the following formula:
where n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\mathrm{weak}} s(a) + \sum_{state(a,b)=\mathrm{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
where p(a) is the original cluster a, and ne(a) is the total number of neighbor nodes in the cluster of a.
In this preferred scheme, the fewer weakly connected and unconnected nodes there are in the dendrogram, the closer the algorithm is to convergence, the fewer queries it needs, and the higher its performance.
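The selection rule of S4 (rank the unexecuted nodes by importance and query the highest first) can be sketched as follows. The importance values are passed in as input rather than computed, because this sketch makes no assumption about the exact form of the statistic and importance formulas; the function name and the state labels are hypothetical.

```python
def rank_unexecuted(importance, state):
    """Return unexecuted nodes ordered from highest to lowest importance.

    importance: dict mapping node -> importance score at time t
    state:      dict mapping node -> "initial" | "unexecuted" | "executed"
    """
    candidates = [v for v in importance if state.get(v) == "unexecuted"]
    return sorted(candidates, key=lambda v: importance[v], reverse=True)
```

The first element of the returned list, if any, is the node on which the next range query would be executed; an empty list means no unexecuted node is left to query.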
In a preferred scheme, S5 includes the following:
If the query finds an unexecuted boundary point in an original cluster, the unexecuted boundary point is merged into the original cluster; otherwise, it is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated cyclically until the algorithm converges.
In a preferred scheme, the algorithm is considered to have converged when state(a, b) is strongly connected on every edge between nodes in the dendrogram.
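The convergence criterion can be written as a single check over the recorded edge states. This is a minimal sketch that assumes the edge states are kept in a dictionary keyed by pairs of representatives; that layout and the function name are illustrative only.

```python
def has_converged(edge_states):
    """True when every recorded edge between representatives is strongly connected.

    edge_states: dict mapping (a, b) -> "strong" | "weak" | "none"
    """
    return all(st == "strong" for st in edge_states.values())
```

Note that with this formulation a graph with no recorded edges also counts as converged, since there is nothing left to resolve.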
In a preferred scheme, S6 includes the following:
Scan all nodes in the noise sequence L and check whether each node exists in the dendrogram; if it does, change the node to a boundary point; if it does not, label it as a noise point.
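The noise review of S6 amounts to one pass over the noise sequence L. The sketch below interprets "exists in the dendrogram" as "is a member of at least one cluster in the final result"; that interpretation, the function name, and the argument layout are assumptions made for illustration.

```python
def review_noise(noise_list, clustered_nodes):
    """Split the noise sequence L into boundary points and confirmed noise points.

    noise_list:      nodes that were marked as noise during S1
    clustered_nodes: set of all nodes that belong to some final cluster
    """
    boundary, noise = [], []
    for v in noise_list:
        if v in clustered_nodes:
            boundary.append(v)   # reachable from a cluster after all: boundary point
        else:
            noise.append(v)      # still isolated: noise point
    return boundary, noise
```

Keeping `clustered_nodes` as a set makes each membership check constant time on average, so the review is linear in the length of L.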
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
(1) When searching for groups, the invention proposes an active learning mechanism that learns information about nodes on which no range query has yet been executed and uses it to reduce the total number of range queries and of label propagations, improving the search efficiency of the algorithm on large-scale data networks.
(2) The method applies timing to node states, allowing the user to terminate the algorithm at any iteration and still obtain a corresponding result, which makes interaction with the user possible.
(3) The invention overcomes the limitations of grid mechanisms and does not need data structures with high time and space complexity; its scalability suits large-scale complex data sets well.
(4) The invention establishes an efficient graph (network) data index structure and data management model and proposes a node contribution assessment algorithm, which effectively improves data access efficiency, reduces I/O and cross-domain communication, and addresses the key problems of data-intensive access and diverse data query demands in high-energy physics.
Description of the drawings
Fig. 1 is the flow chart of the present embodiment.
Detailed description of the embodiments
The drawings are for illustration only and shall not be construed as limiting this patent;
To better illustrate this embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;
Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the invention is further described below with reference to the drawings and the embodiment.
As shown in Fig. 1, a large-scale data group search method based on timing density clustering comprises the following steps:
S1: according to the given nodes, define the three original states of a node and the original cluster; the original states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S1 includes the following:
A node in the initial state is selected at random and a range query is executed on it, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point;
The above process keeps executing range queries until every node in the initial state has been handled;
S2: according to the correlation between original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b); state(a, b) takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
The strongly connected state means that original cluster a and original cluster b are density-connected;
The weakly connected state means that the node sets of original cluster a and original cluster b intersect;
In all remaining cases, original cluster a and original cluster b are defined as not connected;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record their real-time states in another graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
In the dendrogram, the importance of every unexecuted node in the merged original clusters is assessed: the importance of each unexecuted node is calculated, the nodes are sorted by importance, and range queries are executed starting from the node with the highest importance;
At time t, the statistic of a node is expressed by the following formula:
where n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\mathrm{weak}} s(a) + \sum_{state(a,b)=\mathrm{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
where p(a) is the original cluster a, and ne(a) is the total number of neighbor nodes in the cluster of a;
S5: execute the range query on the selected node and update the dendrogram;
If the query finds an unexecuted boundary point in an original cluster, the unexecuted boundary point is merged into the original cluster; otherwise, it is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated cyclically until the algorithm converges; the algorithm is considered to have converged when state(a, b) is strongly connected on every edge between nodes in the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters;
Scan all nodes in the noise sequence L and check whether each node exists in the dendrogram; if it does, change the node to a boundary point; if it does not, label it as a noise point.
The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent. Obviously, the above embodiment is merely an example given to illustrate the invention clearly and does not limit the ways in which the invention can be implemented. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (8)

1. A large-scale data group search method based on timing density clustering, characterized by comprising the following steps:
S1: according to the given nodes, define the three original states of a node and the original cluster; the original states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S2: according to the correlation between original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b); state(a, b) takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
S5: execute the range query on the selected node and update the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters.
2. The large-scale data group search method according to claim 1, characterized in that S1 includes the following:
A node in the initial state is selected at random and a range query is executed on it, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point;
The above process keeps executing range queries until every node in the initial state has been handled.
3. The large-scale data group search method according to claim 1 or 2, characterized in that S2 includes the following:
The strongly connected state means that original cluster a and original cluster b are density-connected;
The weakly connected state means that the node sets of original cluster a and original cluster b intersect;
In all remaining cases, original cluster a and original cluster b are defined as not connected.
4. The large-scale data group search method according to claim 3, characterized in that S3 includes the following:
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record their real-time states in another graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster.
5. The large-scale data group search method according to claim 4, characterized in that S4 includes the following:
In the dendrogram, the importance of every unexecuted node in the merged original clusters is assessed: the importance of each unexecuted node is calculated, the nodes are sorted by importance, and range queries are executed starting from the node with the highest importance;
At time t, the statistic of a node is expressed by the following formula:
where n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\mathrm{weak}} s(a) + \sum_{state(a,b)=\mathrm{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
where p(a) is the original cluster a, and ne(a) is the total number of neighbor nodes in the cluster of a.
6. The large-scale data group search method according to claim 1, 2, 4 or 5, characterized in that S5 includes the following:
If the query finds an unexecuted boundary point in an original cluster, the unexecuted boundary point is merged into the original cluster; otherwise, it is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated cyclically until the algorithm converges.
7. The large-scale data group search method according to claim 6, characterized in that the algorithm is considered to have converged when state(a, b) is strongly connected on every edge between nodes in the dendrogram.
8. The large-scale data group search method according to claim 1, 2, 4, 5 or 7, characterized in that S6 includes the following:
Scan all nodes in the noise sequence L and check whether each node exists in the dendrogram; if it does, change the node to a boundary point; if it does not, label it as a noise point.
CN201811642734.4A 2018-12-29 2018-12-29 A kind of large-scale data group searching method based on timing Density Clustering Pending CN109711478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811642734.4A CN109711478A (en) 2018-12-29 2018-12-29 A kind of large-scale data group searching method based on timing Density Clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811642734.4A CN109711478A (en) 2018-12-29 2018-12-29 A kind of large-scale data group searching method based on timing Density Clustering

Publications (1)

Publication Number Publication Date
CN109711478A true CN109711478A (en) 2019-05-03

Family

ID=66259612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811642734.4A Pending CN109711478A (en) 2018-12-29 2018-12-29 A kind of large-scale data group searching method based on timing Density Clustering

Country Status (1)

Country Link
CN (1) CN109711478A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination