CN109711478A - A large-scale data group search method based on timing density clustering
- Publication number
- CN109711478A (application CN201811642734.4A)
- Authority
- CN
- China
- Prior art keywords
- node
- state
- original cluster
- cluster
- original
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a large-scale data group search method based on timing density clustering. Nodes numbering in the hundreds of millions are collected online and preliminarily preprocessed; original clusters and a dendrogram expressing the relationships among nodes are constructed; and the groups into which nodes are merged are found from the connectivity between the representative nodes of the clusters. In each iteration of the algorithm, the contribution scores of the nodes at that moment are computed, and range queries are executed on nodes in descending order of score. While guaranteeing the correctness of the final group discovery, the timing-based density clustering method improves search efficiency on large-scale data networks. The scheme of the invention can reduce I/O and cross-domain communication volume, addressing the key problem of meeting the data-intensive access and diverse data query demands of high-energy physics.
Description
Technical field
The invention belongs to the field of information retrieval, and more particularly relates to a large-scale data group search method based on timing density clustering.
Background art
The energy consumption of high-performance computing is one of the main bottlenecks to China's wider adoption of large-scale supercomputer applications. High-energy physics jobs involve a huge amount of computation, yet there is currently no effective strategy for handling tasks that arrive in batches. A method that, given a set of nodes, classifies them into groups (also called clusters) according to the similarity of their attributes is called a clustering algorithm. Current clustering methods can be divided into partition-based methods (such as the K-MEANS algorithm), hierarchy-based methods (such as the BIRCH algorithm), and density-based methods (such as the DBSCAN algorithm). Of these, density-based clustering overcomes the shortcoming of the other methods, which can only find "roughly circular" groups. Density-based clustering methods such as DBSCAN are now very widely used in practice, for example in neuroscience and astronomy. In recent years, however, social networks have kept growing, and the number of nodes in mobile apps (such as Weibo) has reached the billions. Faced with large-scale complex data, existing group search methods have begun to run into a series of computational bottlenecks.
Two main factors affect the performance of existing density-based clustering methods. The first is the time needed to execute range queries on all nodes, which is proportional to the number of nodes. The second is the time needed to propagate cluster labels, which is mainly affected by the propagation span. Improved methods have been proposed for these two factors, but they work in a passive manner: for example, information from earlier data is not learned, which causes a large amount of redundant computation and limits the algorithm's performance. Moreover, most existing improved methods rely on batch processing and grid mechanisms. Grid mechanisms have scalability problems on large-scale datasets, and batch processing limits interaction with the user while the algorithm is running.
In recent years, Son et al. proposed the Ti-DBC method, a new timing-based method for searching group nodes: given the attributes of a node, it can quickly rank the contributions of its neighbor nodes in time order. Being timing-based means that the algorithm lets the user terminate it at any iteration while still producing a corresponding valid result, which enables interaction with the user. However, that algorithm does not consider how the state of the network is stored after nodes are merged into groups. When facing large-scale complex attribute data, it does not record the state of a node after its range query has been executed, so redundant computation arises and greatly reduces the algorithm's efficiency.
In summary, the problem with the prior art is that current density-based group search methods suffer from low computational efficiency and insufficient scalability when facing large-scale complex data.
Summary of the invention
To overcome the defects of the prior art, the invention discloses a new large-scale data group search method based on timing density clustering. First, the invention tracks node states over time, so that at any given moment nodes in different states are handled differently; this reduces redundant computation and markedly improves efficiency when processing large-scale data. Second, when computing the contribution score of a node to a group, as the algorithm iterates it uses the scores of the nodes at each moment to find the node most likely to be merged into a group and executes the next range query on it, which speeds up the convergence of the algorithm.
To solve the above technical problem, the technical solution of the invention is as follows:
A large-scale data group search method based on timing density clustering, comprising the following steps:
S1: according to the given nodes, define three initial states of a node and the original cluster; the three states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S2: according to the relationships among the original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
S5: execute the range query on the selected node and update the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters.
In a preferred solution, S1 includes the following:
Range queries are executed on randomly selected nodes in the initial state, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point.
This process is repeated until no node remains in the initial state; every node is then in either the unexecuted state or the executed state, and all nodes have been gathered into many different original clusters.
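The S1 procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `initial_range_queries`, the brute-force Euclidean range query with radius `eps`, the state labels, and the convention that a node counts as its own neighbor are all assumptions introduced here.

```python
import random
from math import dist

def initial_range_queries(points, eps, mu, seed=0):
    """S1 sketch: run range queries on randomly chosen initial-state nodes."""
    state = {i: "initial" for i in range(len(points))}
    noise_sequence = []      # plays the role of the noise sequence L
    original_clusters = {}   # executed core point -> set of neighborhood members
    rng = random.Random(seed)
    while True:
        pending = [i for i, s in state.items() if s == "initial"]
        if not pending:
            break            # no node remains in the initial state
        p = rng.choice(pending)
        # Brute-force range query (assumption: Euclidean distance, self included).
        neighbors = [j for j in range(len(points))
                     if dist(points[p], points[j]) <= eps]
        if len(neighbors) < mu:
            state[p] = "executed-noise"
            noise_sequence.append(p)
            continue
        state[p] = "executed-core"
        original_clusters[p] = set(neighbors)
        for j in neighbors:
            if j == p:
                continue
            if state[j] == "initial":
                state[j] = "unexecuted-border"
            elif state[j] == "executed-noise":
                state[j] = "executed-border"   # noise point promoted to boundary point
    return state, noise_sequence, original_clusters
```

A promoted noise point deliberately stays in `noise_sequence`, since S6 re-checks every entry of L at the end.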
In a preferred solution, S2 includes the following:
Strongly connected means that original cluster a and original cluster b are density-connected;
Weakly connected means that the node sets of original cluster a and original cluster b intersect;
All remaining cases are defined as original cluster a and original cluster b having no connection.
In this preferred solution, at time t, original clusters whose state(a, b) is strongly connected have a higher probability of merging into the same group at the next range query (i.e. at time t+1) than those that are weakly connected or not connected.
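The three values of state(a, b) can be illustrated with a small sketch. The text does not spell out how density-connectedness between two original clusters is tested, so the sketch assumes one possible test: two original clusters are treated as strongly connected when they share a node already known to be a core point. The function name `edge_state` and its parameters are hypothetical.

```python
def edge_state(members_a, members_b, known_cores):
    """Classify the connection between two original clusters.

    members_a, members_b: sets of node ids in each original cluster.
    known_cores: set of node ids already identified as core points.
    """
    shared = members_a & members_b
    if shared & known_cores:
        return "strong"   # assumed test for density-connectedness
    if shared:
        return "weak"     # node sets intersect, no shared known core point
    return "none"         # no connection
```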
In a preferred solution, S3 includes the following:
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster.
In this preferred solution, because the dendrogram is very small, the time spent propagating labels over it is greatly reduced, which improves the performance of the algorithm. Merging the clusters represented by strongly connected nodes into merged original clusters greatly reduces the number of representative nodes of the S2 original clusters in the dendrogram, which improves the efficiency of the algorithm.
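Finding and merging the strongly connected components can be sketched with a union-find (disjoint-set) structure. The text does not name a data structure, so union-find is an assumption chosen here for illustration, as are the function name `merge_strong_components` and the dictionary layout of the edge states.

```python
def merge_strong_components(representatives, states):
    """Group cluster representatives linked by 'strong' edges.

    representatives: iterable of hashable, mutually comparable representative ids.
    states: dict mapping (a, b) pairs to "strong", "weak", or "none".
    Returns a list of sets, one per merged original cluster.
    """
    parent = {r: r for r in representatives}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in states.items():
        if s == "strong":
            parent[find(a)] = find(b)

    groups = {}
    for r in parent:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=min)
```

Weak and unconnected edges are left untouched, so they remain available for the later selection and update steps.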
In a preferred solution, S4 includes the following:
An importance assessment is carried out for every node in the unexecuted state in the merged original clusters of the dendrogram; the importance of each unexecuted node is computed, the nodes are sorted by importance, and range queries are executed starting from the node of highest importance;
Wherein, at time t, the statistic of a node is expressed by the following formula:
Wherein n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
Wherein p(a) is original cluster a, and ne(a) is the total number of neighbor nodes of a within the cluster.
In this preferred solution, the fewer weakly connected and unconnected nodes there are in the dendrogram, the closer the algorithm is to convergence, the fewer queries are needed, and the higher the performance.
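The selection rule in S4 (rank the unexecuted nodes by importance and query the highest first) can be sketched as below. The sketch assumes the importance of each node has already been computed and is passed in as a dictionary; the function name `rank_unexecuted` and the state labels are hypothetical.

```python
def rank_unexecuted(node_state, importance):
    """Return unexecuted nodes ordered from highest to lowest importance.

    node_state: dict mapping node id -> state label; unexecuted nodes are
                assumed to carry labels starting with "unexecuted".
    importance: dict mapping node id -> precomputed importance score.
    """
    candidates = [v for v, s in node_state.items() if s.startswith("unexecuted")]
    # sorted() is stable, so nodes with equal scores keep their original order.
    return sorted(candidates, key=lambda v: importance[v], reverse=True)
```

The first element of the returned list is the node on which the next range query would be executed; an empty list means there is nothing left to query.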
In a preferred solution, S5 includes the following:
If the query finds unexecuted boundary points in an original cluster, those unexecuted boundary points are merged into the original cluster; otherwise, the node is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated iteratively until the algorithm converges.
In a preferred solution, the criterion for convergence is: the algorithm is defined to have converged when state(a, b) is strongly connected for the edges of all nodes in the dendrogram.
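The iterate-until-convergence loop, together with the option of stopping at any iteration, can be sketched as follows. The callback `query_step` is a hypothetical stand-in for "select a node, execute its range query, and update the dendrogram"; `max_rounds` is an assumed parameter modelling a user who terminates the algorithm early.

```python
def run_until_converged(states, query_step, max_rounds=None):
    """Repeat query_step until every dendrogram edge is 'strong'.

    states: dict mapping (a, b) pairs to "strong", "weak", or "none";
            query_step is expected to update it in place.
    Returns the number of rounds executed.
    """
    rounds = 0
    while not all(s == "strong" for s in states.values()):
        if max_rounds is not None and rounds >= max_rounds:
            break   # early stop: the current states are an intermediate result
        query_step(states)
        rounds += 1
    return rounds
```

The loop only terminates on its own if `query_step` makes progress, so passing `max_rounds` is a sensible safeguard when experimenting.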
In a preferred solution, S6 includes the following:
All nodes in the noise sequence L are scanned to check whether each node exists in the dendrogram; if it does, the node is changed to a boundary point; if it does not, it is labeled as a noise point.
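The final noise re-check can be sketched in a few lines. The sketch assumes that "exists in the dendrogram" can be tested as membership in the set of nodes that ended up in any cluster; the function name `recheck_noise` and its parameters are hypothetical.

```python
def recheck_noise(noise_sequence, clustered_nodes):
    """Split the noise sequence L into boundary points and confirmed noise.

    noise_sequence: node ids recorded as noise during S1.
    clustered_nodes: set of node ids that belong to some cluster.
    """
    boundary, noise = [], []
    for v in noise_sequence:
        (boundary if v in clustered_nodes else noise).append(v)
    return boundary, noise
```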
Compared with the prior art, the beneficial effects of the technical solution of the invention are:
(1) When searching for groups, the invention provides an active learning mechanism that learns information about nodes on which no range query has been executed, and uses it to reduce the total number of range queries and the number of label propagations, improving the algorithm's search efficiency on large-scale data networks.
(2) The method applies a timing technique to node states, allowing the user to terminate the algorithm at any iteration while still obtaining a corresponding valid result, which enables interaction with the user.
(3) The invention removes the limitations of grid mechanisms and does not need data structures of high time and space complexity, so it scales well to large-scale complex datasets.
(4) The invention establishes an efficient graph (network) data index structure and data management model and proposes a node contribution assessment algorithm, which effectively improves data access efficiency and reduces I/O and cross-domain communication volume, addressing the key problem of meeting the data-intensive access and diverse data query demands of high-energy physics.
Brief description of the drawings
Fig. 1 is the flow chart of this embodiment.
Detailed description
The drawings are for illustration only and shall not be construed as limiting this patent;
To better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the invention is further described below with reference to the drawing and the embodiment.
As shown in Fig. 1, a large-scale data group search method based on timing density clustering comprises the following steps:
S1: according to the given nodes, define three initial states of a node and the original cluster; the three states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S1 includes the following:
Range queries are executed on randomly selected nodes in the initial state, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point;
This process is repeated until no node remains in the initial state;
S2: according to the relationships among the original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
Strongly connected means that original cluster a and original cluster b are density-connected;
Weakly connected means that the node sets of original cluster a and original cluster b intersect;
All remaining cases are defined as original cluster a and original cluster b having no connection;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
An importance assessment is carried out for every node in the unexecuted state in the merged original clusters of the dendrogram; the importance of each unexecuted node is computed, the nodes are sorted by importance, and range queries are executed starting from the node of highest importance;
Wherein, at time t, the statistic of a node is expressed by the following formula:
Wherein n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
Wherein p(a) is original cluster a, and ne(a) is the total number of neighbor nodes of a within the cluster;
S5: execute the range query on the selected node and update the dendrogram;
If the query finds unexecuted boundary points in an original cluster, those unexecuted boundary points are merged into the original cluster; otherwise, the node is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated iteratively until the algorithm converges; the criterion for convergence is: the algorithm is defined to have converged when state(a, b) is strongly connected for the edges of all nodes in the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters;
All nodes in the noise sequence L are scanned to check whether each node exists in the dendrogram; if it does, the node is changed to a boundary point; if it does not, it is labeled as a noise point.
The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent. Obviously, the above embodiment of the invention is merely an example given to illustrate the invention clearly and does not limit the ways in which the invention may be implemented. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.
Claims (8)
1. A large-scale data group search method based on timing density clustering, characterized by comprising the following steps:
S1: according to the given nodes, define three initial states of a node and the original cluster; the three states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;
S2: according to the relationships among the original clusters, construct a dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;
S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;
S4: among the nodes of the merged original clusters, select the node on which to execute a range query;
S5: execute the range query on the selected node and update the dendrogram;
S6: re-check the noise points from S1, and output the re-checked noise points and the resulting clusters.
2. The large-scale data group search method according to claim 1, characterized in that S1 includes the following:
Range queries are executed on randomly selected nodes in the initial state, as follows:
If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;
If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted boundary points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed boundary point;
This process is repeated until no node remains in the initial state.
3. The large-scale data group search method according to claim 1 or 2, characterized in that S2 includes the following:
Strongly connected means that original cluster a and original cluster b are density-connected;
Weakly connected means that the node sets of original cluster a and original cluster b intersect;
All remaining cases are defined as original cluster a and original cluster b having no connection.
4. The large-scale data group search method according to claim 3, characterized in that S3 includes the following:
Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original cluster.
5. The large-scale data group search method according to claim 4, characterized in that S4 includes the following:
An importance assessment is carried out for every node in the unexecuted state in the merged original clusters of the dendrogram; the importance of each unexecuted node is computed, the nodes are sorted by importance, and range queries are executed starting from the node of highest importance;
Wherein, at time t, the statistic of a node is expressed by the following formula:
Wherein n(a) is the total number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of all nodes;
At time t, the degree of a node is expressed by the following formula:
$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$
At time t, the importance of a node is expressed by the following formula:
Wherein p(a) is original cluster a, and ne(a) is the total number of neighbor nodes of a within the cluster.
6. The large-scale data group search method according to claim 1, 2, 4 or 5, characterized in that S5 includes the following:
If the query finds unexecuted boundary points in an original cluster, those unexecuted boundary points are merged into the original cluster; otherwise, the node is defined as an unexecuted core point; meanwhile, according to the execution state of the nodes, the node states in the dendrogram are updated iteratively until the algorithm converges.
7. The large-scale data group search method according to claim 6, characterized in that the criterion for convergence is: the algorithm is defined to have converged when state(a, b) is strongly connected for the edges of all nodes in the dendrogram.
8. The large-scale data group search method according to claim 1, 2, 4, 5 or 7, characterized in that S6 includes the following:
All nodes in the noise sequence L are scanned to check whether each node exists in the dendrogram; if it does, the node is changed to a boundary point; if it does not, it is labeled as a noise point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811642734.4A CN109711478A (en) | 2018-12-29 | 2018-12-29 | A kind of large-scale data group searching method based on timing Density Clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109711478A true CN109711478A (en) | 2019-05-03 |
Family
ID=66259612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811642734.4A Pending CN109711478A (en) | 2018-12-29 | 2018-12-29 | A kind of large-scale data group searching method based on timing Density Clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109711478A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||