CN109711478A

CN109711478A - A large-scale data group search method using timing-based density clustering

Info

Publication number: CN109711478A
Application number: CN201811642734.4A
Authority: CN
Inventors: 姚嘉豪
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-03

Abstract

The invention discloses a large-scale data group search method using timing-based density clustering. Online nodes numbering in the hundreds of millions are collected and preprocessed; original clusters and a dendrogram expressing the relationships between nodes are constructed; and the groups into which nodes are merged are found from the connectivity between the representative nodes of the clusters. In each iteration of the algorithm, the contribution score of each node at that moment is calculated, and range queries are executed on nodes in descending order of score. While guaranteeing the correctness of the final group discovery, the timing-based density clustering method improves search efficiency on large-scale data networks. The solution reduces I/O and cross-domain communication, addressing the key problem of data-intensive access and diverse data query demands in high-energy physics.

Description

A large-scale data group search method using timing-based density clustering

Technical field

The invention belongs to the field of information retrieval and, more particularly, relates to a large-scale data group search method using timing-based density clustering.

Background art

The energy consumption of high-performance computing is one of the main bottlenecks to promoting large-scale supercomputer applications in China. High-energy physics jobs involve a huge amount of computation, yet there is currently no effective strategy for handling tasks that arrive in batches. A method that takes a set of nodes and classifies them into groups (also called clusters) according to the similarity of their attributes is called a clustering algorithm. Current clustering methods can be divided into partition-based methods (such as the K-MEANS algorithm), hierarchy-based methods (such as the BIRCH algorithm), and density-based methods (such as the DBSCAN algorithm). Density-based clustering overcomes a shortcoming of the other methods, which can only find roughly circular groups. Density-based clustering methods (such as DBSCAN) are now widely used in practice, for example in neuroscience and astronomy. However, social networks have expanded continuously in recent years, and the number of nodes in a mobile application (such as Weibo) has reached the billions. Faced with large-scale complex data, existing group search methods have begun to run into a series of computational bottlenecks.

Two main factors affect the performance of existing density-based clustering methods. The first is the time spent executing range queries on all nodes, which is proportional to the number of nodes. The second is the time spent propagating cluster labels, which is mainly affected by the span. Improved methods have been proposed for both factors, but they work in a passive manner: for example, information from earlier data is not learned, which causes a large amount of redundant computation and limits the performance of the algorithm. Moreover, most existing improved methods rely on batch processing and grid mechanisms. Grid mechanisms scale poorly to large datasets, while batch processing limits interaction with the user during execution.

In recent years, Son et al. proposed the Ti-DBC method, a new timing-based group node search method that, given the attributes of a node, can quickly rank the contributions of its neighbor nodes over time. Being timing-based means that the algorithm allows the user to terminate it at a given iteration while still producing a corresponding correct result, which makes interaction with the user possible. However, the algorithm does not consider how the network state is stored after nodes are merged into groups. When facing large-scale complex attribute data, it does not record the state of a node after a range query has been executed on it, so redundant computation arises and greatly reduces the efficiency of the algorithm.

In conclusion problem of the existing technology is: currently based on group's searching method of density in face of extensive multiple Occurs the problem that computational efficiency is low, set expandability is insufficient when miscellaneous data.

Summary of the invention

To overcome the defects of the prior art, the invention discloses a new large-scale data group search method using timing-based density clustering. By tracking the state of each node over time, the invention handles a node differently depending on its state at a given moment, which reduces redundant computation and markedly improves efficiency on large-scale data. In addition, when scoring each node's contribution to a group, the algorithm uses the scores of the nodes at each moment to find, as iteration proceeds, the node most likely to be merged into a group and executes the next range query on it, which speeds up convergence.

To solve the above technical problems, the technical solution of the invention is as follows:

A large-scale data group search method using timing-based density clustering, comprising the following steps:

S1: for the given nodes, define three node states and the original clusters; the states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;

S2: according to the relationships among the original clusters, construct the dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;

S3: according to the degree of connection between the representatives of different original clusters, find the strongly connected components and merge them;

S4: select, among the nodes of the merged original clusters, the nodes on which to execute range queries;

S5: execute range queries on the selected nodes and update the dendrogram;

S6: re-check the noise points from S1, and output the reviewed noise points and the resulting clusters.

In a preferred solution, S1 includes the following:

Range queries are executed on randomly selected initial-state nodes, as follows:

If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;

If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted border points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed border point.

The range query operation is repeated until all initial-state nodes have been processed. At that point every node is in either the unexecuted state or the executed state, and the nodes have been gathered into many different original clusters.
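The S1 procedure above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `State`, `build_original_clusters`, and `range_query` are hypothetical, the points are assumed to be 2-D tuples, the neighborhood radius `eps` is an assumed parameter (the text names only the threshold μ), and the brute-force range query counts the queried node itself as a neighbor, following common DBSCAN practice.

```python
from enum import Enum
import random


class State(Enum):
    INITIAL = "initial"
    UNEXECUTED = "unexecuted"   # labelled, but no range query run on it yet
    EXECUTED = "executed"       # a range query has been run on it


def build_original_clusters(points, eps, mu, seed=0):
    """Run range queries on initial-state nodes until none remain (step S1)."""
    n = len(points)
    state = [State.INITIAL] * n
    role = [None] * n            # "core", "border" or "noise"
    clusters = {}                # executed core index -> set of member indices
    noise_list = []              # the noise sequence L

    def range_query(i):
        # Brute-force eps-neighbourhood; includes i itself (assumption).
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

    rng = random.Random(seed)
    while True:
        initial = [i for i in range(n) if state[i] is State.INITIAL]
        if not initial:
            break
        p = rng.choice(initial)              # random initial-state node
        neighbours = range_query(p)
        state[p] = State.EXECUTED
        if len(neighbours) < mu:             # too few neighbours: noise
            role[p] = "noise"
            noise_list.append(p)
            continue
        role[p] = "core"                     # at least mu neighbours: core
        clusters[p] = set(neighbours)
        for q in neighbours:
            if state[q] is State.INITIAL:
                state[q], role[q] = State.UNEXECUTED, "border"
            elif role[q] == "noise":
                role[q] = "border"           # executed noise -> executed border
    return state, role, clusters, noise_list
```

When the loop ends, no node is left in the initial state, and each entry of `clusters` corresponds to one original cluster keyed by its executed core point.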

In a preferred solution, S2 includes the following:

Strongly connected means that original cluster a and original cluster b are density-connected;

Weakly connected means that the node sets of original cluster a and original cluster b intersect;

Any other case is defined as original cluster a and original cluster b being not connected.

In this preferred solution, at time t, original clusters whose state(a, b) is strongly connected are more likely than weakly connected or unconnected ones to merge into the same group at the next range query (that is, at time t+1).
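The three connection states defined above could be computed along the following lines. This is a hedged sketch: the function name `connection_state` and its arguments are hypothetical, and the test for strong connectivity (the two clusters share a node already known to be a core point) is an assumption based on the standard DBSCAN notion of density-connectivity; the source states only that strongly connected clusters are density-connected.

```python
def connection_state(cluster_a, cluster_b, known_cores):
    """Classify the link between two original clusters (sets of node ids)."""
    shared = cluster_a & cluster_b
    if not shared:
        return "none"            # node sets do not intersect
    # Assumption: a shared node that is a known core point makes the
    # two clusters density-connected, hence strongly connected.
    if shared & known_cores:
        return "strong"
    return "weak"                # node sets intersect, nothing more is known
```

For example, two clusters `{1, 2, 3}` and `{3, 4, 5}` are classified as strongly connected if node 3 is a known core point and as weakly connected otherwise.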

In a preferred solution, S3 includes the following:

Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected, and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original clusters.

In this preferred solution, because the dendrogram is very small, the time spent on label propagation is greatly reduced, which improves the performance of the algorithm. Merging the clusters represented by strongly connected nodes into merged original clusters also greatly reduces the number of representative nodes from S2 in the dendrogram, which further improves the efficiency of the algorithm.
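One way to merge the strongly connected components described above is a union-find (disjoint-set) structure over the cluster representatives. The source does not name a data structure, so this choice, the function name `merge_strong_components`, and the representation of `states` as a dictionary keyed by representative pairs are all assumptions made for illustration.

```python
def merge_strong_components(representatives, states):
    """Group cluster representatives that are linked by 'strong' edges.

    states maps a pair (a, b) of representatives to "strong", "weak" or "none".
    """
    parent = {r: r for r in representatives}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), s in states.items():
        if s == "strong":
            parent[find(a)] = find(b)       # union the two components

    merged = {}
    for r in representatives:
        merged.setdefault(find(r), set()).add(r)
    # Sort by smallest member so the result order is deterministic.
    return sorted(merged.values(), key=min)
```

Each returned set lists the representatives whose original clusters would be combined into one merged original cluster; weak and missing edges leave components separate.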

In a preferred solution, S4 includes the following:

In the dendrogram, all unexecuted nodes in the merged original clusters are assessed for importance: the importance of each unexecuted node is calculated, the nodes are sorted by importance, and range queries are executed starting from the node with the highest importance;

At time t, the statistic s(a) of a node is expressed by the following formula:

where n(a) is the number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of nodes;

At time t, the degree of a node is expressed by the following formula:

$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$

At time t, the importance of a node is expressed by the following formula:

where p(a) is the original cluster of a, and ne(a) is the total number of neighbor nodes of a within its cluster.

In this preferred solution, the fewer weakly connected and unconnected pairs there are in the dendrogram, the closer the algorithm is to convergence, the fewer queries are needed, and the higher the performance.
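The selection step in S4 (rank the unexecuted nodes by importance and query the highest first) can be sketched as below. The importance formula itself is not reproduced in this text, so the sketch takes the scoring function as a caller-supplied argument rather than guessing it; `pick_query_order` and its parameters are hypothetical names.

```python
import heapq


def pick_query_order(unexecuted_nodes, importance, k=None):
    """Return unexecuted nodes ordered from highest to lowest importance.

    importance is any callable mapping a node to a numeric score;
    k limits the result to the k most important nodes (all by default).
    """
    k = len(unexecuted_nodes) if k is None else k
    return heapq.nlargest(k, unexecuted_nodes, key=importance)
```

Passing a small `k` is useful when only the next few range queries are needed in an iteration, since `heapq.nlargest` then avoids sorting the full list.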

In a preferred solution, S5 includes the following:

If the query finds unexecuted border points in an original cluster, those border points are merged into the original cluster; otherwise, they are defined as unexecuted core points. The node states in the dendrogram are then updated in a loop, according to the execution state of the nodes, until the algorithm converges.

In a preferred solution, the convergence criterion is as follows: when state(a, b) on every edge between nodes in the dendrogram is strongly connected, the algorithm is defined as having converged.
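The convergence criterion can be written as a one-line check. This sketch assumes the dendrogram's edges are stored in a dictionary mapping representative pairs to their state strings; both that layout and the name `has_converged` are illustrative.

```python
def has_converged(states):
    """True when every recorded edge state(a, b) is 'strong'."""
    # An empty edge set counts as converged: nothing remains to resolve.
    return all(s == "strong" for s in states.values())
```

The update loop in S5 would call such a check after each round of range queries and stop as soon as it returns true.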

In a preferred solution, S6 includes the following:

All nodes in the noise sequence L are scanned to check whether each node is present in the dendrogram: if it is, the node is changed to a border point; if it is not, it is marked as a noise point.
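The noise review in S6 amounts to one pass over the noise sequence L. The sketch below assumes the set of nodes present in the dendrogram is available as a Python set; `review_noise` and its argument names are hypothetical.

```python
def review_noise(noise_list, graph_nodes):
    """Split the noise sequence L into border points and confirmed noise."""
    graph_nodes = set(graph_nodes)          # constant-time membership tests
    borders = [p for p in noise_list if p in graph_nodes]
    noise = [p for p in noise_list if p not in graph_nodes]
    return borders, noise
```

Nodes returned in `borders` were provisionally labelled noise but later turned out to belong to a cluster; the remaining nodes are output as noise.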

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

(1) When searching for groups, the invention provides an active learning mechanism that learns information about nodes on which no range query has been executed and uses it to reduce the total number of range queries and the number of label propagations, improving search efficiency on large-scale data networks.

(2) The method applies timing techniques to node states, allowing the user to terminate the algorithm at a given iteration while still producing a corresponding correct result, which enables interaction with the user.

(3) The invention overcomes the limitations of grid mechanisms and does not require data structures with high time and space complexity, so it scales well to large-scale complex datasets.

(4) The invention establishes an efficient graph (network) data index structure and data management model and proposes a node contribution assessment algorithm, which effectively improves data access efficiency and reduces I/O and cross-domain communication, addressing the key problem of data-intensive access and diverse data query demands in high-energy physics.

Description of the drawings

Fig. 1 is a flow chart of the present embodiment.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

To better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, a large-scale data group search method using timing-based density clustering comprises the following steps:

S1: for the given nodes, define three node states and the original clusters; the states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;

S1 includes the following:

If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted border points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed border point;

The range query operation is repeated until all initial-state nodes have been processed;

S2: according to the relationships among the original clusters, construct the dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;

Strongly connected means that original cluster a and original cluster b are density-connected;

Weakly connected means that the node sets of original cluster a and original cluster b intersect;

Any other case is defined as original cluster a and original cluster b being not connected;

Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected, and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original clusters;

At time t, the statistic s(a) of a node is expressed by the following formula:

where n(a) is the number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of nodes;

At time t, the degree of a node is expressed by the following formula:

$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$

At time t, the importance of a node is expressed by the following formula:

where p(a) is the original cluster of a, and ne(a) is the total number of neighbor nodes of a within its cluster;

S5: execute range queries on the selected nodes and update the dendrogram;

If the query finds unexecuted border points in an original cluster, those border points are merged into the original cluster; otherwise, they are defined as unexecuted core points. The node states in the dendrogram are then updated in a loop, according to the execution state of the nodes, until the algorithm converges. The convergence criterion is as follows: when state(a, b) on every edge between nodes in the dendrogram is strongly connected, the algorithm is defined as having converged;

S6: re-check the noise points from S1, and output the reviewed noise points and the resulting clusters;

The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent. Obviously, the above embodiments are merely examples given to illustrate the invention clearly and do not limit its implementation. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A large-scale data group search method using timing-based density clustering, characterized by comprising the following steps:

S1: for the given nodes, define three node states and the original clusters; the states are the initial state, the unexecuted state, and the executed state; an original cluster is the set consisting of an executed core point and the neighbor nodes known to be density-connected to that core point;

S2: according to the relationships among the original clusters, construct the dendrogram over the original clusters, and define the degree of connection between the representatives of different original clusters as state(a, b), which takes one of three values: strongly connected, weakly connected, or not connected; a and b are the representatives of their respective original clusters;

S5: execute range queries on the selected nodes and update the dendrogram;

2. The large-scale data group search method according to claim 1, characterized in that S1 includes the following:

If the number of neighbor nodes of the selected node is less than μ, the selected node is marked as an executed noise point and stored in the noise sequence L;

If the number of neighbor nodes of the selected node is not less than μ, the selected node is marked as an executed core point, and the other nodes in its neighborhood are marked as unexecuted border points; if any of the other nodes in the core point's neighborhood is an executed noise point, that noise point is updated to an executed border point;

The range query operation is repeated until all initial-state nodes have been processed.

3. The large-scale data group search method according to claim 1 or 2, characterized in that S2 includes the following:

4. The large-scale data group search method according to claim 3, characterized in that S3 includes the following:

Based on the degree of connection between the representatives of the original clusters obtained in S2, find the nodes whose state(a, b) is strongly connected, and record the real-time state of these nodes in a separate graph, defined as the dendrogram; merge the clusters represented by the strongly connected nodes, and define the result as the merged original clusters.

5. The large-scale data group search method according to claim 4, characterized in that S4 includes the following:

In the dendrogram, all unexecuted nodes in the merged original clusters are assessed for importance: the importance of each unexecuted node is calculated, the nodes are sorted by importance, and range queries are executed starting from the node with the highest importance;

At time t, the statistic s(a) of a node is expressed by the following formula:

where n(a) is the number of unexecuted nodes in original cluster a, np(a) is the total number of nodes in original cluster a, and n is the total number of nodes;

At time t, the degree of a node is expressed by the following formula:

$$d(a) = \sum_{state(a,b)=\text{weak}} s(a) + \sum_{state(a,b)=\text{none}} s(a)$$

At time t, the importance of a node is expressed by the following formula:

6. The large-scale data group search method according to claim 1, 2, 4 or 5, characterized in that S5 includes the following:

If the query finds unexecuted border points in an original cluster, those border points are merged into the original cluster; otherwise, they are defined as unexecuted core points. The node states in the dendrogram are then updated in a loop, according to the execution state of the nodes, until the algorithm converges.

7. The large-scale data group search method according to claim 6, characterized in that the convergence criterion is as follows: when state(a, b) on every edge between nodes in the dendrogram is strongly connected, the algorithm is defined as having converged.

8. The large-scale data group search method according to claim 1, 2, 4, 5 or 7, characterized in that S6 includes the following:

All nodes in the noise sequence L are scanned to check whether each node is present in the dendrogram: if it is, the node is changed to a border point; if it is not, it is marked as a noise point.