CN107247776A

CN107247776A - Method for identifying similarity in cluster analysis

Info

Publication number: CN107247776A
Application number: CN201710432635.2A
Authority: CN
Inventors: 王星华; 周亚武; 陈云龙; 许炫壕
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2017-10-13

Abstract

The invention discloses a method for identifying similarity in cluster analysis, comprising: obtaining a first sequence and a second sequence; calculating the Euclidean distance between the elements of the first sequence, to which preset weights are pre-assigned, and the elements of the second sequence, to which the preset weights are pre-assigned; calculating a correlation coefficient between the i-th dimension element of the first sequence and the i-th dimension element of the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence; calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficients; and calculating the similarity between the two sequences from the grey correlation degree and the Euclidean distance using preset weight coefficients. By combining the Euclidean distance and the grey correlation degree between sequences through weight coefficients, the similarity reflects both the spatial distance between the two sequences and the similarity of their shapes; that is, the calculated similarity represents the "type similarity" and the "value similarity" between sequences at the same time.

Description

Method for identifying similarity in cluster analysis

Technical Field

The invention relates to the field of data mining, in particular to a method and a device for similarity identification in cluster analysis.

Background

With the advent of the big data era, massive and complicated data are accumulated in various fields, so that how to mine the potential value in the data becomes a research hotspot in the current big data environment. Among them, cluster analysis is widely applied to various fields such as weather forecast, electric power, finance, forestry, and the like.

Cluster analysis is a multivariate analysis method in mathematical statistics that quantitatively determines how closely samples are related by mathematical means, so that they can be classified objectively. The objects being clustered are referred to as samples, and the group of objects being clustered is referred to as a sample set. A similarity function serves as a tool for measuring the degree of similarity between sample data.

At present, common similarity functions include the Euclidean distance method and the grey correlation method. The Euclidean distance method is a static analysis method suited to the static analysis of research objects. It reflects only the spatial distance between two research objects, so it can ensure value similarity between sequences but cannot fully ensure similarity of their shapes or contours; that is, it cannot ensure type similarity. The grey correlation method is a dynamic analysis method suited to the dynamic processes of research objects. It can analyze the trends of change among research objects, so it can ensure type similarity but cannot ensure value similarity. In summary, both methods express similarity incompletely: neither can express the "type similarity" and the "value similarity" between sequences at the same time.

Disclosure of Invention

The invention aims to provide a method for identifying similarity in cluster analysis, and aims to solve the problem of incomplete similarity expression in the prior art.

In order to solve the above technical problem, the present invention provides a method for similarity recognition in cluster analysis, including:

acquiring a first sequence and a second sequence;

calculating Euclidean distances between elements with preset weights pre-distributed in the first sequence and elements with the preset weights pre-distributed in the second sequence;

calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, …, n;

calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;

and calculating the similarity between the first sequence and the second sequence by using a preset weight coefficient according to the grey correlation degree and the Euclidean distance.

Optionally, the calculating a euclidean distance between an element in the first sequence to which a preset weight is pre-assigned and an element in the second sequence to which the preset weight is pre-assigned includes:

calculating, based on the Euclidean distance model

$$\sqrt{\sum_{i=1}^{n} (\omega_i x_i - \omega_i y_i)^2}$$

the Euclidean distance between the elements of the first sequence pre-assigned with the preset weights and the elements of the second sequence pre-assigned with the preset weights;

wherein the first sequence is X = [x₁, x₂, …, x_n] and the second sequence is Y = [y₁, y₂, …, y_n]; ω_i is the preset weight, ω_i ∈ [0, 1]; and n is the total number of elements in the sequence.

Optionally, the calculating a correlation coefficient between the ith dimension element in the first sequence and the ith dimension element in the second sequence according to the increment of the ith dimension element in the first sequence and the increment of the ith dimension element in the second sequence includes:

calculating the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence;

calculating, based on a correlation coefficient model, the correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence;

wherein, when […] is zero, the correlation coefficient equals 1; and when […] is zero, […].

Optionally, the calculating a grey correlation degree between the first sequence and the second sequence according to the correlation coefficient includes:

calculating, based on a grey correlation degree model, the grey correlation degree between the first sequence and the second sequence; wherein (i) is the correlation coefficient.

Optionally, the calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance by using a preset weight coefficient includes:

calculating, based on a similarity recognition model, the similarity between the first sequence and the second sequence;

wherein μ and ν are both weight coefficients, and μ + ν = 1.

Optionally, the weight coefficients are all 0.5.

The invention provides a method for identifying similarity in cluster analysis, comprising: obtaining a first sequence and a second sequence; calculating the Euclidean distance between the elements of the first sequence pre-assigned with preset weights and the elements of the second sequence pre-assigned with the preset weights; calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, …, n; calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficients; and calculating the similarity between the first sequence and the second sequence from the grey correlation degree and the Euclidean distance using preset weight coefficients. Because the method combines the Euclidean distance and the grey correlation degree between the sequences through weight coefficients, the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their shapes or contours; that is, the calculated similarity represents the type similarity and the value similarity between the sequences at the same time.

Drawings

In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

Fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention, where the method includes the following steps:

Step 101: Obtain a first sequence and a second sequence.

It should be noted that the first sequence and the second sequence may refer to two research objects of cluster analysis. The first sequence may specifically be X = [x₁, x₂, …, x_n], and the second sequence may specifically be Y = [y₁, y₂, …, y_n].

Step 102: Calculate the Euclidean distance between the elements of the first sequence pre-assigned with preset weights and the elements of the second sequence pre-assigned with the preset weights.

Specifically, each element in a sequence is assigned a weight in advance, and the Euclidean distance between the corresponding elements of the two sequences is then calculated according to the definition of the Euclidean distance. For example, let the j-th element x_j of the first sequence X have weight w_j, and let the j-th element y_j of the second sequence Y also have weight w_j. First, (w_j x_j − w_j y_j)² is calculated; the same is done in turn for every pair of corresponding elements; the squared terms are summed; and the Euclidean distance between the sequences is obtained as the square root of that sum.

As a specific implementation, the above process of calculating the Euclidean distance between the elements of the first sequence pre-assigned with the preset weights and the elements of the second sequence pre-assigned with the preset weights may specifically be: calculating the Euclidean distance based on the Euclidean distance model

$$\sqrt{\sum_{i=1}^{n} (\omega_i x_i - \omega_i y_i)^2}$$

wherein the first sequence is X = [x₁, x₂, …, x_n] and the second sequence is Y = [y₁, y₂, …, y_n]; ω_i is the preset weight, ω_i ∈ [0, 1]; and n is the total number of elements in the sequence.

Obviously, the value of ω_i may be set according to actual conditions and is not limited herein.
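As an illustration of Step 102, the weighted Euclidean distance described above can be sketched in Python. This is a minimal sketch rather than the patented implementation: the function name `weighted_euclidean` and the sample values are hypothetical, and the code follows the element-wise description given above (scale each element by its weight, square the differences, sum them, take the square root).

```python
import math


def weighted_euclidean(x, y, w):
    """Euclidean distance between two sequences after element-wise weighting.

    x, y: sequences of equal length n.
    w:    preset weights, one per element, each in [0, 1].
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    if any(wi < 0 or wi > 1 for wi in w):
        raise ValueError("each weight must lie in [0, 1]")
    # Sum of (w_i * x_i - w_i * y_i)^2 over all elements, then the square root.
    return math.sqrt(sum((wi * xi - wi * yi) ** 2 for xi, yi, wi in zip(x, y, w)))


# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 7.0]
w = [1.0, 0.5, 0.5]
d = weighted_euclidean(x, y, w)  # squared terms are 0, 1 and 4, so d = sqrt(5)
```

Setting every weight to 1 reduces this to the ordinary Euclidean distance; smaller weights reduce the influence of the corresponding dimension.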

Step 103: Calculate a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, …, n.

It should be noted that the increment may be obtained by subtracting the previous element from the current element of the sequence, for example x_i − x_{i−1}. The increment may, of course, also be a difference between non-adjacent elements in the sequence.

As a specific implementation, the above process of calculating the correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence may specifically be: calculating the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence; and calculating, based on a correlation coefficient model, the correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence; wherein D_xi = x_i − x_{i−1} and D_yi = y_i − y_{i−1}; when […] is zero, the correlation coefficient equals 1; and when […] is zero, […].

When λ_i equals 1, the correlation coefficient is greater than 0, indicating that the i-th dimension elements of sequences X and Y change in the same direction relative to their (i−1)-th dimension elements; when λ_i equals −1, the correlation coefficient is less than 0, indicating that the i-th dimension elements of sequences X and Y change in opposite directions relative to their (i−1)-th dimension elements.

The traditional grey correlation degree can reflect only same-direction trend changes between sequences, in which both sequences change in the positive direction or both change in the negative direction. Here, a grey correlation calculation model with a sign function is introduced, so that it can reflect both opposite-direction and same-direction trend changes.

It can be seen that introducing the sign function λ_i into the grey correlation part allows positive and negative correlation between sequences to be reflected, which improves the expressive power of the similarity function.
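The role of the increments and of the sign function λ_i can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the helper names `increments` and `direction_sign` are hypothetical, the increments are taken between adjacent elements, and treating a zero product of increments as "same direction" is an assumption of this example, since the exact definition of λ_i is not recoverable from this text.

```python
def increments(seq):
    """Adjacent differences seq[i] - seq[i-1], one value per i = 2..n (1-based)."""
    return [cur - prev for prev, cur in zip(seq, seq[1:])]


def direction_sign(dx, dy):
    """Assumed sign function: +1 if both increments point the same way, else -1.

    Returning +1 when the product is zero is an assumption of this sketch.
    """
    return 1 if dx * dy >= 0 else -1


# Hypothetical sample sequences, chosen only for illustration.
x = [1.0, 3.0, 2.0, 5.0]
y = [2.0, 5.0, 6.0, 4.0]

dx = increments(x)  # [2.0, -1.0, 3.0]
dy = increments(y)  # [3.0, 1.0, -2.0]
signs = [direction_sign(a, b) for a, b in zip(dx, dy)]  # [1, -1, -1]
```

In this example the two sequences rise together at the first step and move in opposite directions at the second and third steps, which is exactly the distinction a sign-free correlation coefficient would not capture.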

Step 104: Calculate the grey correlation degree between the first sequence and the second sequence according to the correlation coefficients.

Specifically, after the correlation coefficient of each element between the sequences has been calculated, the grey correlation degree may be calculated based on those correlation coefficients.

As a specific implementation, the above process of calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient may specifically be: calculating, based on a grey correlation degree model, the grey correlation degree between the first sequence and the second sequence; wherein (i) is the correlation coefficient.
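For illustration, aggregating per-dimension correlation coefficients into a single grey correlation degree might look like the following Python sketch. The grey correlation degree model itself is not reproduced in this text, so the sketch assumes the arithmetic mean that is conventional in grey relational analysis; the function name `grey_correlation_degree` and the coefficient values are hypothetical.

```python
def grey_correlation_degree(coefficients):
    """Aggregate per-dimension correlation coefficients into one degree.

    Assumption of this sketch: the degree is the arithmetic mean of the
    coefficients. Signed coefficients are allowed, so opposite-direction
    changes lower the result.
    """
    if not coefficients:
        raise ValueError("at least one correlation coefficient is required")
    return sum(coefficients) / len(coefficients)


# Hypothetical coefficients for i = 2, 3, 4; a negative value marks a
# dimension in which the two sequences changed in opposite directions.
coeffs = [1.0, -0.5, 0.75]
degree = grey_correlation_degree(coeffs)  # (1.0 - 0.5 + 0.75) / 3
```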

Step 105: Calculate the similarity between the first sequence and the second sequence from the grey correlation degree and the Euclidean distance using preset weight coefficients.

The preset weight coefficients may refer to the weight coefficient of the grey correlation degree and the weight coefficient of the Euclidean distance. Specifically, the weight coefficient of the grey correlation degree is μ, the weight coefficient of the Euclidean distance is ν, and μ + ν = 1.

Optionally, μ = 0.5 and ν = 0.5. Of course, when more emphasis on the "type similarity" between sequences is needed, the value of μ can be increased accordingly; when more emphasis on the "value similarity" between sequences is needed, the value of ν can be increased accordingly. That is, the values of μ and ν can be adjusted according to actual conditions, which is not limited herein.

As a specific implementation, the process of calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance by using preset weight coefficients may specifically be: calculating, based on a similarity recognition model, the similarity between the first sequence and the second sequence; wherein μ and ν are both weight coefficients, and μ + ν = 1.

It can be seen that the similarity recognition model has two parts, one of which is a gray correlation between two sequences, which can represent the similarity of morphology or contour between the sequences, i.e. the "type similarity"; the other part is the Euclidean distance between two sequences, which can indicate the spatial distance between the sequences, i.e., the degree of "similarity in value". The Euclidean distance function and the grey correlation degree are organically combined together through the weight coefficient, so that the limitation of a single method in the prior art can be overcome, and the expression of the similarity is more complete.
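The trade-off controlled by μ and ν can be sketched in Python as follows. The similarity recognition model itself is not reproduced in this text, so the sketch makes its own assumptions: the function name `combined_similarity` is hypothetical, and the Euclidean distance is converted into a value-similarity term with 1 / (1 + d), a mapping chosen here only so that a smaller distance yields a larger similarity. The patent's actual model may combine the two parts differently.

```python
def combined_similarity(grey_degree, distance, mu=0.5, nu=0.5):
    """Weighted combination of a shape term and a value term.

    grey_degree: grey correlation degree between the two sequences.
    distance:    Euclidean distance between the two sequences (>= 0).
    mu, nu:      weight coefficients with mu + nu = 1.
    """
    if abs(mu + nu - 1.0) > 1e-9:
        raise ValueError("mu + nu must equal 1")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # Assumption of this sketch: map distance to a similarity in (0, 1].
    value_similarity = 1.0 / (1.0 + distance)
    return mu * grey_degree + nu * value_similarity


# Hypothetical inputs: grey correlation degree 0.8, Euclidean distance 1.0.
s_equal = combined_similarity(0.8, 1.0)                  # 0.5*0.8 + 0.5*0.5
s_shape = combined_similarity(0.8, 1.0, mu=0.9, nu=0.1)  # 0.9*0.8 + 0.1*0.5
```

With equal weights the shape term and the value term contribute equally; raising μ moves the result toward the grey correlation degree, and raising ν moves it toward the distance-based term.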

The similarity recognition method for cluster analysis provided by the embodiment of the invention obtains a first sequence and a second sequence; calculates the Euclidean distance between the elements of the first sequence pre-assigned with preset weights and the elements of the second sequence pre-assigned with the preset weights; calculates a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, …, n; calculates the grey correlation degree between the first sequence and the second sequence according to the correlation coefficients; and calculates the similarity between the first sequence and the second sequence from the grey correlation degree and the Euclidean distance using preset weight coefficients. Because the method combines the Euclidean distance and the grey correlation degree between sequences through weight coefficients, the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their shapes or contours; that is, the calculated similarity represents the type similarity and the value similarity between the sequences at the same time.

The similarity recognition method for cluster analysis provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method for similarity recognition in cluster analysis, comprising:

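acquiring a first sequence and a second sequence;

calculating the Euclidean distance between the elements of the first sequence pre-assigned with preset weights and the elements of the second sequence pre-assigned with the preset weights;

calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, …, n;

calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;

and calculating the similarity between the first sequence and the second sequence by using a preset weight coefficient according to the grey correlation degree and the Euclidean distance.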

2. The method of claim 1, wherein the calculating of the euclidean distance between the elements within the first sequence pre-assigned with the preset weights and the elements within the second sequence pre-assigned with the preset weights comprises:

wherein the first sequence is X = [x₁, x₂, …, x_n] and the second sequence is Y = [y₁, y₂, …, y_n]; ω_i is the preset weight, ω_i ∈ [0, 1]; and n is the total number of elements in the sequence.

3. The method of claim 2, wherein calculating the correlation coefficient between the ith dimension element in the first sequence and the ith dimension element in the second sequence based on the increment of the ith dimension element in the first sequence and the increment of the ith dimension element in the second sequence comprises:

4. The method of claim 3, wherein said calculating a grey correlation degree between said first sequence and said second sequence based on said correlation coefficients comprises:

5. The method according to any one of claims 1 to 4, wherein the calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance by using a preset weight coefficient comprises:

wherein μ and ν are both weight coefficients, and μ + ν = 1.

6. The method of claim 5, wherein the weighting factors are each 0.5.