CN107247776A - A method for similarity identification in cluster analysis - Google Patents
- Publication number
- CN107247776A (application CN201710432635.2A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- similarity
- calculating
- dimension element
- ith
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Analysing Materials By The Use Of Radiation (AREA)
Abstract
The invention discloses a method for similarity identification in cluster analysis. The method obtains a first sequence and a second sequence; calculates the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight; calculates, from the increment of the ith-dimension element in the first sequence and the increment of the ith-dimension element in the second sequence, a correlation coefficient between those two elements; calculates the grey correlation degree between the first sequence and the second sequence from the correlation coefficients; and calculates the similarity between the two sequences from the grey correlation degree and the Euclidean distance using preset weight coefficients. By combining the Euclidean distance and the grey correlation degree through the weight coefficients, the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their shapes; that is, the calculated similarity expresses the degree of "type similarity" and the degree of "value similarity" between the sequences at the same time.
Description
Technical Field
The invention relates to the field of data mining, in particular to a method and a device for similarity identification in cluster analysis.
Background
With the advent of the big data era, massive and complicated data are accumulated in various fields, so that how to mine the potential value in the data becomes a research hotspot in the current big data environment. Among them, cluster analysis is widely applied to various fields such as weather forecast, electric power, finance, forestry, and the like.
Cluster analysis is a multivariate analysis method in mathematical statistics that uses mathematical methods to quantitatively determine how closely related samples are, so that they can be classified objectively. The objects being clustered are usually called samples, and the group of objects being clustered is called a sample set. A similarity function serves as a tool for measuring the degree of similarity between sample data.
At present, common similarity functions include the Euclidean distance method and the grey correlation method. The Euclidean distance method is a static analysis method suited to static analysis of research objects. It reflects only the spatial distance between two research objects, so it can ensure value similarity between sequences but cannot fully ensure similarity of their shapes or contours, that is, type similarity. The grey correlation method is a dynamic analysis method suited to the dynamic behaviour of research objects. It can analyse the trend of change among research objects, so it can ensure type similarity but not value similarity. In summary, both methods express similarity incompletely: neither can express the "type similarity" and the "value similarity" between sequences at the same time.
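The limitation of a pure distance measure can be illustrated with a small sketch. The sequences and function names below are hypothetical and chosen only for illustration: two candidate sequences lie at the same Euclidean distance from a reference sequence, yet only one of them follows its shape.

```python
import math

def euclidean(x, y):
    """Plain (unweighted) Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def increments(seq):
    """Difference between each element and the element before it."""
    return [cur - prev for prev, cur in zip(seq, seq[1:])]

x = [1, 2, 3, 4]
y_shifted = [2, 3, 4, 5]  # same shape as x, offset by 1
y_zigzag = [2, 1, 4, 3]   # different shape, same distance from x

print(euclidean(x, y_shifted), euclidean(x, y_zigzag))  # 2.0 2.0
print(increments(x))          # [1, 1, 1]
print(increments(y_shifted))  # [1, 1, 1]
print(increments(y_zigzag))   # [-1, 3, -1]
```

The distance alone cannot separate the two candidates, while the increments show that only `y_shifted` rises and falls together with `x`.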
Disclosure of Invention
The invention aims to provide a method for identifying similarity in cluster analysis, and aims to solve the problem of incomplete similarity expression in the prior art.
In order to solve the above technical problem, the present invention provides a method for similarity recognition in cluster analysis, including:
acquiring a first sequence and a second sequence;
calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, the elements being pre-assigned preset weights;
calculating a correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence according to the increment of the ith-dimension element in the first sequence and the increment of the ith-dimension element in the second sequence, where i = 2, 3, 4, ..., n;
calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;
and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
Optionally, calculating the Euclidean distance between the pre-weighted elements of the first sequence and the pre-weighted elements of the second sequence includes:
calculating, based on a Euclidean distance model, the Euclidean distance between the elements of the first sequence pre-assigned the preset weights and the elements of the second sequence pre-assigned the preset weights;
wherein the first sequence is $X = [x_1, x_2, \ldots, x_n]$ and the second sequence is $Y = [y_1, y_2, \ldots, y_n]$; $\omega_i$ is the preset weight, with $\omega_i \in [0, 1]$; and $n$ is the total number of elements in the sequence.
Optionally, the calculating a correlation coefficient between the ith dimension element in the first sequence and the ith dimension element in the second sequence according to the increment of the ith dimension element in the first sequence and the increment of the ith dimension element in the second sequence includes:
calculating the increment $D_{x_i}$ of the ith-dimension element in the first sequence and the increment $D_{y_i}$ of the ith-dimension element in the second sequence;
calculating, based on a correlation coefficient model, a correlation coefficient between the ith-dimension element of the first sequence and the ith-dimension element of the second sequence;
wherein $D_{x_i} = x_i - x_{i-1}$ and $D_{y_i} = y_i - y_{i-1}$; the model sets the correlation coefficient to 1 under one condition and gives a separate expression under a second condition (the model and both conditions are formula images not reproduced in this text).
Optionally, calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient includes:
calculating, based on a grey correlation degree model, the grey correlation degree between the first sequence and the second sequence from the correlation coefficient of each dimension i (the model is a formula image not reproduced in this text).
Optionally, calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients, includes:
calculating, based on a similarity recognition model, the similarity between the first sequence and the second sequence (the model is a formula image not reproduced in this text);
wherein $\mu$ and $\nu$ are both weight coefficients, and $\mu + \nu = 1$.
Optionally, the weight coefficients are both 0.5.
The invention provides a method for similarity identification in cluster analysis, which comprises: obtaining a first sequence and a second sequence; calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, the elements being pre-assigned preset weights; calculating a correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence according to the increments of those elements, where i = 2, 3, 4, ..., n; calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient; and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients. Because the weight coefficients combine the Euclidean distance and the grey correlation degree, the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their shapes or contours; that is, the calculated similarity expresses the type similarity and the value similarity between the sequences at the same time.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention, where the method includes the following steps:
Step 101: Obtain a first sequence and a second sequence.
It should be noted that the first sequence and the second sequence may be two research objects of the cluster analysis. The first sequence may specifically be $X = [x_1, x_2, \ldots, x_n]$, and the second sequence may specifically be $Y = [y_1, y_2, \ldots, y_n]$.
Step 102: Calculate the Euclidean distance between the elements of the first sequence and the elements of the second sequence, the elements being pre-assigned preset weights.
Specifically, each element in a sequence is assigned a weight in advance, and the Euclidean distance between corresponding elements of the two sequences is then calculated according to the definition of the Euclidean distance. For example, the jth element $x_j$ of the first sequence $X$ has weight $w_j$, and the jth element $y_j$ of the second sequence $Y$ also has weight $w_j$. First calculate $(w_j x_j - w_j y_j)^2$; repeat the calculation for each pair of elements in turn; sum the squared terms; and finally take the square root to obtain the Euclidean distance between the sequences.
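Written out, the procedure described for Step 102 corresponds to the expression below. This is a reconstruction from the prose, not the published formula: the name $d(X, Y)$ for the distance is a label chosen here for illustration.

```latex
d(X, Y) = \sqrt{\sum_{j=1}^{n} \left( w_j x_j - w_j y_j \right)^{2}}
        = \sqrt{\sum_{j=1}^{n} w_j^{2} \left( x_j - y_j \right)^{2}}
```

The second form shows that weighting both elements by the same $w_j$ is equivalent to scaling each squared difference by $w_j^{2}$.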
As a specific implementation, the Euclidean distance between the pre-weighted elements of the first sequence and the pre-weighted elements of the second sequence may be calculated based on a Euclidean distance model, wherein the first sequence is $X = [x_1, x_2, \ldots, x_n]$ and the second sequence is $Y = [y_1, y_2, \ldots, y_n]$; $\omega_i$ is the preset weight, with $\omega_i \in [0, 1]$; and $n$ is the total number of elements in the sequence.
The value of $\omega_i$ may be set according to actual conditions and is not limited herein.
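A minimal Python sketch of the weighted distance in Step 102 follows, assuming the form given in the prose (weight each pair of elements, square the difference, sum, take the square root). The function name `weighted_euclidean` and the sample data are hypothetical.

```python
import math

def weighted_euclidean(x, y, weights):
    """Euclidean distance after scaling each pair of elements by its preset weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    return math.sqrt(sum((w * a - w * b) ** 2 for a, b, w in zip(x, y, weights)))

x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 7.0]
print(weighted_euclidean(x, y, [1.0, 1.0, 1.0]))  # sqrt(0 + 4 + 16)
print(weighted_euclidean(x, y, [1.0, 0.5, 0.0]))  # only the middle pair contributes: 1.0
```

Setting a weight to 0 removes that dimension from the distance, while a weight of 1 leaves it unchanged.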
Step 103: Calculate a correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence according to the increment of the ith-dimension element in the first sequence and the increment of the ith-dimension element in the second sequence, where i = 2, 3, 4, ..., n.
It should be noted that the increment may be obtained by subtracting the previous element from the current element of the sequence, for example $D_{x_i} = x_i - x_{i-1}$. The increment may, of course, also be a difference between non-adjacent elements of the sequence.
As a specific implementation, the correlation coefficient may be calculated as follows: calculate the increment $D_{x_i}$ of the ith-dimension element in the first sequence and the increment $D_{y_i}$ of the ith-dimension element in the second sequence; then calculate, based on a correlation coefficient model, the correlation coefficient between the ith-dimension element of the first sequence and the ith-dimension element of the second sequence; wherein $D_{x_i} = x_i - x_{i-1}$ and $D_{y_i} = y_i - y_{i-1}$. The model sets the correlation coefficient to 1 under one condition and gives a separate expression under a second condition (the model and both conditions are formula images not reproduced in this text).
When $\lambda_i$ equals 1, the correlation coefficient is greater than 0, indicating that the ith-dimension elements of the sequences X and Y change in the same direction (both positive or both negative) relative to their (i-1)th-dimension elements. When $\lambda_i$ equals -1, the correlation coefficient is less than 0, indicating that the ith-dimension elements of X and Y change in opposite directions relative to their (i-1)th-dimension elements.
The traditional grey correlation degree can reflect only same-direction trends between sequences, that is, changes that are both positive or both negative. Here a grey correlation calculation model with a sign function is introduced, so that the model can reflect both same-direction and opposite-direction changes in trend.
Introducing the sign function $\lambda_i$ into the grey correlation part therefore allows both positive and negative correlation between sequences to be reflected, which improves the expressive power of the similarity function.
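The role of the increments and of the sign function can be sketched in Python. This sketch computes only the direction indicator, not the full correlation coefficient, whose model is not available in this text. The function names are hypothetical, and treating a zero increment as "same direction" is an assumption made here, not something the description states.

```python
def increments(seq):
    """Adjacent increments D_i = s_i - s_(i-1) for i = 2..n, returned as n-1 values."""
    return [cur - prev for prev, cur in zip(seq, seq[1:])]

def direction_signs(x, y):
    """+1 where both sequences move in the same direction, -1 where they move oppositely.

    Assumption: a zero increment in either sequence counts as 'same direction' (+1).
    """
    return [1 if dx * dy >= 0 else -1
            for dx, dy in zip(increments(x), increments(y))]

x = [1, 3, 2, 5]
y = [2, 4, 5, 6]
print(increments(x))          # [2, -1, 3]
print(increments(y))          # [2, 1, 1]
print(direction_signs(x, y))  # [1, -1, 1]
```

In the second step `x` falls while `y` rises, so the indicator is -1 there; a measure that ignores the sign would not distinguish that step from the others.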
Step 104: Calculate the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient.
Specifically, after the correlation coefficient of each pair of corresponding elements of the two sequences has been calculated, the grey correlation degree may be calculated from those coefficients.
As a specific implementation, the grey correlation degree between the first sequence and the second sequence may be calculated based on a grey correlation degree model applied to the correlation coefficient of each dimension i (the model is a formula image not reproduced in this text).
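The grey correlation degree model itself is not available in this text. In classical grey relational analysis the per-dimension coefficients are aggregated by their arithmetic mean, and the sketch below assumes that form; it may differ from the model in the original filing. The function name and the coefficient values are hypothetical.

```python
def grey_correlation_degree(coefficients):
    """Aggregate per-dimension correlation coefficients by their arithmetic mean (assumed form)."""
    if not coefficients:
        raise ValueError("at least one coefficient is required")
    return sum(coefficients) / len(coefficients)

# Hypothetical coefficients for i = 2, 3, 4; the negative value marks an opposite-direction change.
coeffs = [0.5, -0.25, 0.75]
print(grey_correlation_degree(coeffs))
```

With signed coefficients, opposite-direction steps pull the aggregate down rather than being counted as agreement.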
Step 105: Calculate the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
The preset weight coefficients are the weight coefficients of the grey correlation degree and of the Euclidean distance. Specifically, the weight coefficient of the grey correlation degree is $\mu$, the weight coefficient of the Euclidean distance is $\nu$, and $\mu + \nu = 1$.
Optionally, $\mu = 0.5$ and $\nu = 0.5$. When the "type similarity" between sequences needs more emphasis, the value of $\mu$ can be increased accordingly; when the "value similarity" needs more emphasis, the value of $\nu$ can be increased accordingly. That is, the values of $\mu$ and $\nu$ can be adjusted according to actual conditions and are not limited herein.
As a specific implementation, the similarity between the first sequence and the second sequence may be calculated based on a similarity recognition model (a formula image not reproduced in this text), wherein $\mu$ and $\nu$ are both weight coefficients and $\mu + \nu = 1$.
It can be seen that the similarity recognition model has two parts, one of which is a gray correlation between two sequences, which can represent the similarity of morphology or contour between the sequences, i.e. the "type similarity"; the other part is the Euclidean distance between two sequences, which can indicate the spatial distance between the sequences, i.e., the degree of "similarity in value". The Euclidean distance function and the grey correlation degree are organically combined together through the weight coefficient, so that the limitation of a single method in the prior art can be overcome, and the expression of the similarity is more complete.
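The two-part combination can be sketched as follows. The similarity recognition model is not available in this text, so the way the distance is turned into a similarity term here, `1 / (1 + distance)`, is purely an assumption for illustration, as are the function name and the sample values. Only the weighting by $\mu$ and $\nu$ with $\mu + \nu = 1$ comes from the description.

```python
def combined_similarity(grey_degree, distance, mu=0.5, nu=0.5):
    """Weighted sum of a shape term (grey correlation degree) and a value term.

    Assumption: the distance is mapped to a similarity with 1 / (1 + distance),
    so that a distance of 0 gives 1 and larger distances give smaller values.
    """
    if abs(mu + nu - 1.0) > 1e-9:
        raise ValueError("mu and nu must sum to 1")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    value_similarity = 1.0 / (1.0 + distance)
    return mu * grey_degree + nu * value_similarity

print(combined_similarity(0.8, 1.0))                  # equal weights
print(combined_similarity(0.8, 1.0, mu=0.9, nu=0.1))  # emphasises shape
```

Raising `mu` moves the result toward the grey correlation degree, and raising `nu` moves it toward the distance-based term, which mirrors the adjustment described for Step 105.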
The similarity recognition method for cluster analysis provided by the embodiment of the invention obtains a first sequence and a second sequence; calculates the Euclidean distance between the elements of the first sequence and the elements of the second sequence, the elements being pre-assigned preset weights; calculates a correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence according to the increments of those elements, where i = 2, 3, 4, ..., n; calculates the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient; and calculates the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients. Because the weight coefficients combine the Euclidean distance and the grey correlation degree, the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their shapes or contours; that is, the calculated similarity expresses the type similarity and the value similarity between the sequences at the same time.
The similarity recognition method for cluster analysis provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (6)
1. A method for similarity recognition in cluster analysis, comprising:
acquiring a first sequence and a second sequence;
calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, the elements being pre-assigned preset weights;
calculating a correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence according to the increment of the ith-dimension element in the first sequence and the increment of the ith-dimension element in the second sequence, where i = 2, 3, 4, ..., n;
calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;
and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
2. The method of claim 1, wherein calculating the Euclidean distance between the pre-weighted elements of the first sequence and the pre-weighted elements of the second sequence comprises:
calculating, based on a Euclidean distance model, the Euclidean distance between the elements of the first sequence pre-assigned the preset weights and the elements of the second sequence pre-assigned the preset weights;
wherein the first sequence is $X = [x_1, x_2, \ldots, x_n]$ and the second sequence is $Y = [y_1, y_2, \ldots, y_n]$; $\omega_i$ is the preset weight, with $\omega_i \in [0, 1]$; and $n$ is the total number of elements in the sequence.
3. The method of claim 2, wherein calculating the correlation coefficient between the ith-dimension element in the first sequence and the ith-dimension element in the second sequence based on the increment of the ith-dimension element in the first sequence and the increment of the ith-dimension element in the second sequence comprises:
calculating the increment $D_{x_i}$ of the ith-dimension element in the first sequence and the increment $D_{y_i}$ of the ith-dimension element in the second sequence;
calculating, based on a correlation coefficient model, a correlation coefficient between the ith-dimension element of the first sequence and the ith-dimension element of the second sequence;
wherein $D_{x_i} = x_i - x_{i-1}$ and $D_{y_i} = y_i - y_{i-1}$; the model sets the correlation coefficient to 1 under one condition and gives a separate expression under a second condition (the model and both conditions are formula images not reproduced in this text).
4. The method of claim 3, wherein calculating the grey correlation degree between the first sequence and the second sequence based on the correlation coefficient comprises:
calculating, based on a grey correlation degree model, the grey correlation degree between the first sequence and the second sequence from the correlation coefficient of each dimension i (the model is a formula image not reproduced in this text).
5. The method according to any one of claims 1 to 4, wherein calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients, comprises:
calculating, based on a similarity recognition model, the similarity between the first sequence and the second sequence (the model is a formula image not reproduced in this text);
wherein $\mu$ and $\nu$ are both weight coefficients, and $\mu + \nu = 1$.
6. The method of claim 5, wherein the weight coefficients are both 0.5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432635.2A CN107247776A (en) | 2017-06-09 | 2017-06-09 | A method for similarity identification in cluster analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432635.2A CN107247776A (en) | 2017-06-09 | 2017-06-09 | A method for similarity identification in cluster analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107247776A (en) | 2017-10-13 |
Family
ID=60019216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710432635.2A Pending CN107247776A (en) | A method for similarity identification in cluster analysis | 2017-06-09 | 2017-06-09 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107247776A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109336720A (en) * | 2018-11-30 | 2019-02-15 | 四川大丰收农业科技有限公司 | A kind of bio-organic fertilizer prepared using agricultural wastes bacteria residue, stalk fermentation |
CN114743120A (en) * | 2022-06-10 | 2022-07-12 | 深圳联和智慧科技有限公司 | Roadside vehicle illegal lane occupation detection method and system based on image recognition |
CN114743120B (en) * | 2022-06-10 | 2022-09-06 | 深圳联和智慧科技有限公司 | Roadside vehicle illegal lane occupation detection method and system based on image recognition |
CN115982429A (en) * | 2023-03-21 | 2023-04-18 | 中交第四航务工程勘察设计院有限公司 | Knowledge management method and system based on flow control |
CN115982429B (en) * | 2023-03-21 | 2023-08-01 | 中交第四航务工程勘察设计院有限公司 | Knowledge management method and system based on flow control |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049526B (en) | Based on the cross-media retrieval method of double space study | |
US9129152B2 (en) | Exemplar-based feature weighting | |
CN112529068B (en) | Multi-view image classification method, system, computer equipment and storage medium | |
CN104392231A (en) | Block and sparse principal feature extraction-based rapid collaborative saliency detection method | |
CN110472652A (en) | A small amount of sample classification method based on semanteme guidance | |
CN107301643A (en) | Well-marked target detection method based on robust rarefaction representation Yu Laplce's regular terms | |
CN107247776A (en) | A method for similarity identification in cluster analysis | |
CN104036259A (en) | Face similarity recognition method and system | |
Lu et al. | Clustering by Sorting Potential Values (CSPV): A novel potential-based clustering method | |
CN114913379B (en) | Remote sensing image small sample scene classification method based on multitasking dynamic contrast learning | |
CN109190511A (en) | Hyperspectral classification method based on part Yu structural constraint low-rank representation | |
CN113095158A (en) | Handwriting generation method and device based on countermeasure generation network | |
Roux et al. | An efficient parallel global optimization strategy based on Kriging properties suitable for material parameters identification | |
Ding et al. | Efficient vanishing point detection method in unstructured road environments based on dark channel prior | |
CN117078956A (en) | Point cloud classification segmentation network based on point cloud multi-scale parallel feature extraction and attention mechanism | |
CN111738319A (en) | Clustering result evaluation method and device based on large-scale samples | |
Sreevalsan-Nair et al. | Local geometric descriptors for multi-scale probabilistic point classification of airborne LiDAR point clouds | |
CN102737253A (en) | SAR (Synthetic Aperture Radar) image target identification method | |
CN108829886A (en) | A kind of branch mailbox method and apparatus | |
CN112200252A (en) | Joint dimension reduction method based on probability box global sensitivity analysis and active subspace | |
CN108764301B (en) | A kind of distress in concrete detection method based on reversed rarefaction representation | |
Zeng et al. | Robust 3D keypoint detection method based on double Gaussian weighted dissimilarity measure | |
CN110297851A (en) | A kind of improvement relevance acquisition methods of electric load | |
US20200034735A1 (en) | System for generating topic inference information of lyrics | |
Xue et al. | Image edge detection algorithm research based on the CNN's neighborhood radius equals 2 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171013 |