CN107247776A - A method for similarity identification in cluster analysis - Google Patents

Info

Publication number
CN107247776A
Authority
CN
China
Prior art keywords
sequence
similarity
calculating
dimension element
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710432635.2A
Other languages
Chinese (zh)
Inventor
王星华
周亚武
陈云龙
许炫壕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201710432635.2A priority Critical patent/CN107247776A/en
Publication of CN107247776A publication Critical patent/CN107247776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Analysing Materials By The Use Of Radiation (AREA)

Abstract

The invention discloses a method for similarity identification in cluster analysis. The method comprises: acquiring a first sequence and a second sequence; calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight; calculating the correlation coefficient between the i-th dimension element of the first sequence and the i-th dimension element of the second sequence according to the increments of those elements; calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficients; and calculating the similarity between the two sequences from the grey correlation degree and the Euclidean distance using preset weight coefficients. By combining the Euclidean distance and the grey correlation degree through the weight coefficients, the resulting similarity reflects both the spatial distance between the two sequences and their similarity in shape; that is, the calculated similarity represents the "type similarity" and the "value similarity" between the sequences at the same time.

Description

Method for identifying similarity in cluster analysis
Technical Field
The invention relates to the field of data mining, in particular to a method and a device for similarity identification in cluster analysis.
Background
With the advent of the big data era, massive and complex data have accumulated in many fields, and how to mine the potential value of these data has become a research hotspot. Cluster analysis in particular is widely applied in fields such as weather forecasting, electric power, finance, and forestry.
Cluster analysis is a multivariate analysis method from mathematical statistics that quantitatively determines how closely samples are related and classifies them objectively on that basis. The objects being clustered are referred to as samples, and the group of objects being clustered is referred to as a sample set. A similarity function serves as the tool for measuring the degree of similarity between sample data.
At present, common similarity functions include the Euclidean distance method and the grey correlation method. The Euclidean distance method is a static analysis method suited to the static analysis of research objects. It reflects only the spatial distance between two research objects, so it can ensure value similarity between sequences but cannot fully ensure similarity of their shapes or contours, that is, it cannot ensure type similarity. The grey correlation method is a dynamic analysis method suited to the dynamic processes of research objects. It analyzes the change trends among research objects, so it can ensure type similarity but cannot ensure value similarity. In summary, each method on its own expresses similarity incompletely: neither can express the "type similarity" and the "value similarity" between sequences at the same time.
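The limitation of a pure distance measure can be seen in a small example. The sketch below uses made-up sequences (not taken from the patent) in which two candidates lie at exactly the same Euclidean distance from a reference sequence, although only one of them follows the reference's shape.

```python
import math

# Illustrative data only; these sequences are assumptions, not from the patent.
reference = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]   # same shape as the reference, offset by 1
zigzag = [2.0, 1.0, 4.0, 3.0]    # different shape, same pointwise offsets in magnitude


def increments(sequence):
    """Differences between each element and the element before it."""
    return [b - a for a, b in zip(sequence, sequence[1:])]


d_shifted = math.dist(reference, shifted)
d_zigzag = math.dist(reference, zigzag)

# Both candidates are equally far from the reference ...
print(d_shifted, d_zigzag)
# ... but only the shifted one changes in the same way step by step.
print(increments(reference), increments(shifted), increments(zigzag))
```

The Euclidean distance treats both candidates as equally similar to the reference, while the increments show that only one of them tracks its trend.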
Disclosure of Invention
The invention aims to provide a method for identifying similarity in cluster analysis, and aims to solve the problem of incomplete similarity expression in the prior art.
In order to solve the above technical problem, the present invention provides a method for similarity recognition in cluster analysis, including:
acquiring a first sequence and a second sequence;
calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight;
calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, ..., n;
calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;
and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
Optionally, calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight, includes:
calculating the Euclidean distance based on the Euclidean distance model $d(X,Y)=\sqrt{\sum_{i=1}^{n}(\omega_i x_i-\omega_i y_i)^2}$;
wherein the first sequence is $X=[x_1,x_2,\ldots,x_n]$ and the second sequence is $Y=[y_1,y_2,\ldots,y_n]$; $\omega_i$ is the preset weight, $\omega_i\in[0,1]$; and n is the total number of elements in each sequence.
Optionally, calculating the correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence includes:
calculating the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence;
calculating, based on the correlation coefficient model [formula not reproduced in this text], the correlation coefficient between the i-th dimension element of the first sequence and the i-th dimension element of the second sequence;
wherein $D_{x_i}=x_i-x_{i-1}$ and $D_{y_i}=y_i-y_{i-1}$; when [condition not reproduced in this text] is zero, the correlation coefficient equals 1; and when [condition not reproduced in this text] is zero, the correlation coefficient takes [value not reproduced in this text].
Optionally, calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient includes:
calculating the grey correlation degree between the first sequence and the second sequence based on the grey correlation degree model [formula not reproduced in this text], whose input for each dimension i is the correlation coefficient.
Optionally, calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients, includes:
calculating the similarity between the first sequence and the second sequence based on the similarity recognition model [formula not reproduced in this text];
wherein μ and ν are both weight coefficients, and μ + ν = 1.
Optionally, the weight coefficients μ and ν are both 0.5.
The invention provides a method for identifying similarity in cluster analysis, which comprises: acquiring a first sequence and a second sequence; calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight; calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, ..., n; calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient; and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients. In this method, the Euclidean distance and the grey correlation degree between the sequences are combined through the weight coefficients, so that the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their form or contour; that is, the calculated similarity represents the type similarity and the value similarity between the sequences at the same time.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a specific implementation of a similarity recognition method for cluster analysis according to an embodiment of the present invention, where the method includes the following steps:
Step 101: acquiring a first sequence and a second sequence.
It should be noted that the first sequence and the second sequence may refer to two research objects of cluster analysis. The first sequence may specifically be $X=[x_1,x_2,\ldots,x_n]$, and the second sequence may specifically be $Y=[y_1,y_2,\ldots,y_n]$.
Step 102: calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight.
Specifically, each element in a sequence is assigned a weight in advance, and the Euclidean distance between corresponding elements of the two sequences is then calculated from the definition of the Euclidean distance. For example, the j-th element $x_j$ of the first sequence X has weight $\omega_j$, and the j-th element $y_j$ of the second sequence Y also has weight $\omega_j$. First $(\omega_j x_j-\omega_j y_j)^2$ is calculated, and the same calculation is repeated for every pair of corresponding elements in the two sequences. The squared terms are then summed, and the square root of the sum gives the Euclidean distance between the sequences.
As a specific implementation, the above process of calculating the Euclidean distance may be: calculating the Euclidean distance based on the Euclidean distance model $d(X,Y)=\sqrt{\sum_{i=1}^{n}(\omega_i x_i-\omega_i y_i)^2}$; wherein the first sequence is $X=[x_1,x_2,\ldots,x_n]$ and the second sequence is $Y=[y_1,y_2,\ldots,y_n]$; $\omega_i$ is the preset weight, $\omega_i\in[0,1]$; and n is the total number of elements in each sequence.
Obviously, the value of $\omega_i$ may be set according to actual conditions and is not limited herein.
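The weighted distance of Step 102 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `weighted_euclidean_distance` and the input checks are hypothetical and not part of the patent; the arithmetic follows the description above (weight each element, square the differences, sum, take the square root).

```python
import math
from typing import Sequence


def weighted_euclidean_distance(
    x: Sequence[float], y: Sequence[float], weights: Sequence[float]
) -> float:
    """Square root of the sum of (w_i * x_i - w_i * y_i)^2 over all elements."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    # The description restricts each preset weight to the interval [0, 1].
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    return math.sqrt(sum((w * a - w * b) ** 2 for a, b, w in zip(x, y, weights)))


# Illustrative values only.
print(weighted_euclidean_distance([1.0, 2.0, 3.0], [1.0, 4.0, 7.0], [1.0, 0.5, 0.5]))
```

With all weights equal to 1 the function reduces to the ordinary Euclidean distance; lowering a weight reduces the contribution of that dimension.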
Step 103: calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, ..., n.
It should be noted that the increment may be obtained by subtracting the previous element from the current element of the sequence, for example $x_i-x_{i-1}$. The increment may, of course, also be a difference between non-adjacent elements in the sequence.
As a specific implementation, the above process of calculating the correlation coefficient may be: calculating the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence; and calculating, based on the correlation coefficient model [formula not reproduced in this text], the correlation coefficient between the i-th dimension element of the first sequence and the i-th dimension element of the second sequence; wherein $D_{x_i}=x_i-x_{i-1}$ and $D_{y_i}=y_i-y_{i-1}$; when [condition not reproduced in this text] is zero, the correlation coefficient equals 1; and when [condition not reproduced in this text] is zero, the correlation coefficient takes [value not reproduced in this text].
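The increments used in Step 103 are simple differences and can be sketched as follows. The helper name `increments` and the `lag` parameter are assumptions introduced for illustration: `lag=1` gives the adjacent-element difference described above, and a larger `lag` is one possible way to take differences between non-adjacent elements.

```python
from typing import List, Sequence


def increments(sequence: Sequence[float], lag: int = 1) -> List[float]:
    """Return sequence[i] - sequence[i - lag] for every index where both exist.

    With lag=1 the first returned value corresponds to i = 2 in the
    1-based numbering used in the text.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [sequence[i] - sequence[i - lag] for i in range(lag, len(sequence))]


# Illustrative values only.
print(increments([1, 3, 6, 10]))         # adjacent differences
print(increments([1, 3, 6, 10], lag=2))  # differences two positions apart
```

A sequence of n elements yields n - 1 adjacent increments, which is why the correlation coefficient is defined only for i = 2, 3, ..., n.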
When $\lambda_i$ equals 1, the correlation coefficient is greater than 0, indicating that the i-th dimension elements of sequences X and Y change in the same direction relative to their (i-1)-th dimension elements; when $\lambda_i$ equals -1, the correlation coefficient is less than 0, indicating that the i-th dimension elements of sequences X and Y change in opposite directions relative to their (i-1)-th dimension elements.
The traditional grey correlation degree can only reflect trend changes in the same direction between sequences, that is, both sequences changing in the positive direction or both changing in the negative direction. Here, a grey correlation model with a sign function is introduced so that both same-direction and opposite-direction trend changes can be reflected.
It can be seen that introducing the sign function $\lambda_i$ into the grey correlation part allows positive and negative correlation between the sequences to be reflected, which improves the expressive power of the similarity function.
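The role of the sign function can be illustrated with a short sketch. The names `direction_sign` and `direction_signs` are hypothetical. The text defines the sign only through its two values, 1 for same-direction change and -1 for opposite-direction change; returning 0 when either increment is zero is an assumption made here purely as a "no direction" marker, since the source handles zero increments with separate rules that are not reproduced in this text.

```python
from typing import List, Sequence


def direction_sign(dx: float, dy: float) -> int:
    """1 if both increments have the same sign, -1 if the signs are opposite."""
    product = dx * dy
    if product > 0:
        return 1
    if product < 0:
        return -1
    # Assumption: zero increments are outside the two cases described in the
    # text, so they are flagged with 0 rather than assigned a direction.
    return 0


def direction_signs(x: Sequence[float], y: Sequence[float]) -> List[int]:
    """Direction sign for each pair of adjacent increments of x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [
        direction_sign(x[i] - x[i - 1], y[i] - y[i - 1]) for i in range(1, len(x))
    ]


# Illustrative values only.
print(direction_signs([1, 2, 3, 2], [5, 7, 6, 4]))
```

In the example the two sequences rise together, then diverge, then fall together, so the signs alternate between same-direction and opposite-direction steps.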
Step 104: calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient.
Specifically, after the correlation coefficient for each dimension of the two sequences has been calculated, the grey correlation degree between the sequences may be calculated from these correlation coefficients.
As a specific implementation, the above process of calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient may be: calculating the grey correlation degree between the first sequence and the second sequence based on the grey correlation degree model [formula not reproduced in this text], whose input for each dimension i is the correlation coefficient.
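As a rough sketch of Step 104, the per-dimension correlation coefficients can be aggregated into a single value. The source's own model formula is not reproduced in this text, so the arithmetic mean used below is an assumption, chosen because averaging the coefficients is the conventional aggregation in grey relational analysis; the function name `grey_correlation_degree` is likewise hypothetical.

```python
from typing import Sequence


def grey_correlation_degree(coefficients: Sequence[float]) -> float:
    """Aggregate per-dimension correlation coefficients into one degree.

    Assumption: the degree is the arithmetic mean of the coefficients for
    i = 2, ..., n. This stands in for the patent's model, which is missing
    from this text.
    """
    if not coefficients:
        raise ValueError("at least one correlation coefficient is required")
    return sum(coefficients) / len(coefficients)


# Illustrative coefficients only; a negative value marks an opposite-direction step.
print(grey_correlation_degree([1.0, -1.0, 0.5, 0.5]))
```

Because the coefficients carry a sign, opposite-direction steps pull the aggregate down rather than merely contributing less.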
Step 105: calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
The preset weight coefficients refer to the weight coefficient of the grey correlation degree and the weight coefficient of the Euclidean distance. Specifically, the weight coefficient of the grey correlation degree is μ, the weight coefficient of the Euclidean distance is ν, and μ + ν = 1.
Optionally, μ = 0.5 and ν = 0.5. Of course, when more emphasis on the "type similarity" between sequences is needed, the value of μ can be increased accordingly; when more emphasis on the "value similarity" between sequences is needed, the value of ν can be increased accordingly. That is, the values of μ and ν can be adjusted according to actual conditions and are not limited herein.
As a specific implementation, the process of calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients, may be: calculating the similarity between the first sequence and the second sequence based on the similarity recognition model [formula not reproduced in this text]; wherein μ and ν are both weight coefficients, and μ + ν = 1.
It can be seen that the similarity recognition model has two parts, one of which is a gray correlation between two sequences, which can represent the similarity of morphology or contour between the sequences, i.e. the "type similarity"; the other part is the Euclidean distance between two sequences, which can indicate the spatial distance between the sequences, i.e., the degree of "similarity in value". The Euclidean distance function and the grey correlation degree are organically combined together through the weight coefficient, so that the limitation of a single method in the prior art can be overcome, and the expression of the similarity is more complete.
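The weighted combination of the two parts can be sketched as below. This is not the patent's similarity recognition model, whose formula is not reproduced in this text. The function name `combined_similarity` is hypothetical, and converting the distance d into a similarity-like term with 1 / (1 + d) is an assumption made only so that both parts grow with similarity; the source states only that the parts are combined with weights μ and ν satisfying μ + ν = 1.

```python
def combined_similarity(
    grey_degree: float, distance: float, mu: float = 0.5, nu: float = 0.5
) -> float:
    """Weighted combination of a shape term and a value term.

    grey_degree: grey correlation degree (the "type similarity" part).
    distance:    weighted Euclidean distance (the "value similarity" part).
    """
    if abs(mu + nu - 1.0) > 1e-9:
        raise ValueError("mu and nu must sum to 1")
    if distance < 0.0:
        raise ValueError("distance must be non-negative")
    # Assumption: map the distance into (0, 1] so that smaller distances
    # give larger contributions. The patent's own mapping is not available.
    value_similarity = 1.0 / (1.0 + distance)
    return mu * grey_degree + nu * value_similarity


# Illustrative inputs only.
print(combined_similarity(0.8, 0.0))                    # equal weights
print(combined_similarity(0.8, 1.0, mu=0.75, nu=0.25))  # more emphasis on shape
```

Raising μ emphasizes shape agreement and raising ν emphasizes closeness in value, which mirrors the tuning advice given for the two weight coefficients.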
The similarity recognition method for cluster analysis provided by the embodiment of the invention acquires a first sequence and a second sequence; calculates the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight; calculates a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, ..., n; calculates the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient; and calculates the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients. In this method, the Euclidean distance and the grey correlation degree between the sequences are combined through the weight coefficients, so that the resulting similarity reflects both the spatial distance between the two sequences and the similarity of their form or contour; that is, the calculated similarity represents the type similarity and the value similarity between the sequences at the same time.
The similarity recognition method for cluster analysis provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (6)

1. A method for similarity recognition in cluster analysis, comprising:
acquiring a first sequence and a second sequence;
calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight;
calculating a correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence, wherein i = 2, 3, ..., n;
calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient;
and calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients.
2. The method of claim 1, wherein calculating the Euclidean distance between the elements of the first sequence and the elements of the second sequence, each element being pre-assigned a preset weight, comprises:
calculating the Euclidean distance based on the Euclidean distance model $d(X,Y)=\sqrt{\sum_{i=1}^{n}(\omega_i x_i-\omega_i y_i)^2}$;
wherein the first sequence is $X=[x_1,x_2,\ldots,x_n]$ and the second sequence is $Y=[y_1,y_2,\ldots,y_n]$; $\omega_i$ is the preset weight, $\omega_i\in[0,1]$; and n is the total number of elements in each sequence.
3. The method of claim 2, wherein calculating the correlation coefficient between the i-th dimension element in the first sequence and the i-th dimension element in the second sequence according to the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence comprises:
calculating the increment of the i-th dimension element in the first sequence and the increment of the i-th dimension element in the second sequence;
calculating, based on the correlation coefficient model [formula not reproduced in this text], the correlation coefficient between the i-th dimension element of the first sequence and the i-th dimension element of the second sequence;
wherein $D_{x_i}=x_i-x_{i-1}$ and $D_{y_i}=y_i-y_{i-1}$; when [condition not reproduced in this text] is zero, the correlation coefficient equals 1; and when [condition not reproduced in this text] is zero, the correlation coefficient takes [value not reproduced in this text].
4. The method of claim 3, wherein calculating the grey correlation degree between the first sequence and the second sequence according to the correlation coefficient comprises:
calculating the grey correlation degree between the first sequence and the second sequence based on the grey correlation degree model [formula not reproduced in this text], whose input for each dimension i is the correlation coefficient.
5. The method according to any one of claims 1 to 4, wherein calculating the similarity between the first sequence and the second sequence according to the grey correlation degree and the Euclidean distance, using preset weight coefficients, comprises:
calculating the similarity between the first sequence and the second sequence based on the similarity recognition model [formula not reproduced in this text];
wherein μ and ν are both weight coefficients, and μ + ν = 1.
6. The method of claim 5, wherein the weighting factors are each 0.5.
CN201710432635.2A 2017-06-09 2017-06-09 A method for similarity identification in cluster analysis Pending CN107247776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710432635.2A CN107247776A (en) 2017-06-09 2017-06-09 A method for similarity identification in cluster analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710432635.2A CN107247776A (en) 2017-06-09 2017-06-09 A method for similarity identification in cluster analysis

Publications (1)

Publication Number Publication Date
CN107247776A (en) 2017-10-13

Family

ID=60019216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710432635.2A Pending CN107247776A (en) 2017-06-09 2017-06-09 A method for similarity identification in cluster analysis

Country Status (1)

Country Link
CN (1) CN107247776A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109336720A (en) * 2018-11-30 2019-02-15 四川大丰收农业科技有限公司 A kind of bio-organic fertilizer prepared using agricultural wastes bacteria residue, stalk fermentation
CN114743120A (en) * 2022-06-10 2022-07-12 深圳联和智慧科技有限公司 Roadside vehicle illegal lane occupation detection method and system based on image recognition
CN114743120B (en) * 2022-06-10 2022-09-06 深圳联和智慧科技有限公司 Roadside vehicle illegal lane occupation detection method and system based on image recognition
CN115982429A (en) * 2023-03-21 2023-04-18 中交第四航务工程勘察设计院有限公司 Knowledge management method and system based on flow control
CN115982429B (en) * 2023-03-21 2023-08-01 中交第四航务工程勘察设计院有限公司 Knowledge management method and system based on flow control

Similar Documents

Publication Publication Date Title
CN103049526B (en) Based on the cross-media retrieval method of double space study
US9129152B2 (en) Exemplar-based feature weighting
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
CN104392231A (en) Block and sparse principal feature extraction-based rapid collaborative saliency detection method
CN110472652A (en) A small amount of sample classification method based on semanteme guidance
CN107301643A (en) Well-marked target detection method based on robust rarefaction representation Yu Laplce's regular terms
CN107247776A A method for similarity identification in cluster analysis
CN104036259A (en) Face similarity recognition method and system
Lu et al. Clustering by Sorting Potential Values (CSPV): A novel potential-based clustering method
CN114913379B (en) Remote sensing image small sample scene classification method based on multitasking dynamic contrast learning
CN109190511A (en) Hyperspectral classification method based on part Yu structural constraint low-rank representation
CN113095158A (en) Handwriting generation method and device based on countermeasure generation network
Roux et al. An efficient parallel global optimization strategy based on Kriging properties suitable for material parameters identification
Ding et al. Efficient vanishing point detection method in unstructured road environments based on dark channel prior
CN117078956A (en) Point cloud classification segmentation network based on point cloud multi-scale parallel feature extraction and attention mechanism
CN111738319A (en) Clustering result evaluation method and device based on large-scale samples
Sreevalsan-Nair et al. Local geometric descriptors for multi-scale probabilistic point classification of airborne LiDAR point clouds
CN102737253A (en) SAR (Synthetic Aperture Radar) image target identification method
CN108829886A (en) A kind of branch mailbox method and apparatus
CN112200252A (en) Joint dimension reduction method based on probability box global sensitivity analysis and active subspace
CN108764301B (en) A kind of distress in concrete detection method based on reversed rarefaction representation
Zeng et al. Robust 3D keypoint detection method based on double Gaussian weighted dissimilarity measure
CN110297851A (en) A kind of improvement relevance acquisition methods of electric load
US20200034735A1 (en) System for generating topic inference information of lyrics
Xue et al. Image edge detection algorithm research based on the CNN's neighborhood radius equals 2

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20171013)