CN110334754A

CN110334754A - A method for fast classification of stellar spectral data

Info

Publication number: CN110334754A
Application number: CN201910562679.6A
Authority: CN
Inventors: 栗雅婷; 蔡江辉; 杨海峰; 张继福; 赵旭俊
Original assignee: Taiyuan University of Science and Technology
Current assignee: Taiyuan University of Science and Technology
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2019-10-15

Abstract

The invention provides a method for fast classification of stellar spectral data. It first finds a typical spectrum for each class to serve as a cluster center, and then clusters the remaining spectra according to their distance to each typical spectrum. In general, cluster centers are data points that have a relatively high density within a small radius and lie far from one another; the invention determines the initial cluster centers using MNN (M nearest neighbors), density, and distance. Starting from the characteristics of the spectral data itself, it computes the density and distance features of each stellar spectrum and constructs a probability function that reflects how likely that spectrum is to become a cluster center. The stellar spectra with the highest probability are selected as the initial cluster centers, which improves clustering accuracy.

Description

A method for fast classification of stellar spectral data

Technical field

The present invention relates to a method for fast classification of stellar spectral data, used to classify the stellar spectral data taken by LAMOST, and belongs to the technical field of data mining.

Background art

LAMOST is a large optical telescope independently designed and built in China, and it is technically very challenging. As the telescope with the highest celestial spectrum acquisition rate, LAMOST will break through the spectral-observation "bottleneck" in astronomical research and become the most powerful spectroscopic survey telescope. Its most prominent features are a large aperture (4 meters), a wide field of view (5 degrees), and a very large spectral observation system made up of 4000 optical fibers. LAMOST has made a great contribution to the spectral observation of tens of millions of galaxies and quasars, and of objects within the Milky Way, including a large number of stars. In recent years, the continued implementation of large sky surveys and the emergence of new observation techniques have produced many large data sets. The LAMOST pilot survey alone released more than 480,000 spectra, covering stars, galaxies, and some objects of unknown type. The stellar spectra taken by LAMOST include several types, such as A, F, G, K, and M, and classifying these spectra purely by hand would take a great deal of time and effort.

Data mining is the process of discovering interesting patterns and knowledge in large amounts of data. Clustering is a typical unsupervised algorithm and occupies an important position in data mining. The purpose of clustering is to group a set of data objects into multiple groups, or clusters, so that objects in the same cluster are highly similar to one another and highly dissimilar to objects in other clusters. Traditional clustering algorithms can be roughly divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and so on. Most clustering algorithms face challenges such as difficulty in selecting cluster centers, manual determination of the number of clusters K, and low clustering accuracy.

Summary of the invention

To solve the problem of stellar spectral classification, the invention discloses a method for fast classification of stellar spectral data. Starting from the characteristics of the spectral data itself, the invention computes the density and distance features of each stellar spectrum and constructs a probability function that reflects how likely that spectrum is to become a cluster center. The stellar spectra with the highest probability are selected as the initial cluster centers, which improves clustering accuracy.

The method for fast classification of stellar spectral data provided by the invention comprises the following steps:

S1: Collect the stellar spectral data taken by LAMOST and normalize the collected data; each spectrum is regarded as one object.

S2: Calculate the distance d between any two objects.

S3: Taking each object as the center, find its M nearest neighbors and define the distance from the M-th nearest neighbor to the center as r; repeated distances are calculated only once.

S4: Calculate the mean of all r, denoted R.

S5: Calculate the density ρ within the R-neighborhood of each data object.

S6: Divide the density ρ of each object by the distance r of its M nearest neighbors; the result is denoted Pro.

S7: Sort Pro in descending order and output the objects with the K largest values as the K initial cluster centers, that is, the K typical spectra.

S8: For each remaining object, calculate its distance to each cluster center and, following the nearest-distance principle, assign it to the cluster of the corresponding initial center; a smaller distance indicates greater similarity, so each object is assigned to its most similar cluster.

As a further improvement, in step S1 the stellar spectral data taken by LAMOST is collected and normalized as follows: each spectrum is regarded as one object, and object x_i is assumed to contain P-dimensional data. The mean is calculated and the coordinates are then normalized (the expressions for x_i, the mean, and the normalized coordinates are missing from this text). Because the raw data contains multiple features and has more than 3000 dimensions, only one of the features is extracted in order to simplify the calculation.

As a further improvement, the distance d between any two objects in step S2 is calculated as follows: let x_i and x_j be any two spectra in the data set; their Euclidean distance is

d_p(x_i, x_j) = sqrt( Σ_{a_m ∈ P} ( f(x_i, a_m) - f(x_j, a_m) )^2 )

where d_p(x_i, x_j) is the distance between x_i and x_j with respect to the attribute set P, a_m ∈ P is an attribute of the objects, and f(x_i, a_m) is the value of object x_i on attribute a_m.
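The pairwise Euclidean distance of step S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the normalized spectra are stored as the rows of a NumPy array, and the function name `pairwise_euclidean` is hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the symmetric n x n matrix of Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once.
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)   # clip tiny negative values caused by rounding
    np.fill_diagonal(d2, 0.0)     # the distance of an object to itself is zero
    return np.sqrt(d2)

# Three toy "spectra" with two attributes each (illustrative values only).
D = pairwise_euclidean([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
```

Because the result is symmetric, each distance only needs to be computed once, which matches the remark in step S3 that repeated distances are calculated only once.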

As a further improvement, the mean of all r in step S4 is calculated as follows:

R = (1/n) Σ_{i=1}^{n} r_i

where r_i is the distance r found for object x_i and n is the number of elements in the data set.

As a further improvement, the density ρ within the R-neighborhood of each data object in step S5 is calculated as follows:

Taking point x_i as the center, if the Euclidean distance between x_i and x_j satisfies d_p(x_i, x_j) ≤ R, then x_j is an element of the cluster formed with x_i as its cluster center and the density value of x_i is increased by 1; otherwise the density value of x_i is increased by 0. The density ρ_i of point x_i is calculated as:

ρ_i = Σ_j χ(x_i, x_j), where χ(x_i, x_j) = 1 if d_p(x_i, x_j) ≤ R and χ(x_i, x_j) = 0 otherwise.
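The density count of step S5 can be sketched as below, given a precomputed distance matrix and the radius R from step S4. The text does not say whether a point counts itself as a neighbor; this sketch assumes it does not. The function name `neighborhood_density` is hypothetical.

```python
import numpy as np

def neighborhood_density(D, R):
    """Count, for each object, how many other objects lie within distance R."""
    D = np.asarray(D, dtype=float)
    within = D <= R                   # True where x_j is inside the R-neighborhood of x_i
    np.fill_diagonal(within, False)   # assumption: a point is not its own neighbor
    return within.sum(axis=1)

# Toy symmetric distance matrix for three objects (illustrative values only).
rho = neighborhood_density([[0, 1, 4], [1, 0, 2], [4, 2, 0]], R=2.0)
```

Counting the point itself would add the same constant to every density, so it only matters when densities are later divided by r.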

As a further improvement, Pro in step S6 is calculated as:

Pro_i = ρ_i / r_i

where ρ_i is the density of object x_i and r_i is the distance of its M nearest neighbors.

The beneficial effects of the present invention are:

Compared with the prior art, the execution flow of the method mainly comprises seven steps: collecting the data and normalizing it; calculating the distance between any two data points; finding the M-nearest-neighbor distance r of each data point; solving for the neighborhood radius R; calculating the density of each data point; evaluating the probability that a point becomes an initial cluster center; and selecting the final initial centers. The invention first finds a typical spectrum for each class to serve as a cluster center, and then clusters the remaining spectra according to their distance to each typical spectrum. In general, cluster centers are data points that have a relatively high density within a small radius and lie far from one another; the method determines the initial cluster centers using MNN (M nearest neighbors), density, and distance. Starting from the characteristics of the spectral data itself, the method computes the density and distance features of each stellar spectrum and constructs a probability function that reflects how likely that spectrum is to become a cluster center. The stellar spectra with the highest probability are selected as the initial cluster centers, which improves clustering accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a schematic diagram of the distance r of the M nearest neighbors;

Fig. 3 is a schematic diagram of the density of a data point;

Fig. 4 shows the K typical stellar spectra of different types found using the method of the invention.

Specific embodiment

The invention is described in further detail below with reference to specific embodiments, but the scope of protection of the invention is not limited to these embodiments; any change or equivalent substitution that does not depart from the inventive concept falls within the scope of protection. The definitions used in the invention are as follows:

Definition 1 (neighborhood radius). For each data point, find its M nearest neighbors (the M objects closest to the data point); the largest of the distances from these M nearest neighbors to the data point is defined as r. As shown in Fig. 2, the dashed line indicates the largest distance r found with point x_i as the cluster center and M = 6.

Definition 2 (density). Taking each data point as a cluster center, the number of points lying within the neighborhood radius R of that data point is counted and regarded as the density of the point; the more neighbors a data point has, the greater its density.

As shown in Fig. 3, the points inside the closed dashed circle are all the neighbors of point x_i within radius R, and the number of these neighbors is the density of the point.

As shown in Figs. 1 to 4, the execution flow of the method mainly comprises seven steps: collecting the data and normalizing it; calculating the distance between any two data points; finding the M-nearest-neighbor distance r of each data point; solving for the neighborhood radius R; calculating the density of each data point; evaluating the probability that a point becomes an initial cluster center; and selecting the final initial centers. The details are as follows:

S1: Collect the stellar spectral data taken by LAMOST and normalize the collected data. Each spectrum is regarded as one object, and object x_i is assumed to contain P-dimensional data. The mean is calculated and the coordinates are then normalized (the expressions for x_i, the mean, and the normalized coordinates are missing from this text). Because the raw data contains multiple features and has more than 3000 dimensions, only one of the features is extracted in order to simplify the calculation.

S2: Calculate the distance d between any two objects, specifically: let x_i and x_j be any two spectra in the data set; their Euclidean distance is

d_p(x_i, x_j) = sqrt( Σ_{a_m ∈ P} ( f(x_i, a_m) - f(x_j, a_m) )^2 )

where d_p(x_i, x_j) is the distance between x_i and x_j with respect to the attribute set P, a_m ∈ P is an attribute of the objects, and f(x_i, a_m) is the value of object x_i on attribute a_m. The distances are stored in a symmetric matrix.

S3: Taking each object as the center, find its M nearest neighbors and define the distance from the M-th nearest neighbor to the center as r; repeated distances are calculated only once. That is, each row of the distance matrix is sorted in ascending order (the sorted matrix is not reproduced in this text), and the value at the M-th point is taken as the neighborhood radius r found for each object under the equal-density condition. For example, with point x3 as the cluster center and M = 2, the r found is 0.223.

S4: Calculate the mean of all r, denoted R, specifically:

R = (1/n) Σ_{i=1}^{n} r_i

where r_i is the distance r found for object x_i and n is the number of elements in the data set.
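Steps S3 and S4 can be sketched together as below, following the row-sorting description above. The sketch assumes a precomputed symmetric distance matrix with zeros on the diagonal; the function names and the toy values are hypothetical and are not the matrix from the patent's example.

```python
import numpy as np

def mth_neighbor_distance(D, M):
    """Return r, where r[i] is the distance from object i to its M-th nearest neighbor."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    if not 1 <= M <= n - 1:
        raise ValueError("M must be between 1 and n - 1")
    # Sort each row in ascending order. Column 0 holds the zero distance of the
    # object to itself, so column M holds the distance to the M-th nearest neighbor.
    return np.sort(D, axis=1)[:, M]

def neighborhood_radius(r):
    """Return R, the mean of all r values (step S4)."""
    return float(np.mean(r))

# Toy symmetric distance matrix for three objects (illustrative values only).
D = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
r = mth_neighbor_distance(D, M=1)
R = neighborhood_radius(r)
```

Using a single averaged radius R for every object means the density in step S5 is measured on the same scale for all points, while r stays specific to each object.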

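S5: Calculate the density ρ within the R-neighborhood of each data object, which can be understood with reference to Fig. 3, specifically: taking point x_i as the center, every point x_j whose Euclidean distance to x_i satisfies d_p(x_i, x_j) ≤ R adds 1 to the density value of x_i, and every other point adds 0. The density ρ_i of point x_i is therefore:

ρ_i = Σ_j χ(x_i, x_j), where χ(x_i, x_j) = 1 if d_p(x_i, x_j) ≤ R and χ(x_i, x_j) = 0 otherwise.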

S6: Divide the density ρ of each object by the distance r of its M nearest neighbors; the result is denoted Pro and is used to measure the probability that a data point becomes an initial center:

Pro_i = ρ_i / r_i
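The score of step S6 is a simple element-wise ratio. A minimal sketch follows, assuming the densities and the M-nearest-neighbor distances are already available as arrays; the function name `center_score` and the guard against zero distances are additions made for illustration.

```python
import numpy as np

def center_score(rho, r):
    """Return Pro = rho / r for every object."""
    rho = np.asarray(rho, dtype=float)
    r = np.asarray(r, dtype=float)
    if np.any(r <= 0):
        # Assumption: duplicate spectra (r == 0) are removed beforehand.
        raise ValueError("r must be strictly positive")
    return rho / r

# Illustrative values only: densities and M-th neighbor distances of three objects.
pro = center_score([1, 2, 1], [1.0, 1.0, 2.0])
```

A point scores highly when many neighbors fall inside radius R (large ρ) and its own M nearest neighbors are close (small r). Note that Pro is a ranking score rather than a probability normalized to the interval from 0 to 1.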

S7: Sort Pro in descending order and output the objects with the K largest values as the K initial cluster centers, that is, the K typical spectra. Taking the stellar spectral data set as an example, the number of clusters is K = 5. Because different values of M may yield different results, the procedure is repeated several times with different M in order to find a small set of points that are most likely to become initial cluster centers. These points are then clustered with K-means, and the K-means centers after clustering are the centers of the stellar spectral data, as shown in Fig. 4.
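One way to read step S7 is sketched below: collect the top-K candidates for several values of M, then run K-means on the pooled candidates. This is an interpretation under stated assumptions, not the patent's code. The helper names, the plain Lloyd-style K-means, the random seed, and the choice to exclude a point from its own neighborhood are all hypothetical, and the sketch assumes the spectra are distinct so that r is never zero.

```python
import numpy as np

def top_k_candidates(D, M, K):
    """Indices of the K objects with the largest Pro = rho / r for one value of M."""
    D = np.asarray(D, dtype=float)
    r = np.sort(D, axis=1)[:, M]        # S3: distance to the M-th nearest neighbor
    R = r.mean()                        # S4: neighborhood radius
    rho = (D <= R).sum(axis=1) - 1      # S5: neighbors within R, excluding the point itself
    pro = rho / r                       # S6: score
    return np.argsort(-pro, kind="stable")[:K]

def kmeans(points, K, n_iter=100, seed=0):
    """Plain Lloyd-style K-means; returns the K centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def initial_centers(X, K, M_values):
    """Pool the top-K candidates over several M, then reduce them to K centers."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    candidates = set()
    for M in M_values:
        candidates.update(top_k_candidates(D, M, K).tolist())
    return kmeans(X[sorted(candidates)], K)
```

A call such as `initial_centers(X, K=5, M_values=[4, 6, 8])` would mirror the K = 5 setting mentioned in the text; the M values in that call are made up. The broadcasted distance computation stores an n x n x P array, so for real spectra a memory-friendlier distance routine would be needed.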

S8: For each of the remaining n-K objects, calculate its distance to each cluster center and, following the nearest-distance principle, assign it to the cluster of the corresponding initial center; a smaller distance indicates greater similarity, so each object is assigned to its most similar cluster.
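The nearest-center assignment of step S8 can be sketched as follows, assuming the spectra and the cluster centers are rows of NumPy arrays with the same number of columns; the function name `assign_to_nearest_center` is hypothetical.

```python
import numpy as np

def assign_to_nearest_center(X, centers):
    """Return, for each row of X, the index of the closest cluster center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # dist[i, k] is the Euclidean distance from object i to center k.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

# Four toy objects and two centers (illustrative values only).
labels = assign_to_nearest_center(
    [[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]],
    [[0.0, 0.0], [10.0, 10.0]],
)
```

The assignment is a single pass with no iterative refinement, which is consistent with the method's emphasis on speed.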

After the above processing, the method of the invention can determine the initial cluster centers more accurately and thereby overcomes the difficulty of classifying spectral data, which demonstrates the feasibility of the invention.

Claims

1. A method for fast classification of stellar spectral data, characterized by comprising the following steps:

S1: collecting the stellar spectral data taken by LAMOST and normalizing the collected data, each spectrum being regarded as one object;

S2: calculating the distance d between any two objects;

S3: taking each object as the center, finding its M nearest neighbors and defining the distance from the M-th nearest neighbor to the center as r, repeated distances being calculated only once;

S4: calculating the mean of all r, denoted R;

S5: calculating the density ρ within the R-neighborhood of each data object;

S6: dividing the density ρ of each object by the distance r of its M nearest neighbors, the result being denoted Pro;

S7: sorting Pro in descending order and outputting the objects with the K largest values as the K initial cluster centers, that is, the K typical spectra;

S8: for each remaining object, calculating its distance to each cluster center and, following the nearest-distance principle, assigning it to the cluster of the corresponding initial center, a smaller distance indicating greater similarity, so that each object is assigned to its most similar cluster.

2. The method for fast classification of stellar spectral data according to claim 1, characterized in that in step S1 the stellar spectral data taken by LAMOST is collected and normalized as follows: each spectrum is regarded as one object, and object x_i is assumed to contain P-dimensional data; the mean is calculated and the coordinates are then normalized (the expressions for x_i, the mean, and the normalized coordinates are missing from this text); because the raw data contains multiple features and has more than 3000 dimensions, only one of the features is extracted in order to simplify the calculation.

3. The method for fast classification of stellar spectral data according to claim 1, characterized in that the distance d between any two objects in step S2 is calculated as follows: let x_i and x_j be any two spectra in the data set; their Euclidean distance is

d_p(x_i, x_j) = sqrt( Σ_{a_m ∈ P} ( f(x_i, a_m) - f(x_j, a_m) )^2 )

where a_m ∈ P is an attribute of the objects and f(x_i, a_m) is the value of object x_i on attribute a_m.

4. The method for fast classification of stellar spectral data according to claim 1, characterized in that the mean of all r in step S4 is calculated as follows:

R = (1/n) Σ_{i=1}^{n} r_i

where r_i is the distance r found for object x_i and n is the number of elements in the data set.

5. The method for fast classification of stellar spectral data according to claim 1, characterized in that the density ρ within the R-neighborhood of each data object in step S5 is calculated as follows:

taking point x_i as the center, if the Euclidean distance between x_i and x_j satisfies d_p(x_i, x_j) ≤ R, then x_j is an element of the cluster formed with x_i as its cluster center and the density value of x_i is increased by 1; otherwise the density value of x_i is increased by 0; the density ρ_i of point x_i is calculated as:

ρ_i = Σ_j χ(x_i, x_j), where χ(x_i, x_j) = 1 if d_p(x_i, x_j) ≤ R and χ(x_i, x_j) = 0 otherwise.

6. The method for fast classification of stellar spectral data according to claim 1, characterized in that Pro in step S6 is calculated as:

Pro_i = ρ_i / r_i