CN114386466B - Parallel hybrid clustering method for candidate signal mining in pulsar search - Google Patents


Info

Publication number
CN114386466B
Authority
CN
China
Prior art keywords
data
cluster
pulsar
density
clustering
Prior art date
Legal status
Active
Application number
CN202210036692.XA
Other languages
Chinese (zh)
Other versions
CN114386466A
Inventor
游子毅
刘莹
马智
李思瑶
王培�
童超
Current Assignee
Guizhou Education University
Original Assignee
Guizhou Education University
Priority date
Filing date
Publication date
Application filed by Guizhou Education University
Priority to CN202210036692.XA
Publication of CN114386466A
Application granted
Publication of CN114386466B

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 2218/00: Aspects of pattern recognition specially adapted for signal processing
          • G06F 2218/12: Classification; Matching
          • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
          • G06F 17/10: Complex mathematical operations
          • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
          • G06F 18/00: Pattern recognition
          • G06F 18/20: Analysing
          • G06F 18/22: Matching criteria, e.g. proximity measures
          • G06F 18/23: Clustering techniques
          • G06F 18/232: Non-hierarchical techniques
          • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
          • G06F 18/23213: Non-hierarchical techniques with fixed number of clusters, e.g. k-means clustering
          • G06F 18/24: Classification techniques
          • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
          • G06F 18/2413: Classification techniques based on distances to training or reference patterns
          • G06F 18/24147: Distances to closest patterns, e.g. nearest neighbour classification


Abstract

The invention discloses a parallel hybrid clustering method for candidate-signal mining in pulsar search, comprising the following steps: performing cluster analysis of the pulsar candidate signals; grouping the data set with a sliding-window strategy, dividing it by a specific window value of Batchsize = 1160 and setting the sliding-window size to w = 2; selecting 1600 relatively complete pulsar candidate feature records from real samples as a sample set and adding them to the data to be detected covered by each round's sliding window to form one data block, so that the data set is divided into several parallel data blocks of equal size; and realizing the clustering through data-block parallelization based on the MapReduce/Spark computing model. The invention improves clustering performance and screening recall while reducing execution time.

Description

Parallel hybrid clustering method for candidate signal mining in pulsar search
Technical Field
The invention belongs to the technical field of astronomy, and particularly relates to a parallel hybrid clustering method for candidate signal mining in pulsar searching.
Background
Discoveries in the pulsar field have strongly promoted the development of astronomy, physics, navigation and other related domains. With the completion of the Five-hundred-meter Aperture Spherical radio Telescope (FAST) and the commissioning of its 19-beam receiver, the telescope's high sensitivity and large sky coverage have greatly expanded the pulsar signal search range and the volume of observational data; how to effectively screen pulsar candidates from such massive data has become the key to pulsar search.
the basic pulsar search needs to be completed by searching stable periodic pulse signals in a two-dimensional space consisting of P (period) -DM (dispersion quantity); currently, conventional methods of graphics tool assistance or statistics-based have failed to meet the needs of such huge data volume processing; candidate screening of pulsar by artificial intelligence technology is mainly divided into three types according to the principle of the method; the first class is a candidate ordering algorithm based on empirical formulas; such algorithms rely on assumptions such as signal-to-noise ratio, pulse profile shape, etc., many of which do not fit well in practice and may result in pulses of special shapes such as broad pulses, partial DM curves, or low flow pulsars being missed; the second category is a neural network image recognition model which directly utilizes the candidate diagnosis map to automatically extract the characteristics; compared with the traditional machine learning method, the algorithm has better generalization, but the subgraph of each training data needs to be marked manually, and the sample training requirement is large, so that a great deal of extra labor is input; the third class is a machine learning based classification algorithm; feature selection by means of human experience screening is key to influence binary classification results of pulsar screening, and incomplete feature design schemes possibly weaken performance of models, so that feature design problems are particularly key; in addition, some hybrid models integrated by multiple methods also achieve remarkable effects;
in actual large-scale pulsar data calculation and search, since most of input data sets are unlabeled data, and the proportion of pulsar sample data and non-pulsar sample data is extremely unbalanced, the time cost and the workload for identifying pulsar candidates by using a supervised learning classification method are quite large;
experimental dataset HTRU2 multi-beam (13 beams) observations from Parkes telescope, australia, the DM value of the pulsar signal search tube used was set to 0 to 2000cm-3pc, describing the pulsar candidate sample data collected during high time resolution cosmic survey based on PRESTO (Pulsar Exploration and Search Toolkit) software processing; pulsar search and analysis kits developed by PRESTO us NRAO radio astronomical station, now used for multiple night-time, processing short integration time data and X-ray data; the HTRU2 data set contains 17898 data samples in total, with 16259 false examples generated by RFI or noise and 1639 true pulsar examples; the characteristic values comprise 8 attributes including a mean value of a pulse profile, a standard deviation of the pulse profile, an excess kurtosis of the pulse profile, a skewness of the pulse profile, a mean value of a DM-S/N curve, a standard deviation of the DM-S/N curve, an excess kurtosis of the DM-S/N curve and a skewness of the DM-S/N curve; HTRU2 is an open, relatively sample-rich dataset with high acceptance, and is therefore widely used to evaluate the performance of pulsar candidate classification algorithms;
clustering is one of key methods for processing large-scale data mining problems, and comprises clustering algorithms based on division, density, grid and the like; k-means is widely used as a clustering algorithm based on division, but the original k-means has the defects that the clustering effect depends on the selection of an initial center point, only numerical data can be dealt with, abnormal value interference is large and the like; therefore, many scholars have been improving the algorithm. (price-Preserving Mechanisms for K-Modes modeling, computers and Security, 2018) proposes a K-MODES algorithm for solving the disadvantage that K-means can only cope with numerical data; the clustering method based on density, such as a typical DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm, can find clusters of any shape, but has large clustering samples and long convergence time, and has poor clustering effect on the condition of uneven cluster density; (Clustering by fast search and find of density peaks, science, 2014) proposes a fast search clustering algorithm based on density peaks, the main idea being that the density of cluster centers should be greater than the density of surrounding neighbors, and the distance between different cluster centers is relatively far; to overcome this drawback, (McDPC: multi-center density peak clustering, neural Computing and Applications, 2020) further proposes a multi-center clustering method based on density hierarchy division; hierarchical-based clustering does not require a predesignated number of clusters and can discover hierarchical relationships of classes, but the computational complexity is too high.
Disclosure of Invention
The invention aims to overcome the above defects by providing a parallel hybrid clustering method for candidate-signal mining in pulsar search that improves clustering performance and screening recall while reducing execution time.
This aim is achieved by the following technical scheme:
the invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar search, which comprises the following steps:
(1) Cluster analysis of pulsar candidate signals:
calculating the density of data points with a K-nearest-neighbour polynomial kernel function, screening out samples whose density value is below a threshold of 0.01, further judging via the candidate diagnostic plots whether these samples are noise or new astronomical phenomena, and thereby eliminating the interference of low-density outliers;
combining the characteristics of the density-peak and hierarchical clustering processes to divide the multi-density cluster levels in the data set, merging micro-cluster groups of similar density and adjacent distance within the same region, and determining the initial cluster-centre points;
using k-means iterations based on the Gaussian radial basis function (RBF) distance to assign all data points and optimize the cluster centres; the similarity between sample data points is computed with the RBF kernel function, which realizes the transformation of the distance measure into a high-dimensional space;
(2) Grouping the data set with a sliding-window strategy: dividing it by a specific window value of Batchsize = 1160 and setting the sliding-window size to w = 2; selecting 1600 relatively complete pulsar candidate feature records from real samples as a sample set and adding them to the data to be detected covered by each round's sliding window to form one data block, so that the data set is divided into several parallel data blocks of equal size;
(3) Realizing the clustering through data-block parallelization based on the MapReduce/Spark computing model.
In the above parallel hybrid clustering method for candidate-signal mining in pulsar search, the cluster-analysis method of step (1) is as follows:
(1) performing data preprocessing: feature selection and dimensionality reduction are applied to the pulsar candidate data of the PRESTO (Pulsar Exploration and Search Toolkit) search pipeline through a feature-extraction method (Fifty Years of Pulsar Candidate Selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), yielding a new feature-space input data set with feature dimension b; optional candidate physical feature values include pulse radiation (unimodal, bimodal and multimodal), period, dispersion value, signal-to-noise ratio, noise signal, signal slope, and the sums of incoherent and coherent power;
(2) calculating the Mahalanobis distance between data points i and j according to equation (1):

d_ij = √( (x_i − x_j)ᵀ S⁻¹ (x_i − x_j) )   (1)

where S is the covariance matrix of the multidimensional random variable; then calculating the local polynomial-kernel density of each data point over its K nearest neighbours according to equation (2), the global character of the polynomial kernel giving it strong generalization performance:

ρ_i = Σ_{j ∈ KNN(i)} (x_iᵀ x_j + c)^d   (2)

where c is a bias coefficient and d is the order of the polynomial; to eliminate the influence of differing variation ranges and magnitudes, min-max (dispersion) normalization is applied to d_ij and ρ_i:

d′_ij = (d_ij − min_d) / (max_d − min_d)   (3)
ρ′_i = (ρ_i − min_ρ) / (max_ρ − min_ρ)   (4)

where min_d and min_ρ denote the minima of d_ij and ρ_i, and max_d and max_ρ their maxima;
(3) removing outliers according to equation (5) and calculating the distance δ_i between non-outliers according to equation (6):

inlier = { i : ρ_i > ρ_threshold },  ρ_threshold = 0.01   (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij   (6)

(for the point of highest density, δ_i is taken as max_j d_ij); removing outliers facilitates the selection of cluster-centre points; moreover, data points of very low density are few in number and marginal in distribution; owing to their scarcity and low density they appear as anomalies in the data distribution and may be pure noise or astronomical phenomena (such as special pulsars), so this portion of the data is further examined through the corresponding candidate diagnostic plots;
(4) all data points with distance δ greater than a threshold λ generate a two-dimensional decision graph, with density ρ on the horizontal axis and distance δ on the vertical axis; micro-cluster groups are merged by density level on the decision graph as follows: if the regions into which the ρ-axis or δ-axis is divided include two or more intervals containing no data points, each such interval is called a void region; the void region divides all data points into two density regions, of which the rightmost is called the maximum-density region and the rest are low-density regions;
(A) in a low-density region, the micro-clusters corresponding to the region are merged into one cluster class because their discrimination is low;
(B) in the maximum-density region, if all representative points lie in the same δ region, they are selected as independent cluster centres; if the representative points do not lie in the same δ region, the distance discrimination between them is not high and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster;
(5) determining the cluster number k and the centre center_i of each corresponding cluster C_i (1 ≤ i ≤ k);
(6) assigning each data point x_j to the cluster of its nearest centre center_i according to the nearest-neighbour principle, using the RBF kernel distance as the similarity measure, as shown in equation (7); the RBF kernel function has local character and strong learning capability, and the kernel distance realizes the transformation of the distance measure into a high-dimensional space:

K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2η²) )   (7)

where η denotes the kernel width; the new centre center′_i of cluster C′_i is then computed according to equation (8) as the mean of all its data points, where n_i denotes the number of data points belonging to C′_i:

center′_i = (1 / n_i) Σ_{x ∈ C′_i} x   (8)
(7) calculating the sum of squared errors SSE over all objects of the data set:

SSE = Σ_{i=1}^{k} Σ_{x ∈ C_i} ‖x − center_i‖²   (9)

the algorithm stops when the SSE value no longer changes; otherwise it returns to step (6).
In the above parallel hybrid clustering method, the sliding-window grouping of the data set in step (2) proceeds as follows: to screen candidates as accurately and completely as possible given the data structure, the data are divided with a sliding-window concept; first, a window size (Batchsize = 1160) is defined and the data set to be detected is divided equally into L batches (if the last batch is short, it is padded with data from the first batch); the sliding-window size is set to w = 2, the window starts from batches 1 and 2 and advances by one batch per round; the last round points to the combination of the last batch and the first batch, so L rounds of division are executed in total; 1600 relatively complete pulsar candidate feature records selected from real samples serve as a sample set and are added to each round's window data to form a data block to be detected, so the data set is divided into L parallel data blocks to be detected. A basic assumption holds here: examples in the same cluster are more likely to share the same label; therefore decision boundaries are set according to the dense or sparse regions of the various data distributions, determining the pulsar data-distribution region and dividing the space into pulsar-signal and non-pulsar-interference regions; by computing the distribution density of pulsar samples within each cluster as a similarity statistic, clusters in which pulsar samples account for more than 50% are selected into the pulsar candidate list; the noise-point list excluded in step (3) of the cluster analysis may yield new phenomenon discoveries.
In the above parallel hybrid clustering method, the data-block parallelization of step (3) based on the MapReduce/Spark computing model proceeds as follows: for large-scale pulsar data processing it is, by the Sun-Ni theorem, well worthwhile to study the parallel realization of the clustering algorithm in a MapReduce computing model: on one hand the accuracy of the clustering result can be improved, on the other hand the number of data comparisons can be reduced; the Sun-Ni theorem introduces a function G(p) describing the growth of workload when storage capacity is limited, and proposes scaling the problem to use memory space effectively whenever sufficient memory exists and the time limit required by the fixed-time speedup ratio is met. First, the data are divided into L data blocks Block(1), ..., Block(L) by the sliding-window method above and then processed in parallel. Next, the Map1 and Reduce1 functions complete the density computation of the data points in each Block(i) (1 ≤ i ≤ L) and the selection of the initial cluster centres (note that in the <key, value> input of the Map stage the key is the line number and the value is the list of the current sample's per-dimension values; the Reduce stage outputs key.id, i.e. the initial cluster centres). Finally, the Map2 and Reduce2 functions iterate to compute the distance from every data point in Block(i) to the cluster centres cluster_centers(i) and to re-label the cluster each point belongs to, the Reduce2 function computing the new cluster centres in preparation for the next clustering round; the distance between the current cluster centres and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends, otherwise the new centres become the centres of the next round; after clustering finishes, the pulsar clusters and abnormal noise points are extracted. Spark is a general computing engine for large-scale data processing whose computation process is similar to MapReduce.
Compared with the prior art, the invention has obvious advantages and beneficial effects. In the technical scheme, first, the density of data points is computed with a K-nearest-neighbour polynomial kernel function and the interference of low-density outliers is eliminated; second, density peaks and hierarchy are combined to divide the multi-density cluster levels and thereby determine the initial cluster centres; third, k-means iterations based on the Gaussian radial basis function (RBF) distance perform data-point assignment and cluster-centre optimization. Through the sliding-window data-partitioning strategy and the MapReduce/Spark parallelization design, the scheme preserves the candidate clustering quality while greatly reducing the running time. Experimental comparison with other common machine-learning classification methods on the Parkes High Time Resolution Universe survey (HTRU2) data set shows that the proposed scheme achieves better results in Precision and Recall, namely 0.946 and 0.905 respectively. According to the Sun-Ni theorem, when enough parallel execution nodes are available and communication cost is negligible, the total running time of the algorithm decreases markedly in theory. Owing to its similarity-clustering characteristic, the method can, while improving candidate screening efficiency, produce more reference-worthy clusterings and promote the discovery of new phenomena (such as special pulsar signals).
Drawings
FIG. 1a is the ρ partition of the two-dimensional decision graph for density-hierarchy clustering;
FIG. 1b is the δ partition of the two-dimensional decision graph for density-hierarchy clustering;
FIG. 2 shows the sliding-window-based data allocation;
FIG. 3 is the MapReduce flow chart;
FIG. 4 compares average run times.
Detailed Description
Examples:
referring to fig. 1 to 3, the parallel hybrid clustering method for candidate signal mining in pulsar search of the present invention includes the following steps:
1. hybrid cluster analysis
(1) Data preprocessing is performed: feature selection and dimensionality reduction are applied to the pulsar candidate data of the PRESTO (Pulsar Exploration and Search Toolkit) search pipeline through a feature-extraction method (Fifty Years of Pulsar Candidate Selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), yielding a new feature-space input data set with feature dimension b. Optional candidate physical feature values include pulse radiation (unimodal, bimodal and multimodal), period, dispersion value, signal-to-noise ratio, noise signal, signal slope, the sums of incoherent and coherent power, and the like.
(2) The Mahalanobis distance between data points i and j is calculated according to equation (1):

d_ij = √( (x_i − x_j)ᵀ S⁻¹ (x_i − x_j) )   (1)

where S is the covariance matrix of the multidimensional random variable. The local polynomial-kernel density of each data point is then computed over its K nearest neighbours according to equation (2), the global character of the polynomial kernel giving it strong generalization performance:

ρ_i = Σ_{j ∈ KNN(i)} (x_iᵀ x_j + c)^d   (2)

where c is a bias coefficient and d is the order of the polynomial. To eliminate the influence of differing variation ranges and magnitudes, min-max (dispersion) normalization is applied to d_ij and ρ_i:

d′_ij = (d_ij − min_d) / (max_d − min_d)   (3)
ρ′_i = (ρ_i − min_ρ) / (max_ρ − min_ρ)   (4)

where min_d and min_ρ denote the minima of d_ij and ρ_i, and max_d and max_ρ their maxima.
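The computations of this step can be sketched in Python. This is not the patent's code: the exact kernel-density form (here the sum of the polynomial kernel (x_iᵀx_j + c)^d over the K neighbours) and all function names are illustrative assumptions.

```python
import numpy as np

def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances d_ij (eq. 1); S = covariance of X."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]                  # shape (n, n, b)
    q = np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)    # quadratic forms
    return np.sqrt(np.maximum(q, 0.0))                    # clip float noise

def knn_poly_density(X, D, k=3, c=1.0, d=2):
    """Assumed local density: polynomial kernel (x_i.x_j + c)^d summed
    over the k nearest neighbours of each point (eq. 2)."""
    rho = np.empty(len(X))
    for i in range(len(X)):
        nn = np.argsort(D[i])[1:k + 1]                    # skip the point itself
        rho[i] = sum((X[i] @ X[j] + c) ** d for j in nn)
    return rho

def minmax(a):
    """Dispersion (min-max) normalization to [0, 1] (eqs. 3-4)."""
    return (a - a.min()) / (a.max() - a.min())
```

The normalized d′_ij and ρ′_i then feed the outlier screening of step (3).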
(3) Outliers are removed according to equation (5) and the distance δ_i between non-outliers is calculated according to equation (6):

inlier = { i : ρ_i > ρ_threshold },  ρ_threshold = 0.01   (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij   (6)

(for the point of highest density, δ_i is taken as max_j d_ij). Removing outliers facilitates the selection of cluster-centre points. Moreover, data points of very low density are few in number and marginal in distribution; owing to their scarcity and low density they appear as anomalies in the data distribution and may be pure noise or astronomical phenomena (such as special pulsars). This portion of the data is therefore examined further through the corresponding candidate diagnostic plots.
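A minimal sketch of the outlier screening and δ computation (function names and the max-distance convention for the densest point are assumptions; the method applies δ only to inliers, while this sketch computes it for every point):

```python
import numpy as np

def split_outliers(rho, threshold=0.01):
    """Eq. (5): points with density above the threshold are inliers."""
    inlier = rho > threshold
    return inlier, ~inlier

def delta_distances(D, rho):
    """Eq. (6): delta_i = distance to the nearest point of higher density;
    the densest point gets the maximum distance, as in density-peaks clustering."""
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    return delta
```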
(4) All data points whose distance δ exceeds the threshold λ generate a two-dimensional decision graph. For example, the two-dimensional decision graph of one randomly generated data set is shown in FIG. 1, where the horizontal axis represents density ρ and the vertical axis distance δ.
Assume the ρ-axis and δ-axis of this example decision graph are divided at intervals θ and γ respectively; FIG. 1a shows the ρ partition and FIG. 1b the δ partition of the randomly generated data set, with θ = 2γ = 0.2.
If the regions into which the ρ-axis or δ-axis is divided include two or more intervals containing no data points, each such interval is called a void region. In FIGS. 1a and 1b the void region divides all data points into two density regions: the rightmost is called the maximum-density region and the rest are low-density regions.
1) In a low-density region, the micro-clusters corresponding to the region are merged into one cluster class because their discrimination is low;
2) In the maximum-density region, if all representative points lie in the same δ region they are selected as independent cluster centres; if they do not lie in the same δ region, the distance discrimination between them is not high and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster.
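The void-region split described above can be sketched as follows, assuming densities already normalized to [0, 1] and a ρ-axis interval θ; the function and the rule of cutting at the rightmost void interval are an illustrative reading of the text, not the patent's code:

```python
import numpy as np

def density_regions(rho_norm, theta=0.2):
    """Split normalized densities at the rightmost empty (void) interval of
    width theta; True marks the maximum-density region, False low-density."""
    edges = np.arange(0.0, 1.0 + theta, theta)
    counts, _ = np.histogram(rho_norm, bins=edges)
    void_bins = np.where(counts == 0)[0]
    if void_bins.size == 0:                       # no void: one region only
        return np.ones(len(rho_norm), dtype=bool)
    cut = edges[void_bins[-1] + 1]                # right edge of last void bin
    return rho_norm >= cut
```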
(5) The cluster number k and the centre center_i of each corresponding cluster C_i (1 ≤ i ≤ k) are determined.
(6) Each data point x_j is assigned to the cluster of its nearest centre center_i according to the nearest-neighbour principle, using the RBF kernel distance as the similarity measure, as shown in equation (7). The RBF kernel function has local character and strong learning capability, and the kernel distance realizes the transformation of the distance measure into a high-dimensional space:

K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2η²) )   (7)

where η denotes the kernel width. The new centre center′_i of cluster C′_i is then computed according to equation (8) as the mean of all its data points, where n_i denotes the number of data points belonging to C′_i:

center′_i = (1 / n_i) Σ_{x ∈ C′_i} x   (8)
(7) The sum of squared errors SSE over all objects of the data set is calculated:

SSE = Σ_{i=1}^{k} Σ_{x ∈ C_i} ‖x − center_i‖²   (9)

The algorithm stops when the SSE value no longer changes; otherwise it returns to step (6).
2. Sliding window based data set partitioning strategy
In order to delimit a more comprehensive pulsar identification range, candidates are screened as accurately and completely as possible given the data structure, and the data are divided with a sliding-window concept. As shown in FIG. 2, a window size (Batchsize = 1160) is defined and the data set to be detected is divided equally into L batches; 1600 relatively complete pulsar candidate feature records selected from real samples serve as a sample set and are added to the data covered by each round's sliding window (of size w = 2) to form a data block to be detected. A basic assumption holds here: examples in the same cluster are more likely to share the same label. Therefore decision boundaries are set according to the dense or sparse regions of the various data distributions, determining the pulsar data-distribution region and dividing the space into pulsar-signal and non-pulsar-interference regions. By computing the distribution density of pulsar samples within each cluster as a similarity statistic, clusters in which pulsar samples account for more than 50% are selected into the pulsar candidate list; the noise-point list excluded in step (3) of the hybrid cluster analysis may yield new astronomical-phenomenon discoveries.
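The sliding-window block construction can be sketched as follows; the function name and the exact wrap-around and padding details are assumptions drawn from the description (last batch padded from the first, window of w consecutive batches advancing one batch per round):

```python
def sliding_window_blocks(data, sample, batch_size=1160, w=2):
    """Divide `data` into L batches of `batch_size` (padding the last batch
    with items from the first), then build L blocks, each holding the known
    pulsar sample set plus w consecutive batches; the window advances one
    batch per round and wraps around to the first batch."""
    L = -(-len(data) // batch_size)               # ceiling division
    batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(L)]
    short = batch_size - len(batches[-1])
    if short:
        batches[-1] = batches[-1] + data[:short]  # pad from batch 1
    blocks = []
    for r in range(L):
        block = list(sample)
        for j in range(w):
            block += batches[(r + j) % L]
        blocks.append(block)
    return blocks
```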
3. Parallelization design based on MapReduce/Spark model
For large-scale pulsar data processing it is, by the Sun-Ni theorem, well worthwhile to study the parallel realization of the clustering algorithm in a MapReduce computing model: on one hand the accuracy of the clustering result can be improved, on the other hand the number of data comparisons can be reduced. As shown in FIG. 3, the data are first divided into L data blocks Block(1), ..., Block(L) by the sliding-window method above and then processed in parallel. Next, the Map1 and Reduce1 functions complete the density computation of the data points in each Block(i) (1 ≤ i ≤ L) and the selection of the initial cluster centres (note that in the <key, value> input of the Map stage the key is the line number and the value is the list of the current sample's per-dimension values; the Reduce stage outputs key.id, i.e. the initial cluster centres). Finally, the Map2 and Reduce2 functions iterate to compute the distance from every data point in Block(i) to the cluster centres cluster_centers(i) and to re-label the cluster each point belongs to, the Reduce2 function computing the new cluster centres in preparation for the next clustering round. The distance between the current cluster centres and the corresponding centres of the previous round is compared; if the change is smaller than a given threshold the operation ends, otherwise the new centres become the centres of the next round. After clustering finishes, the pulsar clusters and abnormal noise points are extracted.
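A single-process sketch of one Map2/Reduce2 round follows; in the actual scheme these functions run distributed under Hadoop MapReduce or Spark over the L blocks, and the function names here are illustrative assumptions:

```python
from collections import defaultdict

def map2(point, centers):
    """Map2: emit (nearest-centre id, point) for one data point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centre))
             for centre in centers]
    return dists.index(min(dists)), point

def reduce2(assigned):
    """Reduce2: recompute each cluster centre as the mean of its points."""
    new_centers = {}
    for cid, pts in assigned.items():
        dim = len(pts[0])
        new_centers[cid] = tuple(sum(p[d] for p in pts) / len(pts)
                                 for d in range(dim))
    return new_centers

def mapreduce_round(block, centers):
    """Shuffle Map2 output by centre id, then reduce to new centres."""
    groups = defaultdict(list)
    for point in block:
        cid, p = map2(point, centers)
        groups[cid].append(p)
    return reduce2(groups)
```

The driver would repeat `mapreduce_round` until the centre movement drops below the given threshold.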
Experimental example:
The hardware environment is as follows: a Linux cluster with 4 physical compute nodes, comprising 2 Intel Core i7-9700K@3.6GHz CPUs, 1 Intel Core i7-1065G7@1.5GHz CPU and 1 Intel Core i5-9300H@2.4GHz CPU, 32 CPU cores in total (68 GB total RAM and 3 TB total disk space). The software environment is as follows: Anaconda3-4.2.0, Hadoop-2.7.6 and the Spark-2.3.1-bin-hadoop2.6 framework under the CentOS 7 system.
1. Data partitioning
The published HTRU2 data set is used, which has already been processed by feature-extraction methods (Fifty Years of Pulsar Candidate Selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016). The batch size is set to 1160; 1600 of the known pulsars are randomly selected as the pulsar sample set s, and the remaining 39 are randomly mixed into the non-pulsar data samples to form the data set to be detected. Following the data-partitioning strategy of section 4.1, the data set to be detected is divided equally by Batchsize into (t_1, t_2, ..., t_14); the experimental data is thus divided into { Block(1): [s, t_1, t_2], Block(2): [s, t_2, t_3], ..., Block(13): [s, t_13, t_14], Block(14): [s, t_14, t_1] }, 14 data blocks in total. Each Block(i) is clustered separately, and after clustering the clusters whose pulsar-sample ratio is not less than 50% enter the pulsar candidate list.
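The circular block construction above, Block(i) = [s, t_i, t_(i+1)] with wrap-around in the last round, can be sketched as follows (names illustrative):

```python
def make_blocks(s, batches, w=2):
    """Split detection data into L blocks with a circular sliding window.

    s       -- list of known pulsar samples prepended to every block
    batches -- the L equal-size batches (t_1 .. t_L) of data to be detected
    w       -- sliding-window size: each block covers w consecutive batches
    """
    L = len(batches)
    blocks = []
    for i in range(L):
        window = [batches[(i + j) % L] for j in range(w)]  # wraps: last block = [t_L, t_1]
        blocks.append(s + [x for b in window for x in b])
    return blocks

# toy example with L = 3 batches and a 2-sample "pulsar" set
blocks = make_blocks(["p1", "p2"], [["a"], ["b"], ["c"]])
# -> [['p1','p2','a','b'], ['p1','p2','b','c'], ['p1','p2','c','a']]
```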
2. Evaluation index
Candidate classification is commonly evaluated with 4 metrics: Accuracy, Precision, Recall and F1-Score.
Accuracy roughly reflects whether the overall judgment is correct, but cannot objectively reflect classification performance when the data is unbalanced. Precision is the fraction of the samples predicted positive that are truly positive, and Recall is the fraction of all true positive samples that are correctly identified. Since the Precision and Recall of the clusters often contradict each other, the F1-Score is chosen to measure the two metrics comprehensively. Table 1 shows the classification confusion matrix.
Table 1 Confusion matrix
Using the overall Precision, Recall and F1-Score, the evaluation indices of the experiment are set as follows:
where L represents the number of partitioned data blocks, UTP = TP_1 ∪ TP_2 ∪ ... ∪ TP_L represents the union of the pulsars identified within the individual data blocks, Recall_O represents the recall of a single data block, and Recall_total represents the overall recall over all data blocks.
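A minimal sketch of these indices: per-block Precision/Recall/F1 from the confusion-matrix counts, plus the overall recall computed from the union UTP (names illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-block metrics from the confusion-matrix counts of Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def overall_recall(tp_sets, all_pulsars):
    """Recall_total: pulsars found in at least one block over all true pulsars."""
    utp = set().union(*tp_sets)  # UTP = TP_1 ∪ TP_2 ∪ ... ∪ TP_L
    return len(utp) / len(all_pulsars)

print(precision_recall_f1(9, 1, 1))
print(overall_recall([{1, 2}, {2, 3}], {1, 2, 3, 4}))  # 0.75
```

The union matters because the known sample set s recurs in every block and each batch t_i appears in two blocks, so simply summing per-block true positives would double-count pulsars.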
3. Parameter setting
The parameters involved in the experiment include the K-nearest-neighbour parameter K used to calculate the density of data points, the screening density threshold ρ_threshold, the polynomial kernel parameters c and d, the RBF kernel parameter η, the small-cluster threshold λ, the θ value dividing the density regions and the γ value dividing the distance regions. The specific settings are listed in the following table.
Table 2 parameter settings
4. Clustering result analysis
Table 3 compares the performance of different supervised and unsupervised learning algorithms on the HTRU2 data set. Among the unsupervised algorithms, the parallel hybrid clustering algorithm has the highest Recall value, 90.5%. Compared with the supervised learning algorithms, its Recall is lower only than that of
GMO_SNN (pulsar candidate selection based on a self-normalizing neural network, Acta Physica Sinica, 2020), and its F1-Score is lower than those of GMO_SNN, Random Forest (Fifty Years of Pulsar Candidate Selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and KNN, but higher than those of SVM and PNCN (Pulsar candidate selection using pseudo-nearest centroid neighbour classifier, Monthly Notices of the Royal Astronomical Society, 2020). In addition, in multi-round comparison experiments in which 39 pulsars were randomly selected into the data set to be detected, the algorithm detected at most 36 pulsars in a single round, with an average of 34. Owing to the advantages of unsupervised learning and the fast convergence of the hybrid clustering, the method suits scenarios of fast classification mining of large-scale pulsar data. The experimental results show that the hybrid-clustering scheme is feasible and effective. In a real pulsar-search scenario, the clustering effect will improve further as the relevant parameters, the pulsar sample set and the data-partitioning strategy are optimized.
Table 3 effects of different methods on HTRU2 dataset
5. Time complexity analysis
Assume the number of samples in the experimental data set is n; the time complexities of the other algorithms (k-means++, McDPC, PNCN) are shown in Table 4. The time complexity of k-means++ is O(nkTM); since k, T and M are usually regarded as constants, it can be reduced to O(n). For McDPC, the computation of ρ and δ costs O(n^2) and the clustering over the different density levels is also O(n^2), so the time complexity of the whole algorithm is O(n^2). The PNCN complexity is taken from its worst case, O((2nMK + FMK^2)/2), with F and M set constant. The serial time complexity of the hybrid clustering algorithm is O(n^2 + nkTM); since k, T and M are constants, it reduces to O(n^2). On a parallel computing platform, its complexity becomes O(G(P)·m^2) according to the Sun-Ni theorem, where G(P) is a factor and m is the number of samples in Block(i), with m ≪ n. When the number of parallel nodes P is sufficient (P approaches the number of divided data blocks L once L reaches a certain threshold) and the communication overhead is negligible, G(P) → 1, i.e. the complexity approaches O(m^2), slightly worse than k-means++ and PNCN but better than McDPC. This shows that the proposed scheme achieves a quite significant decrease in run time while improving the clustering effect.
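A back-of-envelope check of the claimed reduction, under the assumption that n is the full published HTRU2 sample count (17,898) and m follows from the block construction (1600 sample-set points plus a window of w = 2 batches of 1160); the figures are illustrative only:

```python
# Serial O(n^2) vs per-block O(m^2) with ideal parallelism, G(P) -> 1.
n = 17898          # assumed: published HTRU2 sample count
m = 1600 + 2 * 1160  # Block(i) size: pulsar sample set + sliding window of 2 batches
print(m)                        # 3920
print(round((n / m) ** 2, 1))   # roughly 20x fewer pairwise operations per block
```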
Table 4 algorithm complexity
Note: T is the number of iterations, M is the number of features per element, F is the number of classes, m is the number of samples in Block(i), and k is the number of cluster centers.
Together, the experimental analysis and the time-complexity analysis demonstrate the feasibility and effectiveness of the proposal, and in real scenarios every performance index of the scheme can improve substantially as the data grouping and the relevant parameters are optimized. The unsupervised clustering method is better suited to classifying large unlabeled data sets, and to situations where the ratio of pulsar to non-pulsar sample data is extremely unbalanced.
6. Actual run time
FIG. 4 compares the mean run times of the proposed method (parallel and serial) with McDPC, k-means++ and KNN under the same experimental setup. As the figure shows, serial hybrid clustering has the longest average run time, while parallel hybrid clustering (23.07 s) is far shorter than the others. We therefore conclude that the proposed parallel scheme significantly reduces execution time while maintaining classification performance.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the invention in any way, and any simple modification, equivalent variation and variation of the above embodiment according to the technical matter of the present invention still fall within the scope of the technical scheme of the present invention.

Claims (2)

1. A parallel hybrid clustering method for candidate signal mining in pulsar search comprises the following steps:
(1) Cluster analysis of pulsar candidate signals:
calculating the density of data points with a K-nearest-neighbour polynomial kernel function, screening out the samples whose density value is smaller than the threshold 0.01, further judging with the candidate diagnostic plots whether those samples are noise or new astronomical phenomena, and eliminating the interference of outliers of too-small density;
combining the characteristics of the density-peak and hierarchical clustering processes, dividing the multi-density clusters in the data set into levels, merging micro-cluster groups of similar density and adjacent distance in the same region, and determining the initial cluster center points;
distributing all data points and optimizing the cluster centers by k-means iteration based on the Gaussian radial-basis-function (RBF) kernel distance, the kernel function being used to compute the similarity between sample data points so as to map the distance measure into a high-dimensional space;
(2) Grouping the data set based on a sliding-window grouping strategy: the data set is divided by a specific window value of Batchsize = 1160, and the sliding-window size is set to w = 2; 1600 relatively complete pulsar candidate feature records are selected from real samples as a sample group and added to the data to be detected covered by each round's sliding window to form 1 data block, whereby the data set is divided into several parallel data blocks of the same size;
(3) The clustering is realized by parallelization of data blocks based on a MapReduce/Spark calculation model;
the cluster analysis method in the step (1) comprises the following steps:
(1) performing data preprocessing: feature selection and dimensionality reduction are applied to the pulsar candidate data of the PRESTO-based pulsar search pipeline through a feature-extraction method and principal component analysis (PCA), obtaining an input data set in a new feature space with feature vector b; the optional candidate physical feature values include pulse radiation (unimodal, bimodal and multimodal), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum and coherent power;
(2) calculating the Mahalanobis distance between data points i and j according to equation (1), d_ij = sqrt((x_i - x_j)^T S^-1 (x_i - x_j)),
wherein S is the covariance matrix of the multidimensional random variable; the local polynomial kernel density ρ_i of each data point is then calculated over its K nearest neighbours according to equation (2), the global character of the polynomial kernel function yielding high generalization performance;
wherein c is a bias coefficient and d is the order of the polynomial; to eliminate the influence of differing data ranges and magnitudes, min-max (dispersion) normalization is applied to d_ij and ρ_i as follows: d_ij' = (d_ij - min_d)/(max_d - min_d) and ρ_i' = (ρ_i - min_ρ)/(max_ρ - min_ρ),
wherein min_d and min_ρ respectively denote the minimum values of d_ij and ρ_i, and max_d and max_ρ respectively denote their maximum values;
(3) removing the outliers according to equation (5) and calculating the distance δ_i between non-outliers according to equation (6); removing outliers facilitates the selection of cluster center points; moreover, the data points of too-small density are few in number and their distribution is marginalized; owing to this scarcity and low density, they appear as anomalies in the data distribution, which may be pure noise or astronomical phenomena (such as special pulsars); this portion of the data is therefore further examined through the corresponding candidate diagnostic plots;
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01 (5)
(4) every data point whose distance δ is greater than the threshold λ generates 1 two-dimensional decision graph, with the horizontal axis representing density ρ and the vertical axis representing distance δ; the density-level micro-cluster groups are merged on the two-dimensional decision graph as follows: if the regions divided along the ρ-axis or δ-axis include two or more regions in which no data points exist, such an empty region is called a void region; a void region divides the data points into two density regions, of which the rightmost is called the maximum-density region and the rest are low-density regions;
(A) in a low-density region, the micro clusters of that region are merged into one cluster class owing to their low discrimination;
(B) in the maximum-density region, if all representative points lie in the same δ region, the representative points are selected as independent cluster centers; if the representative points do not lie in the same δ region, the discrimination between them is not high and they may belong to the same cluster, so the corresponding micro clusters need to be merged into one large cluster;
(5) determining the number k of cluster classes and the center center_i of each corresponding cluster C_i (1 ≤ i ≤ k);
(6) assigning each data point x_j, according to the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as shown in equation (7); the RBF kernel function has local character and strong learning capability, and the mapping of the distance measure into a high-dimensional space is realized through the RBF kernel distance;
wherein η represents the kernel width; the mean of all data points in the new cluster C_i' is calculated according to equation (8) as the new center_i', with n_i representing the total number of data points belonging to C_i';
(7) calculating the sum of squared errors SSE = Σ_{i=1..k} Σ_{x ∈ C_i'} ||x - center_i'||^2 over all objects of the data set;
the algorithm stops when the SSE value no longer changes; otherwise it returns to step (6);
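Steps (6)-(7) can be sketched as one assignment-update iteration. Since equations (7)-(9) are not reproduced in this text, the sketch assumes the standard kernel-induced distance sqrt(2 - 2k(x,c)) for the RBF kernel k(a,b) = exp(-||a-b||^2 / (2*eta^2)); all names are illustrative:

```python
import numpy as np

def rbf_kernel_distance(x, c, eta=1.0):
    """Assumed form of equation (7): the feature-space distance induced by an
    RBF kernel, sqrt(k(x,x) + k(c,c) - 2k(x,c)) = sqrt(2 - 2k(x,c))."""
    k = np.exp(-np.sum((x - c) ** 2) / (2 * eta ** 2))
    return np.sqrt(2 - 2 * k)

def kernel_kmeans_step(X, centers, eta=1.0):
    """One pass of steps (6)-(7): assign each point to its nearest center under
    the RBF kernel distance, recompute each center as the mean of its new
    cluster (cf. equation (8)), and return the sum of squared errors."""
    k = len(centers)
    d = np.array([[rbf_kernel_distance(x, c, eta) for c in centers] for x in X])
    labels = d.argmin(axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = sum(np.sum((x - new_centers[j]) ** 2) for x, j in zip(X, labels))
    return new_centers, labels, sse
```

Because this kernel distance is monotone in the Euclidean distance, the assignment coincides with the Euclidean one; the kernel form matters once the width η interacts with other similarity terms.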
the method of step (3), realizing the clustering through data-block parallelization on a MapReduce/Spark computation model, is as follows: for large-scale pulsar data processing, parallelizing the clustering algorithm in a MapReduce computation model is, by the Sun-Ni theorem, well worth pursuing: on one hand it can improve the accuracy of the clustering result, on the other hand it can reduce the number of data comparisons; the Sun-Ni theorem introduces a function G(p) to represent the growth of the workload when the storage capacity is limited, and proposes that, on the premise of meeting the time limit specified by the fixed-time speedup model, the problem should be scaled up when sufficient memory space is available so as to utilize the memory effectively; first, the data is divided into L data blocks (Block(1) ... Block(L)) by the sliding-window method described above and then processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster center points (it should be noted that in the <key, value> input of the Map stage the key is the line number and the value is the list of the current sample's values in each dimension, and the Reduce stage outputs key.id, i.e. the initial cluster centers); finally, the Map2 and Reduce2 functions iterate to compute the distance from each data point in Block(i) to the cluster centers(i) and re-label the cluster category each point belongs to, the Reduce2 function computing the new cluster centers in preparation for the next round of clustering; the distance between each current cluster center and the corresponding center of the previous round is compared, and the run ends if the change is smaller than a given threshold; otherwise the new cluster centers become the centers of the next round; after clustering finishes, the pulsar clusters and abnormal noise points are extracted; Spark is a general-purpose computing engine for large-scale data processing whose computation process is similar to MapReduce.
2. The parallel hybrid clustering method for candidate signal mining in pulsar search according to claim 1, wherein the sliding-window grouping strategy for the data sets in step (2) is as follows: to screen candidates as accurately as the data structure allows, the data is partitioned using a sliding-window concept; first a window size is defined (Batchsize = 1160) and the data set to be detected is divided equally into L blocks (if the data amount of the last block is insufficient it is padded with data from the 1st block); the sliding-window size is set to w = 2, starting from the 1st and 2nd blocks, and in each round the window advances by 1 position to point at the corresponding data blocks; the last round points at the combination of the last block and the 1st block, so L rounds of division are executed in total; 1600 relatively complete pulsar candidate feature records are selected from real samples as the sample group and added to the data covered by each round's sliding window to form one data block to be detected, whereby the data set is divided into L parallel data blocks to be detected; a basic assumption is that examples in the same cluster are more likely to possess the same label; decision boundaries are therefore set along the dense and sparse regions of the various data distributions, determining the pulsar data distribution region and dividing pulsar signals from non-pulsar interference signals; the distribution density of pulsar samples in each cluster is calculated as the similarity statistic, and the clusters whose pulsar-sample ratio exceeds 50% enter the pulsar candidate list; the noise-point list excluded in step (3) of the cluster analysis method may yield discoveries of new phenomena.
CN202210036692.XA 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search Active CN114386466B (en)

Publications (2)

Publication Number Publication Date
CN114386466A CN114386466A (en) 2022-04-22
CN114386466B true CN114386466B (en) 2024-04-05

Family

ID=81201874


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611025B (en) * 2023-05-19 2024-01-26 Guizhou Normal University Multi-mode feature fusion method for pulsar candidate signals
CN116360956B (en) * 2023-06-02 2023-08-08 Jinan Dalu Electromechanical Co., Ltd. Data intelligent processing method and system for big data task scheduling

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344019A (en) * 2021-01-20 2021-09-03 Kunming University of Science and Technology K-means algorithm for improving decision value selection initial clustering center

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11137462B2 (en) * 2016-06-10 2021-10-05 Board Of Trustees Of Michigan State University System and method for quantifying cell numbers in magnetic resonance imaging (MRI)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yuanchao; Zheng Jianhua; Pan Zhichen; Li Mingtao. A survey of classification methods for pulsar candidate samples. Journal of Deep Space Exploration, 2018, (03), full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant