CN114386466A - Parallel hybrid clustering method for candidate signal mining in pulsar search - Google Patents

Parallel hybrid clustering method for candidate signal mining in pulsar search

Info

Publication number
CN114386466A
Authority
CN
China
Prior art keywords
data
pulsar
cluster
clustering
density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210036692.XA
Other languages
Chinese (zh)
Other versions
CN114386466B (en)
Inventor
游子毅
刘莹
马智
李思瑶
王培�
童超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Education University
Original Assignee
Guizhou Education University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Education University filed Critical Guizhou Education University
Priority to CN202210036692.XA priority Critical patent/CN114386466B/en
Publication of CN114386466A publication Critical patent/CN114386466A/en
Application granted granted Critical
Publication of CN114386466B publication Critical patent/CN114386466B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar search, which comprises the following steps: performing cluster analysis of the pulsar candidate signals; grouping the data set with a sliding-window strategy, dividing it by a window value Batchsize = 1160 with the sliding-window size set to w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference set and adding them to the data to be detected in each sliding window to form one data block, so that the data set is divided into several parallel data blocks of equal size; and parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering. The invention improves clustering performance, raises the screening recall rate and reduces execution time.

Description

Parallel hybrid clustering method for candidate signal mining in pulsar search
Technical Field
The invention belongs to the technical field of astronomy, and particularly relates to a parallel hybrid clustering method for candidate signal mining in pulsar search.
Background
Discoveries in the pulsar field have strongly promoted the development of related fields such as astronomy, physics and navigation. With the completion of the Five-hundred-meter Aperture Spherical radio Telescope (FAST) and the commissioning of its 19-beam receiver for sky surveys, high sensitivity and wide sky coverage have greatly enlarged the pulsar signal search range, but the observation data have also increased enormously; how to effectively screen pulsar candidates from massive data has therefore become the key to pulsar search;
the work required to be completed in the basic pulsar search is to search a stable periodic pulse signal in a two-dimensional space consisting of P (period) -DM (dispersion amount); at present, conventional methods aided by graphical tools or based on statistics have not been able to meet the needs of such huge data volume processing; the artificial intelligence technology is applied to the screening of pulsar candidates and is mainly divided into three categories according to the principle of the method; the first type is a candidate ordering algorithm based on empirical formulas; such algorithms rely on assumptions such as signal-to-noise ratio, pulse profile shape, etc., and many in practice do not fit well and may result in some specially shaped pulses, such as broad pulses, biased DM curves, or low flow rate pulsar being missed; the second type is a neural network image recognition model which directly utilizes the candidate body diagnostic graph to automatically extract features; compared with the traditional machine learning method, the generalization of the algorithm is better, but each sub-graph of training data needs to be marked manually, and the sample training requirement is larger, so that a large amount of extra labor is invested; the third category is machine learning based classification algorithms; feature selection screened by human experience is the key to influence the binary classification result of pulsar screening, and an incomplete feature design scheme can weaken the performance of a model, so that the feature design problem is particularly critical; in addition, some hybrid models with multi-method integration also achieve remarkable effects;
in actual large-scale pulsar data computation and search, most input data sets are unlabelled, and the ratio of pulsar to non-pulsar samples is extremely unbalanced, so the time cost and workload of identifying pulsar candidates with supervised-learning classification methods are considerable;
the experimental data set HTRU2 comes from multi-beam (13-beam) observations of the Australian Parkes telescope, with the DM range of the pulsar-signal search pipeline set to 0 to 2000 cm^-3 pc; it describes pulsar candidate sample data collected during the High Time Resolution Universe survey and processed with the PRESTO (PulsaR Exploration and Search TOolkit) software. PRESTO, a pulsar search and analysis suite developed by the US National Radio Astronomy Observatory (NRAO), is now used in many sky surveys to process short-integration-time data and X-ray data. The HTRU2 data set contains 17898 samples in total: 16259 spurious instances caused by RFI or noise and 1639 real pulsar instances. The feature values comprise 8 attributes: mean of the pulse profile, standard deviation of the pulse profile, excess kurtosis of the pulse profile, skewness of the pulse profile, mean of the DM-S/N curve, standard deviation of the DM-S/N curve, excess kurtosis of the DM-S/N curve and skewness of the DM-S/N curve. HTRU2 is an open and relatively sample-rich data set with high acceptance and is therefore widely used to evaluate the performance of pulsar candidate classification algorithms;
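For orientation, a minimal sketch of loading such a feature table is given below; the file name HTRU_2.csv, the headerless CSV layout and the column names are assumptions based on the publicly distributed HTRU2 release and are not taken from this patent.

import pandas as pd

# Assumed layout of the public HTRU2 release: 8 statistical features plus a 0/1 class label.
FEATURES = [
    "profile_mean", "profile_std", "profile_kurtosis", "profile_skewness",
    "dm_snr_mean", "dm_snr_std", "dm_snr_kurtosis", "dm_snr_skewness",
]

def load_htru2(path="HTRU_2.csv"):
    """Load the HTRU2 candidate table (headerless CSV assumed) into a feature matrix and label vector."""
    df = pd.read_csv(path, header=None, names=FEATURES + ["label"])
    X = df[FEATURES].to_numpy(dtype=float)   # 17898 x 8 feature matrix
    y = df["label"].to_numpy(dtype=int)      # 1 = pulsar, 0 = RFI / noise
    return X, y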
clustering is one of key methods for processing large-scale data mining problems, and comprises clustering algorithms based on division, density, grids and the like; the k-means is widely applied as a clustering algorithm based on division, but the original k-means has the defects that the clustering effect depends on the selection of an initial central point, the numerical data can only be responded, the interference of abnormal values is large, and the like; therefore, many scholars have been improving on this algorithm. (Privacy-monitoring Mechanisms for K-models Clustering, Computers and Security, 2018) proposes a K-MODES algorithm for solving the defect that K-means can only deal with numerical data; a density-Based Clustering method, such as a typical DBSCAN (sensitivity Based Spatial Clustering of Applications with noise) algorithm, can find clusters of any shape, but has large Clustering samples, long convergence time and poor Clustering effect on the condition of non-uniform cluster density; (Clustering by fast search and find of diversity peaks, Science 2014) proposes a fast search Clustering algorithm based on density peaks, which has the main idea that the density of cluster centers should be greater than that of surrounding neighbors, and the distances between different cluster centers are relatively far; to overcome this drawback, (McDPC: multi-center dense peak clustering, Neural Computing and Applications, 2020) further proposes a multi-center clustering method based on density hierarchical partitioning; the hierarchy-based clustering does not need to specify the number of clusters in advance and can discover the hierarchical relationship of the clusters, but the calculation complexity is too high.
Disclosure of Invention
The invention aims to overcome the above defects and to provide a parallel hybrid clustering method for candidate signal mining in pulsar search that improves clustering performance, raises the screening recall rate and reduces execution time.
The purpose of the invention and the solution of its main technical problems are achieved by the following technical scheme:
the invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar search, which comprises the following steps of:
(1) clustering analysis of pulsar candidate signals:
calculating the density of the data points with a K-nearest-neighbour polynomial kernel function, screening out samples whose density value is smaller than the threshold 0.01, further judging through the candidate diagnostic plots whether these samples are noise or new astronomical phenomena, and eliminating the interference of outliers with too small a density;
combining the characteristics of the density-peak and hierarchical clustering processes, partitioning the multi-density cluster levels in the data set, merging micro-cluster groups with similar density and adjacent distance in the same region, and determining the initial cluster centre points;
assigning all data points and optimizing the cluster centres with k-means iterations based on the Gaussian radial basis function (RBF) kernel distance, the similarity between sample data points being calculated with the RBF kernel function so that the distance measure is mapped to a high-dimensional space;
(2) grouping the data set with a sliding-window strategy, dividing it by a specific window value Batchsize = 1160 and setting the sliding-window size w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference sample set, adding it to the data to be detected in each sliding window to form one data block, and thereby dividing the data set into several parallel data blocks of equal size;
(3) parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the cluster analysis method of step (1) is:
firstly, data preprocessing is carried out: the pulsar candidate data produced in the PRESTO (PulsaR Exploration and Search TOolkit) based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), yielding a new feature-space input data set with feature vectors of dimension b; the selectable candidate physical feature values comprise pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum and coherent power;
② calculating the Mahalanobis distance between the data points i and j according to the formula (1)
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
Wherein S is a covariance matrix of the multi-dimensional random variables; then, calculating the local Polynomial core density of each data point based on K neighbor and the global property of the Polynomial core function according to the formula (2), so that the generalization performance of the Polynomial core function is strong;
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
wherein c is the offset coefficient and d is the order of the polynomial; to eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows;
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
wherein min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values;
③ removing outliers according to formula (5) and calculating the distance δ_i between non-outliers according to formula (6); culling the outliers facilitates the selection of the cluster centre points; in addition, data points with too low a density are few in number and marginal in distribution, and owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (such as special pulsars); this portion of the data is subsequently examined further through the corresponding candidate diagnostic plots;
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
④ generating one two-dimensional decision graph from all data points whose distance δ is greater than the threshold λ, in which the horizontal axis represents the density ρ and the vertical axis the distance δ; density-level micro-cluster groups are merged on the two-dimensional decision graph as follows: if the partition of the ρ axis or the δ axis contains two or more consecutive intervals without data points, this gap is called a vacant region; the vacant region divides all data points into two density regions, the rightmost one being called the maximum-density region and the remainder the low-density region;
(A) in the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
(B) in the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster;
⑤ determining the cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k);
⑥ assigning each data point x_j, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as in formula (7); the RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space;
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
wherein η denotes the kernel function width; the mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i';
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
⑦ calculating the sum of squared errors SSE of all objects in the data set:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
The algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the method of grouping the data set with the sliding-window strategy in step (2) is as follows: to screen candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept; first a window size is defined (Batchsize = 1160) and the data set to be detected is divided equally into L batches (if the last batch is not full, it may be padded with data from batch 1); the sliding-window size is set to w = 2, the 1st round starts from batches 1 and 2, and in each round the window advances by one position to point to the corresponding data batches; the last round points to the combination of the last batch and batch 1, so L rounds of division are executed in total; a group of 1600 relatively complete pulsar candidate feature samples of various types is selected from real samples as a reference set and added in each round to the data covered by the sliding window to form one data block to be detected, so the data set is divided into L parallel data blocks to be detected; clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label, so a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated; the distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step ③ of the cluster analysis may lead to the discovery of new phenomena.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the method of parallelizing the data blocks on the MapReduce/Spark computing model in step (3) to perform the clustering is as follows: for large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model; on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced; the Sun-Ni theorem introduces a function G(p) to represent the increase of workload when storage capacity is limited, and states that, provided the time limit specified by the fixed-time speed-up is met and sufficient memory is available, the memory space can be used effectively by scaling the problem; the data are divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method described above and then processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster centres Clusters(i) (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres); finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering; the distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round; after clustering ends, the pulsar clusters and the abnormal noise points are extracted; Spark is a general computing engine for large-scale data processing whose computing process is similar to MapReduce.
Compared with the prior art, the invention has obvious advantages and beneficial effects. According to the above technical scheme, first a K-nearest-neighbour polynomial kernel function is used to calculate the density of the data points and the interference of outliers with too small a density is eliminated; second, density peaks and density levels are combined to partition the multi-density cluster levels and determine the initial cluster centres; third, data-point assignment and cluster-centre optimization are performed with k-means iterations based on the Gaussian radial basis function (RBF) kernel distance. Through the sliding-window data-partition strategy and the MapReduce/Spark parallel design, the running time of the scheme is greatly reduced while the clustering effect on the candidates is preserved. Experiments on the Parkes High Time Resolution Universe Survey (HTRU2) data set show that, compared with other common machine-learning classification methods, the proposed scheme achieves superior Precision and Recall of 0.946 and 0.905 respectively. According to the Sun-Ni theorem, when there are enough parallel execution nodes and the communication cost is negligible, the total running time of the algorithm is theoretically reduced significantly. Owing to the similarity-based character of clustering, the method can, while improving candidate-screening efficiency, cluster more reference classes and thus promote the discovery of new phenomena (such as special pulsar signals).
Drawings
FIG. 1a is the ρ-axis partition diagram of the density-level clustering two-dimensional decision graph;
FIG. 1b is the δ-axis partition diagram of the density-level clustering two-dimensional decision graph;
FIG. 2 is the sliding-window-based data division diagram;
FIG. 3 is the MapReduce flow chart;
FIG. 4 is the comparison of average running times.
Detailed Description
Embodiment:
Referring to FIGS. 1 to 3, the parallel hybrid clustering method for candidate signal mining in pulsar search according to the present invention comprises the following steps:
1. Hybrid cluster analysis
(1) Data preprocessing is carried out: the pulsar candidate data produced in the PRESTO (PulsaR Exploration and Search TOolkit) based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), giving a new feature-space input data set with feature vectors of dimension b. Alternative candidate physical feature values include pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum, coherent power, and the like.
(2) Calculate the Mahalanobis distance between data points i and j according to equation (1) as
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
Where S is the covariance matrix of the multi-dimensional random variable. Then the local polynomial kernel density of each data point is calculated over its K nearest neighbours according to formula (2); the polynomial kernel function has a global character, so its generalization performance is strong.
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
Where c is the offset coefficient and d is the order of the polynomial. To eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows.
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
Where min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values.
(3) Outliers are removed according to formula (5), and the distance δ_i between non-outliers is calculated according to formula (6); culling the outliers facilitates the selection of the cluster centre points. In addition, data points with too low a density are few in number and marginal in distribution; owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (e.g., special pulsars). This portion of the data is subsequently examined further through the corresponding candidate diagnostic plots.
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
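A minimal NumPy sketch of steps (2) and (3) follows. The dot-product form of the K-nearest-neighbour polynomial kernel density and the density-peak form of δ_i are assumptions consistent with the description, since formulas (2) and (6) are reproduced only as images in the original; all function names are illustrative.

import numpy as np

def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances d_ij, formula (1)."""
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))           # inverse covariance of the features
    diff = X[:, None, :] - X[None, :, :]
    q = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)
    return np.sqrt(np.maximum(q, 0.0))                        # clip tiny negative round-off before the sqrt

def knn_poly_density(X, d_ij, K=10, c=1.0, d=2):
    """Local polynomial-kernel density over the K nearest neighbours (assumed form of formula (2))."""
    rho = np.empty(len(X))
    for i in range(len(X)):
        nn = np.argsort(d_ij[i])[1:K + 1]                     # K nearest neighbours, skipping the point itself
        rho[i] = np.sum((X[nn] @ X[i] + c) ** d)              # polynomial kernel with offset c and order d
    return rho

def minmax(v):
    """Min-max (dispersion) normalisation, formulas (3)-(4)."""
    return (v - v.min()) / (v.max() - v.min())

def delta_distances(rho, d_ij):
    """delta_i: distance to the nearest point of higher density (assumed form of formula (6))."""
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d_ij[i, higher].min() if len(higher) else d_ij[i].max()
    return delta

# usage sketch: D = mahalanobis_matrix(X); rho = minmax(knn_poly_density(X, D))
# inliers = rho > 0.01; delta = delta_distances(rho[inliers], D[np.ix_(inliers, inliers)])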
(4) All data points whose distance δ is greater than the threshold λ generate a two-dimensional decision graph. For example, the two-dimensional decision graph of a randomly generated data set is shown in FIG. 1, where the horizontal axis represents the density ρ and the vertical axis the distance δ.
It is assumed that the ρ axis and the δ axis of the two-dimensional decision-graph example are divided into intervals of sizes θ and γ respectively. FIG. 1 shows the ρ partition (left) and the δ partition (right) of the randomly generated data set.
if two or more regions without data points exist in the rho-axis or delta-axis divided region, the void region is called as a vacant region. In fig. 1(a) and 1(b), the empty region divides all data points into two density regions, the rightmost density region being referred to as the maximum density region, and the remainder being low density regions.
1) In the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
2) In the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters need to be merged into one large cluster, as sketched below.
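One possible reading of the vacant-region and δ-interval rules above is sketched below; the interval sizes θ and γ, the representative-point threshold λ and the way merged micro-clusters are reduced to a single centre are assumptions, not a definitive reconstruction of the patented procedure.

import numpy as np

def split_density_regions(rho, delta, lam=0.1, theta=0.1):
    """Split representative points (delta > lam) into the maximum-density and low-density regions
    by looking for a 'vacant region' of empty rho intervals of width theta (assumed reading)."""
    rep = np.where(delta > lam)[0]
    bins = np.floor(rho[rep] / theta).astype(int)
    occupied = np.unique(bins)
    gaps = [(a, b) for a, b in zip(occupied[:-1], occupied[1:]) if b - a > 2]   # >= 2 empty intervals in between
    if not gaps:
        return rep, rep[:0]                                   # no vacant region: a single density region
    cut = gaps[-1][1]                                         # first occupied interval right of the last gap
    return rep[bins >= cut], rep[bins < cut]                  # (maximum-density region, low-density region)

def pick_centres(max_region, low_region, delta, gamma=0.1):
    """Select initial cluster centres following the merging rules read from the decision graph."""
    centres = []
    if len(max_region):
        dbins = np.floor(delta[max_region] / gamma).astype(int)
        if len(np.unique(dbins)) == 1:
            centres.extend(int(i) for i in max_region)        # same delta interval: each point is a centre
        else:
            centres.append(int(max_region[np.argmax(delta[max_region])]))   # merge into one large cluster
    if len(low_region):
        centres.append(int(low_region[np.argmax(delta[low_region])]))       # low-density micro-clusters merged into one
    return centres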
(5) The cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k) are determined.
(6) Each data point x_j is assigned, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as shown in formula (7). The RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space.
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
Where η denotes the kernel function width. The mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i'.
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
(7) The sum of the squared errors SSE for all objects of the data set is calculated:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
until the SSE value no longer changes, the algorithm stops, otherwise step (6) is returned to.
2. Sliding window based data set partitioning strategy
In order to define a more comprehensive pulsar identification range and screen the candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept. As shown in FIG. 2, the window size is defined as Batchsize = 1160; a relatively complete set of 1600 pulsar candidate feature samples of various types is selected from real samples as a reference set, and in each round it is added to the data covered by the sliding window (of size w = 2) to form a data block to be detected. Clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label; therefore a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated. The distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step (3) of the hybrid cluster analysis may lead to the discovery of new astronomical phenomena.
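The partition described above might be implemented as in the sketch below; data stands for the unlabelled candidates and pulsar_samples for the 1600 labelled pulsar feature vectors, both illustrative names, and the wrap-around padding of the last batch follows the textual description.

import numpy as np

def sliding_window_blocks(data, pulsar_samples, batch_size=1160, w=2):
    """Split the data to be detected into L equal batches and build L overlapping blocks of w batches each,
    prepending the reference pulsar sample set to every block; the last window wraps around to batch 1."""
    n_batches = int(np.ceil(len(data) / batch_size))
    batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
    if len(batches[-1]) < batch_size:                         # pad an incomplete last batch with data from batch 1
        pad = batch_size - len(batches[-1])
        batches[-1] = np.vstack([batches[-1], batches[0][:pad]])
    blocks = []
    for r in range(n_batches):                                # round r covers batches r, r+1, ... (mod L)
        window = [batches[(r + j) % n_batches] for j in range(w)]
        blocks.append(np.vstack([pulsar_samples] + window))
    return blocks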
3. Parallelization design based on MapReduce/Spark model
For large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model: on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced. As shown in FIG. 3, the data are first divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method and then processed in parallel. Next, the Map1 and Reduce1 functions complete the density calculation of the data points in each Block(i) (1 ≤ i ≤ L) and the selection of the initial cluster centres (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres). Finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering. The distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round. After clustering ends, the pulsar clusters and the abnormal noise points are extracted.
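A minimal PySpark sketch of the parallel step is shown below; hybrid_cluster_block stands for the per-block routine sketched in the previous sections and is an illustrative placeholder, and the two Map/Reduce rounds of the description are collapsed here into a single map over the blocks for brevity.

from pyspark import SparkContext

def hybrid_cluster_block(block):
    """Placeholder for the per-block pipeline: density calculation, initial-centre selection,
    RBF-kernel k-means, then extraction of pulsar clusters and abnormal noise points."""
    # ... run the routines sketched above on `block` and return their result ...
    return block.shape                                        # stand-in result so the sketch runs

if __name__ == "__main__":
    sc = SparkContext(appName="parallel-hybrid-clustering")
    # blocks = sliding_window_blocks(data, pulsar_samples)    # L data blocks built as sketched earlier
    blocks = []                                               # illustrative placeholder
    rdd = sc.parallelize(blocks, numSlices=max(len(blocks), 1))   # one partition per Block(i)
    results = rdd.map(hybrid_cluster_block).collect()         # each block is clustered independently
    sc.stop()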
Experimental example:
The hardware environment is as follows: a Linux cluster with 4 physical compute nodes, comprising 2 Intel Core i7-9700K @ 3.6 GHz CPUs, 1 Intel Core i7-1065G7 @ 1.5 GHz CPU and 1 Intel Core i5-9300H @ 2.4 GHz CPU, 32 CPU cores in total (68 GB of RAM and 3 TB of disk space in total). The software environment is as follows: the Anaconda3-4.2.0, Hadoop-2.7.6 and Spark-2.3.1-bin-Hadoop2.6 frameworks under the CentOS 7 system.
1. Data partitioning
The open data set HTRU2 is adopted; its features were obtained with the feature extraction method of (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016). With the sliding-window size set to Batchsize = 1160, 1600 of the known pulsar samples are randomly selected as the pulsar sample set s, and the remaining 39 pulsar samples are randomly mixed into the non-pulsar samples to form the data set to be detected. According to the data-partition strategy described above, the data set to be detected is divided equally by Batchsize into (t1, t2, ..., t14), so the experimental data are divided into {Block(1): [s, t1, t2], Block(2): [s, t2, t3], ..., Block(13): [s, t13, t14], Block(14): [s, t14, t1]}, 14 data blocks in total. Each Block(i) is clustered, and when clustering ends the clusters whose pulsar-sample occupancy is not less than 50% are selected into the pulsar candidate list.
2. Evaluation index
Candidate classification is usually evaluated with 4 indexes: Accuracy, Precision, Recall and F1-score.
Accuracy can roughly reflect the overall correctness of the judgement but cannot objectively reflect classification performance when the data are unbalanced. Precision is the ratio of true positive samples to all samples judged positive, and Recall is the ratio of correctly identified positive samples to all positive samples. Since the Precision and Recall of the clusters often conflict with each other, F1-Score is chosen to weigh the two indexes together. Table 1 shows the confusion matrix for classification.
TABLE 1 Confusion matrix
[Table 1 is given only as an image in the original]
The evaluation indexes of the experiment are the overall Precision, Recall and F1-Score, defined as follows:
Precision = TP / (TP + FP)  (10)
Recall_O = TP / (TP + FN)  (11)
F1-Score = 2 × Precision × Recall_O / (Precision + Recall_O)  (12)
[formula (13), the overall recall Recall_total computed from the union UTP of the pulsars identified in all L data blocks, is given only as an image in the original]
wherein L denotes the number of data blocks, UTP = TP1 ∪ TP2 ∪ ... ∪ TPL denotes the union of the pulsars identified within the individual data blocks, Recall_O denotes the recall of a single data block, and Recall_total denotes the overall recall over all data blocks.
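The block-wise evaluation can be sketched as follows; how the union UTP is normalised into Recall_total (here by the number of pulsar samples hidden in the data to be detected) is an assumption based on the description, and all variable names are illustrative.

def block_metrics(tp_sets, fp_counts, fn_counts, hidden_pulsars):
    """Evaluate the L data blocks: block-wise Recall_O, overall Precision / F1 and Recall_total.

    tp_sets        : list of sets with the IDs of the pulsars correctly found in each block
    fp_counts      : list with the number of false positives of each block
    fn_counts      : list with the number of false negatives of each block
    hidden_pulsars : set with the IDs of all pulsar samples hidden in the data to be detected
    """
    tp_counts = [len(s) for s in tp_sets]
    recall_o = [tp / (tp + fn) for tp, fn in zip(tp_counts, fn_counts)]   # recall of each single block
    tp, fp, fn = sum(tp_counts), sum(fp_counts), sum(fn_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    utp = set().union(*tp_sets)                               # UTP = TP1 ∪ TP2 ∪ ... ∪ TPL
    recall_total = len(utp) / len(hidden_pulsars)             # assumed normalisation of Recall_total
    return precision, recall_o, f1, recall_total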
3. Parameter setting
The parameters involved in the experiment include the K-nearest-neighbour parameter K used for the density calculation, the density threshold ρ_threshold, the polynomial kernel parameters c and d, the RBF kernel parameter η, the threshold λ for screening small clusters, the θ value for the density-region partition and the γ value for the distance-region partition. The specific settings are as follows.
TABLE 2 parameter settings
[Table 2 is given only as an image in the original]
4. Clustering result analysis
Table 3 compares the performance of different supervised and unsupervised learning algorithms on the HTRU2 data set. Among the unsupervised algorithms, the parallel hybrid clustering algorithm has the highest Recall value, namely 90.5%. Compared with the supervised learning algorithms, the Recall of the proposed algorithm is lower only than that of
GMO-SNN (Pulsar candidate selection based on self-normalized neural networks, Physics, 2020), and its F1-Score is lower than those of GMO-SNN, Random Forest (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and the KNN algorithm, but higher than those of SVM and PNCN (Pulsar candidate selection using a pseudo-nearest-centroid-neighbour classifier, Monthly Notices of the Royal Astronomical Society, 2016). In addition, in repeated control experiments in which 39 pulsars were randomly selected to form the data set to be detected, the largest number of pulsars detected by the algorithm in a single run reached 36, with an average of 34. Owing to the advantages of unsupervised learning and the fast convergence of the hybrid clustering, the method is suitable for fast classification mining of large-scale pulsar data. The experimental results show that the proposed hybrid-clustering scheme is feasible and effective. In an actual pulsar search scenario, the clustering effect will improve further as the relevant parameters, the pulsar sample set and the data-partition strategy are optimized.
TABLE 3 Effect of different methods on HTRU2 data set
[Table 3 is given only as an image in the original]
5. Temporal complexity analysis
Let n be the number of samples of the experimental data set; the time complexities of the other algorithms (k-means++, McDPC, PNCN) are listed in Table 4. The time complexity of k-means++ is O(nkTM); since k, T and M are generally regarded as constants it simplifies to O(n). For McDPC, the computation of ρ and δ has complexity O(n²) and the clustering over the different density levels is also O(n²), so the complexity of the whole algorithm is O(n²). The complexity of PNCN is taken from its worst-case computation O(2nMK + FMK²/2), with F and M set to constant values. The serial time complexity of the hybrid clustering algorithm is O(n² + nkTM); since k, T and M are constants, this reduces to O(n²). On the parallel computing platform, according to the Sun-Ni theorem, the complexity becomes O((G(P)·m)²), where G(P) is a factor, m is the number of samples of Block(i) and m ≪ n; when the number of parallel nodes P is sufficient (the value of P approaches a certain threshold related to the number L of partitioned data blocks) and the communication overhead is negligible, G(P) → 1, i.e. the complexity approaches O(m²), which is slightly inferior to k-means++ and PNCN but superior to McDPC. This shows that the proposed scheme significantly reduces the running time while improving the clustering effect.
TABLE 4 Algorithm complexity
[Table 4 is given only as an image in the original]
a: T is the number of iterations, M is the number of features, F is the number of classes, m is the number of samples of Block(i), and k is the number of cluster centres.
Both the experimental analysis and the time-complexity analysis show that the proposed scheme is feasible and effective, and the performance indexes will improve further as the data grouping and the relevant parameters are optimized in actual scenarios. The unsupervised clustering method is better suited to the classification of large unlabelled data sets and to the situation in which the ratio of pulsar to non-pulsar sample data is extremely unbalanced.
6. Actual run time
FIG. 4 compares the average running time of the proposed method (parallel and serial) with those of McDPC, k-means++ and KNN under the same experimental setup. As can be seen from the figure, the average running time of the serial hybrid clustering is the longest, whereas that of the parallel hybrid clustering (23.07 s) is very short compared with the other methods. We therefore conclude that the proposed parallel scheme significantly reduces execution time while guaranteeing the classification performance.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; any simple modification, equivalent change or adaptation made to the above embodiment according to the technical essence of the present invention, without departing from that technical essence, still falls within the scope of the present invention.

Claims (4)

1. A parallel hybrid clustering method for candidate signal mining in pulsar search, comprising the following steps:
(1) clustering analysis of pulsar candidate signals:
calculating the density of the data points with a K-nearest-neighbour polynomial kernel function, screening out samples whose density value is smaller than the threshold 0.01, further judging through the candidate diagnostic plots whether these samples are noise or new astronomical phenomena, and eliminating the interference of outliers with too small a density;
combining the characteristics of the density-peak and hierarchical clustering processes, partitioning the multi-density cluster levels in the data set, merging micro-cluster groups with similar density and adjacent distance in the same region, and determining the initial cluster centre points;
assigning all data points and optimizing the cluster centres with k-means iterations based on the Gaussian radial basis (RBF) kernel distance, the similarity between sample data points being calculated with the kernel function so that the distance measure is mapped to a high-dimensional space;
(2) grouping the data set with a sliding-window strategy, dividing it by a specific window value Batchsize = 1160 and setting the sliding-window size w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference sample set, adding it to the data to be detected in each sliding window to form one data block, and thereby dividing the data set into several parallel data blocks of equal size;
(3) parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering.
2. A parallel hybrid clustering method for candidate signal mining in pulsar searching as claimed in claim 1, wherein said cluster analysis method in step (1) is:
firstly, data preprocessing is carried out: the pulsar candidate data of the PRESTO-based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method and principal component analysis (PCA) to obtain a new feature-space input data set with feature vectors of dimension b; the selectable candidate physical feature values comprise pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum and coherent power;
② calculating the Mahalanobis distance between the data points i and j according to the formula (1)
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
wherein S is the covariance matrix of the multi-dimensional random variable; then the local polynomial kernel density of each data point is calculated over its K nearest neighbours according to formula (2); the polynomial kernel function has a global character, so its generalization performance is strong;
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
wherein c is the offset coefficient and d is the order of the polynomial; to eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows;
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
wherein min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values;
③ removing outliers according to formula (5) and calculating the distance δ_i between non-outliers according to formula (6); culling the outliers facilitates the selection of the cluster centre points; in addition, data points with too low a density are few in number and marginal in distribution, and owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (such as special pulsars); this portion of the data is subsequently examined further through the corresponding candidate diagnostic plots;
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
④ generating one two-dimensional decision graph from all data points whose distance δ is greater than the threshold λ, in which the horizontal axis represents the density ρ and the vertical axis the distance δ; density-level micro-cluster groups are merged on the two-dimensional decision graph as follows: if the partition of the ρ axis or the δ axis contains two or more consecutive intervals without data points, this gap is called a vacant region; the vacant region divides all data points into two density regions, the rightmost one being called the maximum-density region and the remainder the low-density region;
(A) in the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
(B) in the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster;
⑤ determining the cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k);
⑥ assigning each data point x_j, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as shown in formula (7); the RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space;
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
wherein η denotes the kernel function width; the mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i';
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
⑦ calculating the sum of squared errors SSE of all objects in the data set:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
the algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.
3. The parallel hybrid clustering method for candidate signal mining in pulsar search as claimed in claim 2, wherein the method of grouping the data set with the sliding-window strategy in step (2) is as follows: to screen candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept; first a window size is defined (Batchsize = 1160) and the data set to be detected is divided equally into L batches (if the last batch is not full, it may be padded with data from batch 1); the sliding-window size is set to w = 2, the 1st round starts from batches 1 and 2, and in each round the window advances by one position to point to the corresponding data batches; the last round points to the combination of the last batch and batch 1, so L rounds of division are executed in total; a group of 1600 relatively complete pulsar candidate feature samples of various types is selected from real samples as a reference set and added in each round to the data covered by the sliding window to form one data block to be detected, so the data set is divided into L parallel data blocks to be detected; clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label, so a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated; the distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step ③ of the cluster analysis may lead to the discovery of new phenomena.
4. The parallel hybrid clustering method for candidate signal mining in pulsar search according to claim 1 or 2, wherein the method of parallelizing the data blocks on the MapReduce/Spark computing model in step (3) to perform the clustering is as follows: for large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model; on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced; the Sun-Ni theorem introduces a function G(p) to represent the increase of workload when storage capacity is limited, and states that, provided the time limit specified by the fixed-time speed-up is met and sufficient memory is available, the memory space can be used effectively by scaling the problem; the data are divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method described above and then processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster centres Clusters(i) (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres); finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering; the distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round; after clustering ends, the pulsar clusters and the abnormal noise points are extracted; Spark is a general computing engine for large-scale data processing whose computing process is similar to MapReduce.
CN202210036692.XA 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search Active CN114386466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210036692.XA CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210036692.XA CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Publications (2)

Publication Number Publication Date
CN114386466A true CN114386466A (en) 2022-04-22
CN114386466B CN114386466B (en) 2024-04-05

Family

ID=81201874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210036692.XA Active CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Country Status (1)

Country Link
CN (1) CN114386466B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170356976A1 (en) * 2016-06-10 2017-12-14 Board Of Trustees Of Michigan State University System and method for quantifying cell numbers in magnetic resonance imaging (mri)
CN113344019A (en) * 2021-01-20 2021-09-03 昆明理工大学 K-means algorithm for improving decision value selection initial clustering center

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王元超; 郑建华; 潘之辰; 李明涛: "Review of pulsar candidate classification methods" (脉冲星候选样本分类方法综述), Journal of Deep Space Exploration (深空探测学报), no. 03, 15 June 2018 (2018-06-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611025A (en) * 2023-05-19 2023-08-18 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116611025B (en) * 2023-05-19 2024-01-26 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116360956A (en) * 2023-06-02 2023-06-30 济南大陆机电股份有限公司 Data intelligent processing method and system for big data task scheduling
CN116360956B (en) * 2023-06-02 2023-08-08 济南大陆机电股份有限公司 Data intelligent processing method and system for big data task scheduling

Also Published As

Publication number Publication date
CN114386466B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Jimenez et al. Classification of hyperdimensional data based on feature and decision fusion approaches using projection pursuit, majority voting, and neural networks
US10902285B2 (en) Learning method and apparatus for pattern recognition
CN110990461A (en) Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN114386466B (en) Parallel hybrid clustering method for candidate signal mining in pulsar search
CN111178611B (en) Method for predicting daily electric quantity
CN110619231B (en) Differential discernability k prototype clustering method based on MapReduce
CN106845536B (en) Parallel clustering method based on image scaling
CN107301328B (en) Cancer subtype accurate discovery and evolution analysis method based on data flow clustering
CN103473786A (en) Gray level image segmentation method based on multi-objective fuzzy clustering
CN104346481A (en) Community detection method based on dynamic synchronous model
CN106570104B (en) Multi-partition clustering preprocessing method for stream data
CN112735536A (en) Single cell integrated clustering method based on subspace randomization
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
CN113052225A (en) Alarm convergence method and device based on clustering algorithm and time sequence association rule
CN113326862A (en) Audit big data fusion clustering and risk data detection method, medium and equipment
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN110659686A (en) Fuzzy coarse grain outlier detection method for mixed attribute data
CN113436223A (en) Point cloud data segmentation method and device, computer equipment and storage medium
CN112819208A (en) Spatial similarity geological disaster prediction method based on feature subset coupling model
Elhoussaine A fuzzy neighborhood rough set method for anomaly detection in large scale data
CN116484244A (en) Automatic driving accident occurrence mechanism analysis method based on clustering model
Zhou et al. Pre-clustering active learning method for automatic classification of building structures in urban areas
CN114792397A (en) SAR image urban road extraction method, system and storage medium
CN113205124A (en) Clustering method, system and storage medium under high-dimensional real scene based on density peak value
CN114185956A (en) Data mining method based on canty and k-means algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant