CN114386466A - Parallel hybrid clustering method for candidate signal mining in pulsar search - Google Patents

Parallel hybrid clustering method for candidate signal mining in pulsar search

Info

Publication number
CN114386466A
Authority
CN
China
Prior art keywords
data
pulsar
cluster
clustering
density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210036692.XA
Other languages
Chinese (zh)
Other versions
CN114386466B (en)
Inventor
游子毅
刘莹
马智
李思瑶
王培�
童超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Education University
Original Assignee
Guizhou Education University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Education University filed Critical Guizhou Education University
Priority to CN202210036692.XA priority Critical patent/CN114386466B/en
Publication of CN114386466A publication Critical patent/CN114386466A/en
Application granted granted Critical
Publication of CN114386466B publication Critical patent/CN114386466B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar search, which comprises the following steps: performing cluster analysis of the pulsar candidate signals; grouping the data set with a sliding-window strategy, dividing it by a window value Batchsize = 1160 with the sliding-window size set to w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference set and adding them to the data to be detected in each sliding window to form one data block, so that the data set is divided into several parallel data blocks of equal size; and parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering. The invention improves clustering performance, raises the screening recall rate and reduces execution time.

Description

Parallel hybrid clustering method for candidate signal mining in pulsar search
Technical Field
The invention belongs to the technical field of astronomy, and particularly relates to a parallel hybrid clustering method for candidate signal mining in pulsar search.
Background
Discoveries in the pulsar field have strongly promoted the development of related fields such as astronomy, physics and navigation. With the completion of the Five-hundred-meter Aperture Spherical radio Telescope (FAST) and the commissioning of its 19-beam receiver for sky surveys, high sensitivity and wide sky coverage have greatly enlarged the pulsar signal search range, but the observation data have also increased enormously; how to effectively screen pulsar candidates from massive data has therefore become the key to pulsar search;
the work required to be completed in the basic pulsar search is to search a stable periodic pulse signal in a two-dimensional space consisting of P (period) -DM (dispersion amount); at present, conventional methods aided by graphical tools or based on statistics have not been able to meet the needs of such huge data volume processing; the artificial intelligence technology is applied to the screening of pulsar candidates and is mainly divided into three categories according to the principle of the method; the first type is a candidate ordering algorithm based on empirical formulas; such algorithms rely on assumptions such as signal-to-noise ratio, pulse profile shape, etc., and many in practice do not fit well and may result in some specially shaped pulses, such as broad pulses, biased DM curves, or low flow rate pulsar being missed; the second type is a neural network image recognition model which directly utilizes the candidate body diagnostic graph to automatically extract features; compared with the traditional machine learning method, the generalization of the algorithm is better, but each sub-graph of training data needs to be marked manually, and the sample training requirement is larger, so that a large amount of extra labor is invested; the third category is machine learning based classification algorithms; feature selection screened by human experience is the key to influence the binary classification result of pulsar screening, and an incomplete feature design scheme can weaken the performance of a model, so that the feature design problem is particularly critical; in addition, some hybrid models with multi-method integration also achieve remarkable effects;
in actual large-scale pulsar data computation and search, most input data sets are unlabelled, and the ratio of pulsar to non-pulsar samples is extremely unbalanced, so the time cost and workload of identifying pulsar candidates with supervised-learning classification methods are considerable;
the experimental data set HTRU2 comes from multi-beam (13-beam) observations of the Australian Parkes telescope, with the DM range of the pulsar-signal search pipeline set to 0 to 2000 cm^-3 pc; it describes pulsar candidate sample data collected during the High Time Resolution Universe survey and processed with the PRESTO (PulsaR Exploration and Search TOolkit) software. PRESTO, a pulsar search and analysis suite developed by the US National Radio Astronomy Observatory (NRAO), is now used in many sky surveys to process short-integration-time data and X-ray data. The HTRU2 data set contains 17898 samples in total: 16259 spurious instances caused by RFI or noise and 1639 real pulsar instances. The feature values comprise 8 attributes: mean of the pulse profile, standard deviation of the pulse profile, excess kurtosis of the pulse profile, skewness of the pulse profile, mean of the DM-S/N curve, standard deviation of the DM-S/N curve, excess kurtosis of the DM-S/N curve and skewness of the DM-S/N curve. HTRU2 is an open and relatively sample-rich data set with high acceptance and is therefore widely used to evaluate the performance of pulsar candidate classification algorithms;
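For orientation, a minimal sketch of loading such a feature table is given below; the file name HTRU_2.csv, the headerless CSV layout and the column names are assumptions based on the publicly distributed HTRU2 release and are not taken from this patent.

import pandas as pd

# Assumed layout of the public HTRU2 release: 8 statistical features plus a 0/1 class label.
FEATURES = [
    "profile_mean", "profile_std", "profile_kurtosis", "profile_skewness",
    "dm_snr_mean", "dm_snr_std", "dm_snr_kurtosis", "dm_snr_skewness",
]

def load_htru2(path="HTRU_2.csv"):
    """Load the HTRU2 candidate table (headerless CSV assumed) into a feature matrix and label vector."""
    df = pd.read_csv(path, header=None, names=FEATURES + ["label"])
    X = df[FEATURES].to_numpy(dtype=float)   # 17898 x 8 feature matrix
    y = df["label"].to_numpy(dtype=int)      # 1 = pulsar, 0 = RFI / noise
    return X, y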
clustering is one of key methods for processing large-scale data mining problems, and comprises clustering algorithms based on division, density, grids and the like; the k-means is widely applied as a clustering algorithm based on division, but the original k-means has the defects that the clustering effect depends on the selection of an initial central point, the numerical data can only be responded, the interference of abnormal values is large, and the like; therefore, many scholars have been improving on this algorithm. (Privacy-monitoring Mechanisms for K-models Clustering, Computers and Security, 2018) proposes a K-MODES algorithm for solving the defect that K-means can only deal with numerical data; a density-Based Clustering method, such as a typical DBSCAN (sensitivity Based Spatial Clustering of Applications with noise) algorithm, can find clusters of any shape, but has large Clustering samples, long convergence time and poor Clustering effect on the condition of non-uniform cluster density; (Clustering by fast search and find of diversity peaks, Science 2014) proposes a fast search Clustering algorithm based on density peaks, which has the main idea that the density of cluster centers should be greater than that of surrounding neighbors, and the distances between different cluster centers are relatively far; to overcome this drawback, (McDPC: multi-center dense peak clustering, Neural Computing and Applications, 2020) further proposes a multi-center clustering method based on density hierarchical partitioning; the hierarchy-based clustering does not need to specify the number of clusters in advance and can discover the hierarchical relationship of the clusters, but the calculation complexity is too high.
Disclosure of Invention
The invention aims to overcome the above defects and to provide a parallel hybrid clustering method for candidate signal mining in pulsar search that improves clustering performance, raises the screening recall rate and reduces execution time.
The purpose of the invention and the solution of its main technical problems are achieved by the following technical scheme:
the invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar search, which comprises the following steps of:
(1) clustering analysis of pulsar candidate signals:
calculating the density of the data points with a K-nearest-neighbour polynomial kernel function, screening out samples whose density value is smaller than the threshold 0.01, further judging through the candidate diagnostic plots whether these samples are noise or new astronomical phenomena, and eliminating the interference of outliers with too small a density;
combining the characteristics of the density-peak and hierarchical clustering processes, partitioning the multi-density cluster levels in the data set, merging micro-cluster groups with similar density and adjacent distance in the same region, and determining the initial cluster centre points;
assigning all data points and optimizing the cluster centres with k-means iterations based on the Gaussian radial basis function (RBF) kernel distance, the similarity between sample data points being calculated with the RBF kernel function so that the distance measure is mapped to a high-dimensional space;
(2) grouping the data set with a sliding-window strategy, dividing it by a specific window value Batchsize = 1160 and setting the sliding-window size w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference sample set, adding it to the data to be detected in each sliding window to form one data block, and thereby dividing the data set into several parallel data blocks of equal size;
(3) parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the cluster analysis method of step (1) is:
firstly, data preprocessing is carried out: the pulsar candidate data produced in the PRESTO (PulsaR Exploration and Search TOolkit) based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), yielding a new feature-space input data set with feature vectors of dimension b; the selectable candidate physical feature values comprise pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum and coherent power;
② calculating the Mahalanobis distance between the data points i and j according to the formula (1)
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
Wherein S is a covariance matrix of the multi-dimensional random variables; then, calculating the local Polynomial core density of each data point based on K neighbor and the global property of the Polynomial core function according to the formula (2), so that the generalization performance of the Polynomial core function is strong;
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
wherein c is the offset coefficient and d is the order of the polynomial; to eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows;
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
wherein min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values;
③ removing outliers according to formula (5) and calculating the distance δ_i between non-outliers according to formula (6); culling the outliers facilitates the selection of the cluster centre points; in addition, data points with too low a density are few in number and marginal in distribution, and owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (such as special pulsars); this portion of the data is subsequently examined further through the corresponding candidate diagnostic plots;
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
④ generating one two-dimensional decision graph from all data points whose distance δ is greater than the threshold λ, in which the horizontal axis represents the density ρ and the vertical axis the distance δ; density-level micro-cluster groups are merged on the two-dimensional decision graph as follows: if the partition of the ρ axis or the δ axis contains two or more consecutive intervals without data points, this gap is called a vacant region; the vacant region divides all data points into two density regions, the rightmost one being called the maximum-density region and the remainder the low-density region;
(A) in the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
(B) in the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster;
⑤ determining the cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k);
⑥ assigning each data point x_j, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as in formula (7); the RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space;
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
wherein η denotes the kernel function width; the mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i';
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
⑦ calculating the sum of squared errors SSE of all objects in the data set:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
The algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the method of grouping the data set with the sliding-window strategy in step (2) is as follows: to screen candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept; first a window size is defined (Batchsize = 1160) and the data set to be detected is divided equally into L batches (if the last batch is not full, it may be padded with data from batch 1); the sliding-window size is set to w = 2, the 1st round starts from batches 1 and 2, and in each round the window advances by one position to point to the corresponding data batches; the last round points to the combination of the last batch and batch 1, so L rounds of division are executed in total; a group of 1600 relatively complete pulsar candidate feature samples of various types is selected from real samples as a reference set and added in each round to the data covered by the sliding window to form one data block to be detected, so the data set is divided into L parallel data blocks to be detected; clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label, so a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated; the distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step ③ of the cluster analysis may lead to the discovery of new phenomena.
In the above parallel hybrid clustering method for candidate signal mining in pulsar search, the method of parallelizing the data blocks on the MapReduce/Spark computing model in step (3) to perform the clustering is as follows: for large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model; on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced; the Sun-Ni theorem introduces a function G(p) to represent the increase of workload when storage capacity is limited, and states that, provided the time limit specified by the fixed-time speed-up is met and sufficient memory is available, the memory space can be used effectively by scaling the problem; the data are divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method described above and then processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster centres Clusters(i) (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres); finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering; the distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round; after clustering ends, the pulsar clusters and the abnormal noise points are extracted; Spark is a general computing engine for large-scale data processing whose computing process is similar to MapReduce.
Compared with the prior art, the invention has obvious advantages and beneficial effects. According to the above technical scheme, first a K-nearest-neighbour polynomial kernel function is used to calculate the density of the data points and the interference of outliers with too small a density is eliminated; second, density peaks and density levels are combined to partition the multi-density cluster levels and determine the initial cluster centres; third, data-point assignment and cluster-centre optimization are performed with k-means iterations based on the Gaussian radial basis function (RBF) kernel distance. Through the sliding-window data-partition strategy and the MapReduce/Spark parallel design, the running time of the scheme is greatly reduced while the clustering effect on the candidates is preserved. Experiments on the Parkes High Time Resolution Universe Survey (HTRU2) data set show that, compared with other common machine-learning classification methods, the proposed scheme achieves superior Precision and Recall of 0.946 and 0.905 respectively. According to the Sun-Ni theorem, when there are enough parallel execution nodes and the communication cost is negligible, the total running time of the algorithm is theoretically reduced significantly. Owing to the similarity-based character of clustering, the method can, while improving candidate-screening efficiency, cluster more reference classes and thus promote the discovery of new phenomena (such as special pulsar signals).
Drawings
FIG. 1a is the ρ-axis partition diagram of the density-level clustering two-dimensional decision graph;
FIG. 1b is the δ-axis partition diagram of the density-level clustering two-dimensional decision graph;
FIG. 2 is the sliding-window-based data division diagram;
FIG. 3 is the MapReduce flow chart;
FIG. 4 is the comparison of average running times.
Detailed Description
Embodiment:
Referring to FIGS. 1 to 3, the parallel hybrid clustering method for candidate signal mining in pulsar search according to the present invention comprises the following steps:
1. Hybrid cluster analysis
(1) Data preprocessing is carried out: the pulsar candidate data produced in the PRESTO (PulsaR Exploration and Search TOolkit) based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA), giving a new feature-space input data set with feature vectors of dimension b. Alternative candidate physical feature values include pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum, coherent power, and the like.
(2) Calculate the Mahalanobis distance between data points i and j according to equation (1) as
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
Where S is the covariance matrix of the multi-dimensional random variable. Then the local polynomial kernel density of each data point is calculated over its K nearest neighbours according to formula (2); the polynomial kernel function has a global character, so its generalization performance is strong.
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
Where c is the offset coefficient and d is the order of the polynomial. To eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows.
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
Where min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values.
(3) Outliers are removed according to formula (5), and the distance δ_i between non-outliers is calculated according to formula (6); culling the outliers facilitates the selection of the cluster centre points. In addition, data points with too low a density are few in number and marginal in distribution; owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (e.g., special pulsars). This portion of the data is subsequently examined further through the corresponding candidate diagnostic plots.
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
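A minimal NumPy sketch of steps (2) and (3) follows. The dot-product form of the K-nearest-neighbour polynomial kernel density and the density-peak form of δ_i are assumptions consistent with the description, since formulas (2) and (6) are reproduced only as images in the original; all function names are illustrative.

import numpy as np

def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances d_ij, formula (1)."""
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))           # inverse covariance of the features
    diff = X[:, None, :] - X[None, :, :]
    q = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)
    return np.sqrt(np.maximum(q, 0.0))                        # clip tiny negative round-off before the sqrt

def knn_poly_density(X, d_ij, K=10, c=1.0, d=2):
    """Local polynomial-kernel density over the K nearest neighbours (assumed form of formula (2))."""
    rho = np.empty(len(X))
    for i in range(len(X)):
        nn = np.argsort(d_ij[i])[1:K + 1]                     # K nearest neighbours, skipping the point itself
        rho[i] = np.sum((X[nn] @ X[i] + c) ** d)              # polynomial kernel with offset c and order d
    return rho

def minmax(v):
    """Min-max (dispersion) normalisation, formulas (3)-(4)."""
    return (v - v.min()) / (v.max() - v.min())

def delta_distances(rho, d_ij):
    """delta_i: distance to the nearest point of higher density (assumed form of formula (6))."""
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d_ij[i, higher].min() if len(higher) else d_ij[i].max()
    return delta

# usage sketch: D = mahalanobis_matrix(X); rho = minmax(knn_poly_density(X, D))
# inliers = rho > 0.01; delta = delta_distances(rho[inliers], D[np.ix_(inliers, inliers)])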
(4) All data points whose distance δ is greater than the threshold λ generate a two-dimensional decision graph. For example, the two-dimensional decision graph of a randomly generated data set is shown in FIG. 1, where the horizontal axis represents the density ρ and the vertical axis the distance δ.
It is assumed that the ρ axis and the δ axis of the two-dimensional decision-graph example are divided into intervals of sizes θ and γ respectively. FIG. 1 shows the ρ partition (left) and the δ partition (right) of the randomly generated data set.
if two or more regions without data points exist in the rho-axis or delta-axis divided region, the void region is called as a vacant region. In fig. 1(a) and 1(b), the empty region divides all data points into two density regions, the rightmost density region being referred to as the maximum density region, and the remainder being low density regions.
1) In the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
2) In the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters need to be merged into one large cluster, as sketched below.
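One possible reading of the vacant-region and δ-interval rules above is sketched below; the interval sizes θ and γ, the representative-point threshold λ and the way merged micro-clusters are reduced to a single centre are assumptions, not a definitive reconstruction of the patented procedure.

import numpy as np

def split_density_regions(rho, delta, lam=0.1, theta=0.1):
    """Split representative points (delta > lam) into the maximum-density and low-density regions
    by looking for a 'vacant region' of empty rho intervals of width theta (assumed reading)."""
    rep = np.where(delta > lam)[0]
    bins = np.floor(rho[rep] / theta).astype(int)
    occupied = np.unique(bins)
    gaps = [(a, b) for a, b in zip(occupied[:-1], occupied[1:]) if b - a > 2]   # >= 2 empty intervals in between
    if not gaps:
        return rep, rep[:0]                                   # no vacant region: a single density region
    cut = gaps[-1][1]                                         # first occupied interval right of the last gap
    return rep[bins >= cut], rep[bins < cut]                  # (maximum-density region, low-density region)

def pick_centres(max_region, low_region, delta, gamma=0.1):
    """Select initial cluster centres following the merging rules read from the decision graph."""
    centres = []
    if len(max_region):
        dbins = np.floor(delta[max_region] / gamma).astype(int)
        if len(np.unique(dbins)) == 1:
            centres.extend(int(i) for i in max_region)        # same delta interval: each point is a centre
        else:
            centres.append(int(max_region[np.argmax(delta[max_region])]))   # merge into one large cluster
    if len(low_region):
        centres.append(int(low_region[np.argmax(delta[low_region])]))       # low-density micro-clusters merged into one
    return centres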
(5) The cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k) are determined.
(6) Each data point x_j is assigned, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as shown in formula (7). The RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space.
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
Where η denotes the kernel function width. The mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i'.
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
(7) The sum of the squared errors SSE for all objects of the data set is calculated:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
until the SSE value no longer changes, the algorithm stops, otherwise step (6) is returned to.
2. Sliding window based data set partitioning strategy
In order to define a more comprehensive pulsar identification range and screen the candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept. As shown in FIG. 2, the window size is defined as Batchsize = 1160; a relatively complete set of 1600 pulsar candidate feature samples of various types is selected from real samples as a reference set, and in each round it is added to the data covered by the sliding window (of size w = 2) to form a data block to be detected. Clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label; therefore a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated. The distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step (3) of the hybrid cluster analysis may lead to the discovery of new astronomical phenomena.
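The partition described above might be implemented as in the sketch below; data stands for the unlabelled candidates and pulsar_samples for the 1600 labelled pulsar feature vectors, both illustrative names, and the wrap-around padding of the last batch follows the textual description.

import numpy as np

def sliding_window_blocks(data, pulsar_samples, batch_size=1160, w=2):
    """Split the data to be detected into L equal batches and build L overlapping blocks of w batches each,
    prepending the reference pulsar sample set to every block; the last window wraps around to batch 1."""
    n_batches = int(np.ceil(len(data) / batch_size))
    batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
    if len(batches[-1]) < batch_size:                         # pad an incomplete last batch with data from batch 1
        pad = batch_size - len(batches[-1])
        batches[-1] = np.vstack([batches[-1], batches[0][:pad]])
    blocks = []
    for r in range(n_batches):                                # round r covers batches r, r+1, ... (mod L)
        window = [batches[(r + j) % n_batches] for j in range(w)]
        blocks.append(np.vstack([pulsar_samples] + window))
    return blocks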
3. Parallelization design based on MapReduce/Spark model
For large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model: on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced. As shown in FIG. 3, the data are first divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method and then processed in parallel. Next, the Map1 and Reduce1 functions complete the density calculation of the data points in each Block(i) (1 ≤ i ≤ L) and the selection of the initial cluster centres (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres). Finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering. The distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round. After clustering ends, the pulsar clusters and the abnormal noise points are extracted.
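A minimal PySpark sketch of the parallel step is shown below; hybrid_cluster_block stands for the per-block routine sketched in the previous sections and is an illustrative placeholder, and the two Map/Reduce rounds of the description are collapsed here into a single map over the blocks for brevity.

from pyspark import SparkContext

def hybrid_cluster_block(block):
    """Placeholder for the per-block pipeline: density calculation, initial-centre selection,
    RBF-kernel k-means, then extraction of pulsar clusters and abnormal noise points."""
    # ... run the routines sketched above on `block` and return their result ...
    return block.shape                                        # stand-in result so the sketch runs

if __name__ == "__main__":
    sc = SparkContext(appName="parallel-hybrid-clustering")
    # blocks = sliding_window_blocks(data, pulsar_samples)    # L data blocks built as sketched earlier
    blocks = []                                               # illustrative placeholder
    rdd = sc.parallelize(blocks, numSlices=max(len(blocks), 1))   # one partition per Block(i)
    results = rdd.map(hybrid_cluster_block).collect()         # each block is clustered independently
    sc.stop()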
Experimental example:
The hardware environment is as follows: a Linux cluster with 4 physical compute nodes, comprising 2 Intel Core i7-9700K @ 3.6 GHz CPUs, 1 Intel Core i7-1065G7 @ 1.5 GHz CPU and 1 Intel Core i5-9300H @ 2.4 GHz CPU, 32 CPU cores in total (68 GB of RAM and 3 TB of disk space in total). The software environment is as follows: the Anaconda3-4.2.0, Hadoop-2.7.6 and Spark-2.3.1-bin-Hadoop2.6 frameworks under the CentOS 7 system.
1. Data partitioning
The open data set HTRU2 is adopted; its features were obtained with the feature extraction method of (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016). With the sliding-window size set to Batchsize = 1160, 1600 of the known pulsar samples are randomly selected as the pulsar sample set s, and the remaining 39 pulsar samples are randomly mixed into the non-pulsar samples to form the data set to be detected. According to the data-partition strategy described above, the data set to be detected is divided equally by Batchsize into (t1, t2, ..., t14), so the experimental data are divided into {Block(1): [s, t1, t2], Block(2): [s, t2, t3], ..., Block(13): [s, t13, t14], Block(14): [s, t14, t1]}, 14 data blocks in total. Each Block(i) is clustered, and when clustering ends the clusters whose pulsar-sample occupancy is not less than 50% are selected into the pulsar candidate list.
2. Evaluation index
Candidate classification is usually evaluated with 4 indexes: Accuracy, Precision, Recall and F1-score.
Accuracy can roughly reflect the overall correctness of the judgement but cannot objectively reflect classification performance when the data are unbalanced. Precision is the ratio of true positive samples to all samples judged positive, and Recall is the ratio of correctly identified positive samples to all positive samples. Since the Precision and Recall of the clusters often conflict with each other, F1-Score is chosen to weigh the two indexes together. Table 1 shows the confusion matrix for classification.
TABLE 1 Confusion matrix
[Table 1 is given only as an image in the original]
The evaluation indexes of the experiment are the overall Precision, Recall and F1-Score, defined as follows:
Precision = TP / (TP + FP)  (10)
Recall_O = TP / (TP + FN)  (11)
F1-Score = 2 × Precision × Recall_O / (Precision + Recall_O)  (12)
[formula (13), the overall recall Recall_total computed from the union UTP of the pulsars identified in all L data blocks, is given only as an image in the original]
wherein L denotes the number of data blocks, UTP = TP1 ∪ TP2 ∪ ... ∪ TPL denotes the union of the pulsars identified within the individual data blocks, Recall_O denotes the recall of a single data block, and Recall_total denotes the overall recall over all data blocks.
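The block-wise evaluation can be sketched as follows; how the union UTP is normalised into Recall_total (here by the number of pulsar samples hidden in the data to be detected) is an assumption based on the description, and all variable names are illustrative.

def block_metrics(tp_sets, fp_counts, fn_counts, hidden_pulsars):
    """Evaluate the L data blocks: block-wise Recall_O, overall Precision / F1 and Recall_total.

    tp_sets        : list of sets with the IDs of the pulsars correctly found in each block
    fp_counts      : list with the number of false positives of each block
    fn_counts      : list with the number of false negatives of each block
    hidden_pulsars : set with the IDs of all pulsar samples hidden in the data to be detected
    """
    tp_counts = [len(s) for s in tp_sets]
    recall_o = [tp / (tp + fn) for tp, fn in zip(tp_counts, fn_counts)]   # recall of each single block
    tp, fp, fn = sum(tp_counts), sum(fp_counts), sum(fn_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    utp = set().union(*tp_sets)                               # UTP = TP1 ∪ TP2 ∪ ... ∪ TPL
    recall_total = len(utp) / len(hidden_pulsars)             # assumed normalisation of Recall_total
    return precision, recall_o, f1, recall_total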
3. Parameter setting
The parameters involved in the experiment include the K-nearest-neighbour parameter K used for the density calculation, the density threshold ρ_threshold, the polynomial kernel parameters c and d, the RBF kernel parameter η, the threshold λ for screening small clusters, the θ value for the density-region partition and the γ value for the distance-region partition. The specific settings are as follows.
TABLE 2 parameter settings
[Table 2 is given only as an image in the original]
4. Clustering result analysis
Table 3 compares the performance of different supervised and unsupervised learning algorithms on the HTRU2 data set. Among the unsupervised algorithms, the parallel hybrid clustering algorithm has the highest Recall value, namely 90.5%. Compared with the supervised learning algorithms, the Recall of the proposed algorithm is lower only than that of
GMO-SNN (Pulsar candidate selection based on self-normalized neural networks, Physics, 2020), and its F1-Score is lower than those of GMO-SNN, Random Forest (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and the KNN algorithm, but higher than those of SVM and PNCN (Pulsar candidate selection using a pseudo-nearest-centroid-neighbour classifier, Monthly Notices of the Royal Astronomical Society, 2016). In addition, in repeated control experiments in which 39 pulsars were randomly selected to form the data set to be detected, the largest number of pulsars detected by the algorithm in a single run reached 36, with an average of 34. Owing to the advantages of unsupervised learning and the fast convergence of the hybrid clustering, the method is suitable for fast classification mining of large-scale pulsar data. The experimental results show that the proposed hybrid-clustering scheme is feasible and effective. In an actual pulsar search scenario, the clustering effect will improve further as the relevant parameters, the pulsar sample set and the data-partition strategy are optimized.
TABLE 3 Effect of different methods on HTRU2 data set
[Table 3 is given only as an image in the original]
5. Temporal complexity analysis
Let n be the number of samples of the experimental data set; the time complexities of the other algorithms (k-means++, McDPC, PNCN) are listed in Table 4. The time complexity of k-means++ is O(nkTM); since k, T and M are generally regarded as constants it simplifies to O(n). For McDPC, the computation of ρ and δ has complexity O(n²) and the clustering over the different density levels is also O(n²), so the complexity of the whole algorithm is O(n²). The complexity of PNCN is taken from its worst-case computation O(2nMK + FMK²/2), with F and M set to constant values. The serial time complexity of the hybrid clustering algorithm is O(n² + nkTM); since k, T and M are constants, this reduces to O(n²). On the parallel computing platform, according to the Sun-Ni theorem, the complexity becomes O((G(P)·m)²), where G(P) is a factor, m is the number of samples of Block(i) and m ≪ n; when the number of parallel nodes P is sufficient (the value of P approaches a certain threshold related to the number L of partitioned data blocks) and the communication overhead is negligible, G(P) → 1, i.e. the complexity approaches O(m²), which is slightly inferior to k-means++ and PNCN but superior to McDPC. This shows that the proposed scheme significantly reduces the running time while improving the clustering effect.
TABLE 4 Algorithm complexity
[Table 4 is given only as an image in the original]
a: T is the number of iterations, M is the number of features, F is the number of classes, m is the number of samples of Block(i), and k is the number of cluster centres.
Both the experimental analysis and the time-complexity analysis show that the proposed scheme is feasible and effective, and the performance indexes will improve further as the data grouping and the relevant parameters are optimized in actual scenarios. The unsupervised clustering method is better suited to the classification of large unlabelled data sets and to the situation in which the ratio of pulsar to non-pulsar sample data is extremely unbalanced.
6. Actual run time
FIG. 4 compares the average running time of the proposed method (parallel and serial) with those of McDPC, k-means++ and KNN under the same experimental setup. As can be seen from the figure, the average running time of the serial hybrid clustering is the longest, whereas that of the parallel hybrid clustering (23.07 s) is very short compared with the other methods. We therefore conclude that the proposed parallel scheme significantly reduces execution time while guaranteeing the classification performance.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; any simple modification, equivalent change or adaptation made to the above embodiment according to the technical essence of the present invention, without departing from that technical essence, still falls within the scope of the present invention.

Claims (4)

1. A parallel hybrid clustering method for candidate signal mining in pulsar search, comprising the following steps:
(1) clustering analysis of pulsar candidate signals:
calculating the density of the data points with a K-nearest-neighbour polynomial kernel function, screening out samples whose density value is smaller than the threshold 0.01, further judging through the candidate diagnostic plots whether these samples are noise or new astronomical phenomena, and eliminating the interference of outliers with too small a density;
combining the characteristics of the density-peak and hierarchical clustering processes, partitioning the multi-density cluster levels in the data set, merging micro-cluster groups with similar density and adjacent distance in the same region, and determining the initial cluster centre points;
assigning all data points and optimizing the cluster centres with k-means iterations based on the Gaussian radial basis (RBF) kernel distance, the similarity between sample data points being calculated with the kernel function so that the distance measure is mapped to a high-dimensional space;
(2) grouping the data set with a sliding-window strategy, dividing it by a specific window value Batchsize = 1160 and setting the sliding-window size w = 2; selecting 1600 relatively complete pulsar candidate feature samples of various types from real samples as a reference sample set, adding it to the data to be detected in each sliding window to form one data block, and thereby dividing the data set into several parallel data blocks of equal size;
(3) parallelizing the data blocks on the MapReduce/Spark computing model to perform the clustering.
2. A parallel hybrid clustering method for candidate signal mining in pulsar searching as claimed in claim 1, wherein said cluster analysis method in step (1) is:
firstly, data preprocessing is carried out: the pulsar candidate data of the PRESTO-based pulsar search pipeline are subjected to feature selection and dimensionality reduction with a feature extraction method and principal component analysis (PCA) to obtain a new feature-space input data set with feature vectors of dimension b; the selectable candidate physical feature values comprise pulse radiation (single-peaked, double-peaked and multi-peaked), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, incoherent power sum and coherent power;
② calculating the Mahalanobis distance between the data points i and j according to the formula (1)
d_ij = √((x_i - x_j)^T S^(-1) (x_i - x_j))  (1)
wherein S is the covariance matrix of the multi-dimensional random variable; then the local polynomial kernel density of each data point is calculated over its K nearest neighbours according to formula (2); the polynomial kernel function has a global character, so its generalization performance is strong;
[formula (2), the local polynomial kernel density ρ_i of data point i over its K nearest neighbours, with offset coefficient c and order d, is given only as an image in the original]
wherein c is the offset coefficient and d is the order of the polynomial; to eliminate the influence of differing ranges and magnitudes of the data, both d_ij and ρ_i are min-max normalized as follows;
d_ij' = (d_ij - min_d) / (max_d - min_d)  (3)
ρ_i' = (ρ_i - min_ρ) / (max_ρ - min_ρ)  (4)
wherein min_d and min_ρ denote the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ denote their maximum values;
③ removing outliers according to formula (5) and calculating the distance δ_i between non-outliers according to formula (6); culling the outliers facilitates the selection of the cluster centre points; in addition, data points with too low a density are few in number and marginal in distribution, and owing to their scarcity and low density they are anomalies of the data distribution, which may be pure noise or new astronomical phenomena (such as special pulsars); this portion of the data is subsequently examined further through the corresponding candidate diagnostic plots;
inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01  (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij (for the point of maximum density, δ_i = max_j d_ij)  (6)
④ generating one two-dimensional decision graph from all data points whose distance δ is greater than the threshold λ, in which the horizontal axis represents the density ρ and the vertical axis the distance δ; density-level micro-cluster groups are merged on the two-dimensional decision graph as follows: if the partition of the ρ axis or the δ axis contains two or more consecutive intervals without data points, this gap is called a vacant region; the vacant region divides all data points into two density regions, the rightmost one being called the maximum-density region and the remainder the low-density region;
(A) in the low-density region, because the discrimination is low, the micro-clusters of this region are all merged into one cluster;
(B) in the maximum-density region, if all representative points lie in the same δ interval, each of them is selected as an independent cluster centre; if they do not lie in the same δ interval, the distance discrimination between the representative points is low and they may belong to the same cluster, so the corresponding micro-clusters must be merged into one large cluster;
⑤ determining the cluster number k and the centre center_i of each cluster C_i (1 ≤ i ≤ k);
⑥ assigning each data point x_j, by the nearest-neighbour principle, to the cluster of the nearest center_i, with the RBF kernel distance as the similarity measure, as shown in formula (7); the RBF kernel function has a local character and strong learning ability, and through the RBF kernel distance the distance measure is mapped to a high-dimensional space;
K(x_j, center_i) = exp(-‖x_j - center_i‖² / (2η²))  (7)
wherein η denotes the kernel function width; the mean of all data points in the new cluster C_i' is calculated according to formula (8) as the new centre center_i', where n_i denotes the number of data points belonging to C_i';
center_i' = (1/n_i) Σ_{x_j ∈ C_i'} x_j  (8)
⑦ calculating the sum of squared errors SSE of all objects in the data set:
SSE = Σ_{i=1..k} Σ_{x_j ∈ C_i} ‖x_j - center_i‖²  (9)
the algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.
3. The parallel hybrid clustering method for candidate signal mining in pulsar search as claimed in claim 2, wherein the method of grouping the data set with the sliding-window strategy in step (2) is as follows: to screen candidates accurately to the greatest possible extent according to the data structure, the data are divided with the sliding-window concept; first a window size is defined (Batchsize = 1160) and the data set to be detected is divided equally into L batches (if the last batch is not full, it may be padded with data from batch 1); the sliding-window size is set to w = 2, the 1st round starts from batches 1 and 2, and in each round the window advances by one position to point to the corresponding data batches; the last round points to the combination of the last batch and batch 1, so L rounds of division are executed in total; a group of 1600 relatively complete pulsar candidate feature samples of various types is selected from real samples as a reference set and added in each round to the data covered by the sliding window to form one data block to be detected, so the data set is divided into L parallel data blocks to be detected; clustering rests on the basic assumption that instances in the same cluster are more likely to share the same label, so a decision boundary is set according to the dense or sparse regions of the distributions of the various data, the pulsar data distribution region is determined, and the regions of pulsar signals and non-pulsar interference signals are separated; the distribution density of pulsar samples in each cluster is calculated as a similarity statistic, and clusters whose pulsar-sample occupancy exceeds 50% are selected into the pulsar candidate list; the list of noise points excluded in step ③ of the cluster analysis may lead to the discovery of new phenomena.
4. The parallel hybrid clustering method for candidate signal mining in pulsar search according to claim 1 or 2, wherein the method of parallelizing the data blocks on the MapReduce/Spark computing model in step (3) to perform the clustering is as follows: for large-scale pulsar data processing it is, according to the Sun-Ni theorem, necessary to study the parallel implementation of the clustering algorithm on a MapReduce computing model; on the one hand the accuracy of the clustering result can be improved, and on the other hand the number of data comparisons can be reduced; the Sun-Ni theorem introduces a function G(p) to represent the increase of workload when storage capacity is limited, and states that, provided the time limit specified by the fixed-time speed-up is met and sufficient memory is available, the memory space can be used effectively by scaling the problem; the data are divided into L data blocks (Block(1), ..., Block(L)) with the sliding-window method described above and then processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster centres Clusters(i) (note that the input of the Map stage is a <key, value> pair in which key is the row number and value is the list of values of each dimension of the current sample, and the output of the Reduce stage is key.id, i.e. the initial cluster centres); finally, the Map2 and Reduce2 functions iteratively compute the distance of each data point in Block(i) to the cluster centres Clusters(i) and re-label the cluster to which it belongs, the new cluster centres being computed with the Reduce2 function in preparation for the next round of clustering; the distance between the cluster centres of the current round and the corresponding centres of the previous round is compared, and if the change is smaller than a given threshold the operation ends; otherwise the new cluster centres become the centres of the next round; after clustering ends, the pulsar clusters and the abnormal noise points are extracted; Spark is a general computing engine for large-scale data processing whose computing process is similar to MapReduce.
CN202210036692.XA 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search Active CN114386466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210036692.XA CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210036692.XA CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Publications (2)

Publication Number Publication Date
CN114386466A true CN114386466A (en) 2022-04-22
CN114386466B CN114386466B (en) 2024-04-05

Family

ID=81201874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210036692.XA Active CN114386466B (en) 2022-01-13 2022-01-13 Parallel hybrid clustering method for candidate signal mining in pulsar search

Country Status (1)

Country Link
CN (1) CN114386466B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170356976A1 (en) * 2016-06-10 2017-12-14 Board Of Trustees Of Michigan State University System and method for quantifying cell numbers in magnetic resonance imaging (mri)
CN113344019A (en) * 2021-01-20 2021-09-03 昆明理工大学 K-means algorithm for improving decision value selection initial clustering center

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王元超; 郑建华; 潘之辰; 李明涛: "Review of pulsar candidate classification methods" (脉冲星候选样本分类方法综述), Journal of Deep Space Exploration (深空探测学报), no. 03, 15 June 2018 (2018-06-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611025A (en) * 2023-05-19 2023-08-18 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116611025B (en) * 2023-05-19 2024-01-26 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116360956A (en) * 2023-06-02 2023-06-30 济南大陆机电股份有限公司 Data intelligent processing method and system for big data task scheduling
CN116360956B (en) * 2023-06-02 2023-08-08 济南大陆机电股份有限公司 Data intelligent processing method and system for big data task scheduling

Also Published As

Publication number Publication date
CN114386466B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Jimenez et al. Classification of hyperdimensional data based on feature and decision fusion approaches using projection pursuit, majority voting, and neural networks
US10902285B2 (en) Learning method and apparatus for pattern recognition
CN110990461A (en) Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN114386466B (en) Parallel hybrid clustering method for candidate signal mining in pulsar search
CN111178611B (en) Method for predicting daily electric quantity
CN110619231B (en) Differential discernability k prototype clustering method based on MapReduce
CN106845536B (en) Parallel clustering method based on image scaling
CN107301328B (en) Cancer subtype accurate discovery and evolution analysis method based on data flow clustering
CN103473786A (en) Gray level image segmentation method based on multi-objective fuzzy clustering
CN104346481A (en) Community detection method based on dynamic synchronous model
CN106570104B (en) Multi-partition clustering preprocessing method for stream data
CN112735536A (en) Single cell integrated clustering method based on subspace randomization
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
CN113052225A (en) Alarm convergence method and device based on clustering algorithm and time sequence association rule
CN113326862A (en) Audit big data fusion clustering and risk data detection method, medium and equipment
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN110659686A (en) Fuzzy coarse grain outlier detection method for mixed attribute data
CN113436223A (en) Point cloud data segmentation method and device, computer equipment and storage medium
CN112819208A (en) Spatial similarity geological disaster prediction method based on feature subset coupling model
Elhoussaine A fuzzy neighborhood rough set method for anomaly detection in large scale data
CN116484244A (en) Automatic driving accident occurrence mechanism analysis method based on clustering model
Zhou et al. Pre-clustering active learning method for automatic classification of building structures in urban areas
CN114792397A (en) SAR image urban road extraction method, system and storage medium
CN113205124A (en) Clustering method, system and storage medium under high-dimensional real scene based on density peak value
CN114185956A (en) Data mining method based on canty and k-means algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant