CN113095442A

CN113095442A - Hail identification method based on semi-supervised learning under multi-dimensional radar data

Info

Publication number: CN113095442A
Application number: CN202110624140.6A
Authority: CN
Inventors: 文立玉; 罗飞; 钟宇; 舒红平; 曹亮; 刘魁; 郭本俊
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2021-06-04
Filing date: 2021-06-04
Publication date: 2021-07-09
Anticipated expiration: 2041-06-04
Also published as: CN113095442B

Abstract

The invention provides a hail identification method based on semi-supervised learning under multi-dimensional radar data, comprising the following steps. S1: acquire a labeled sample set, randomly extract a supervision sample set from it, divide it into a rainstorm sample training set and a hail sample training set, acquire an unlabeled data set, and randomly divide the unlabeled data set into q equal parts, each referred to as a first sample. S2: calculate the cluster center of each cluster's training set. S3: assign the data of a first sample to the corresponding clusters and update the cluster centers. S4: iterate to obtain the cluster centers of all clusters and the confidence of each sample for the corresponding clusters at that point. S5: repeat steps S2-S4 on the supervision sample set to obtain its supervision confidence for each cluster center, and classify the supervision sample set into the corresponding clusters. S6: decide whether the first sample is merged into the clusters, and repeat steps S2-S6 until all first samples are processed. S7: use the optimal cluster centers as the input of the identification model to obtain the confidence of each sample for each cluster and classify it. The method effectively improves the accuracy of hail identification and reduces the false alarm rate.

Description

Hail identification method based on semi-supervised learning under multi-dimensional radar data

Technical Field

The invention belongs to the technical field at the intersection of artificial intelligence and meteorology, and particularly relates to a hail identification method based on semi-supervised learning under multi-dimensional radar data.

Background

Hail is a severe, localized, disastrous weather phenomenon that develops under particular geographic and terrain conditions and a certain large-scale circulation background. It occurs suddenly, moves rapidly, and is violent and highly destructive; it often causes heavy local losses in agriculture, transport, electric power, communications and other sectors, and can even threaten human life.

Accurate identification of hail is particularly important for hail forecasting and for relief after hail disasters. Effectively improving the accuracy of hail forecasting, so that the relevant departments are informed promptly and accurately and can take strong preventive measures, helps avoid as far as possible the heavy loss of life and property caused by hail. At present, hail identification based on Doppler weather radar products mainly relies on the support vector machine (SVM) and the K-means clustering algorithm; specifically:

1. The support vector machine is an early binary classification method that classifies data by supervised learning. In essence, it searches the feature space for the separating hyperplane with the largest geometric margin and uses that hyperplane to solve the binary classification of multi-dimensional data. However, when the support vector machine is applied to a linearly inseparable problem, test data that fall between the support vectors may be misclassified. Scholars have proposed combining the support vector machine with the k-nearest-neighbor method to reduce misclassification and improve identification accuracy;

however, this method is sensitive to the choice of parameters and kernel function. The performance of the support vector machine depends mainly on the kernel function, which at present is still chosen manually from experience and therefore introduces some error. The recognition accuracy of the model is not high, and the classification effect needs improvement.

2. The normal Bayes classifier assumes that the feature components involved in the computation are independent of one another, uses the input sample classes to estimate the prior probability that any component of the n-dimensional feature vector corresponds to hail, and classifies unknown samples according to that probability. Its principle is to perform machine learning on a large number of input samples of different classes, find the internal feature patterns of the samples of each class, and then classify unknown samples by finding those that best match these patterns;

this method relies on extensive training data. However, hail sample data are difficult to obtain and small in volume, so large-scale training is hard to carry out; the classification model therefore falls short of expectations and its judgments are limited.

3. The K-means clustering algorithm computes the cluster centers of k clusters of sample data and partitions the data according to the distance from each data object to each cluster center. If the cluster centers are unchanged between two consecutive iterations, the iteration ends; if they have changed, the distances of the sample data are recalculated using the updated centers of the k clusters. Iteration continues until the cluster centers no longer change after partitioning, which gives the clustering result;

this method ignores the differences between the n-dimensional features involved in the computation, and its recognition effect after classification needs improvement.

Disclosure of Invention

In view of this, the present invention provides a hail identification method based on semi-supervised learning under multi-dimensional radar data, which can effectively improve the accuracy of hail identification and reduce the false alarm rate.

To achieve this purpose, the technical scheme of the invention is as follows: a hail identification method based on semi-supervised learning under multi-dimensional radar data, comprising the following steps:

S1: acquiring a labeled sample set, extracting a supervision sample set from the labeled sample set by random sampling, dividing the labeled sample set into a rainstorm sample training set and a hail sample training set according to the rainstorm and hail labels, acquiring an unlabeled data set, and randomly dividing the unlabeled data set into q equal parts, each referred to as a first sample;

S2: calculating the mean vectors of the rainstorm sample training set and the hail sample training set as the cluster centers;

S3: calculating the weighted distance from a first sample to each cluster center, assigning the data in the first sample to the corresponding cluster (the rainstorm sample training set or the hail sample training set), and updating each cluster center using the method in S2;

S4: repeating step S3 until the cluster centers are the same as in the previous iteration, obtaining the cluster centers at that point, and taking the distance from each point of the first sample to each cluster center as its confidence for the corresponding cluster;

S5: repeating steps S2-S4 with the supervision sample set taking the place of the first sample in step S3, obtaining the supervision confidence of the supervision sample set for each cluster center, and classifying the data in the supervision sample set into the corresponding rainstorm or hail cluster according to the supervision confidence;

S6: calculating, from the data labels, the classification evaluation indexes of the model on the supervision sample set; when the indexes meet the preset conditions, merging the first sample into the rainstorm or hail clusters, calculating and retaining the mean vector of each cluster at that point, and assigning each datum in the first sample a pseudo label according to the cluster in which it finally lies; repeating S2-S6 until all q first samples are processed; then calculating the evaluation indexes of each retained mean vector on the final hail and rainstorm training sets, and selecting the mean vector with the best index as the final model;

S7: taking the mean vector of the final model as the input of the identification model, calculating the confidence of each sample to be identified for each cluster according to steps S3-S4, and classifying the samples to complete the identification.

Further, the labeled sample set and the unlabeled data set include radar base reflectivity images and Doppler weather radar series data.

Further, the mean vector in step S2 is calculated as:

$$\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad j = 1, 2, \dots, p$$

$$\mu = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)^T$$

wherein $\bar{x}_j$ is the mean of the $j$-th parameter over the samples involved in computing the mean vector, $N$ is the total number of samples involved in computing the mean vector, $x_{ij}$ is the value of the $j$-th parameter of the $i$-th sample, $p$ is the number of parameters of a single sample, $i$ is the index of the sample currently being processed, $j$ is the index of the parameter currently being processed, $\mu$ is the mean vector, and $T$ denotes matrix transposition.
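The mean-vector computation of step S2 can be sketched in plain Python as follows. This is a minimal illustration only: the function name `mean_vector` and the sample values are assumptions made for the example and do not come from the original disclosure.

```python
from typing import List, Sequence


def mean_vector(samples: Sequence[Sequence[float]]) -> List[float]:
    """Return the per-parameter mean of N samples, each with p parameters."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    p = len(samples[0])
    # Component j is the average of parameter j over all N samples.
    return [sum(row[j] for row in samples) / n for j in range(p)]


# Hypothetical 3-parameter hail training samples (illustrative values only).
hail_train = [[40.0, 8.0, 60.0], [50.0, 10.0, 66.0]]
hail_center = mean_vector(hail_train)
print(hail_center)
```

The same function would be applied once to the rainstorm training set and once to the hail training set to obtain the two cluster centers.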

Further, the weighted distance in step S3 is calculated by a weighted distance formula in which $p$ denotes the number of elements involved in the identification, $w_i$ denotes the weight of the $i$-th element, $\mu$ denotes the mean vector, and $\bar{x}_i$ denotes the mean of the $i$-th element, that is, the $i$-th component of $\mu$.
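The exact form of the weighted distance is not recoverable from this text. One plausible reading, assumed here purely for illustration, is a weighted Euclidean distance, $d(x, \mu) = \sqrt{\sum_{i=1}^{p} w_i (x_i - \bar{x}_i)^2}$, where $x_i$ is the $i$-th element of the sample and $\bar{x}_i$ the $i$-th component of the cluster center. A minimal Python sketch under that assumption (the function name and the numbers are hypothetical):

```python
import math
from typing import Sequence


def weighted_distance(x: Sequence[float],
                      center: Sequence[float],
                      weights: Sequence[float]) -> float:
    """ASSUMED weighted Euclidean distance between a sample and a cluster center."""
    if not (len(x) == len(center) == len(weights)):
        raise ValueError("x, center and weights must have the same length p")
    # Each squared per-element difference is scaled by that element's weight.
    return math.sqrt(sum(w * (xi - ci) ** 2
                         for w, xi, ci in zip(weights, x, center)))


# Hypothetical two-element example: only the first element differs.
d = weighted_distance([3.0, 0.0], [1.0, 0.0], [0.25, 0.75])
print(d)
```

With such a form, an element with a larger weight contributes more to the distance, which is how the method can account for differences between features that plain K-means ignores.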

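Further, the classification evaluation indexes in step S6 include the hit rate, the false alarm rate and the critical success index, wherein:

$$\mathrm{POD} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{hit}} + N_{\mathrm{miss}}}, \qquad \mathrm{FAR} = \frac{N_{\mathrm{fa}}}{N_{\mathrm{hit}} + N_{\mathrm{fa}}}, \qquad \mathrm{CSI} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{hit}} + N_{\mathrm{miss}} + N_{\mathrm{fa}}}$$

wherein POD is the hit rate, FAR is the false alarm rate, CSI is the critical success index, $N_{\mathrm{hit}}$ is the number of times the model identifies hail and the actual label is hail, $N_{\mathrm{miss}}$ is the number of times the model identifies no hail while the actual label is hail, and $N_{\mathrm{fa}}$ is the number of times the model identifies hail while the actual label is not hail;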
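The three indexes follow the standard contingency-table definitions used in forecast verification. A minimal Python sketch is given below; the function names and the counts are hypothetical and chosen only for illustration, since the text does not list raw counts.

```python
def pod(hits: int, misses: int) -> float:
    """Hit rate: fraction of actual hail cases that the model identified as hail."""
    total = hits + misses
    return hits / total if total else 0.0


def far(hits: int, false_alarms: int) -> float:
    """False alarm rate: fraction of hail identifications that were not hail."""
    total = hits + false_alarms
    return false_alarms / total if total else 0.0


def csi(hits: int, misses: int, false_alarms: int) -> float:
    """Critical success index: hits over hits, misses and false alarms combined."""
    total = hits + misses + false_alarms
    return hits / total if total else 0.0


# Hypothetical counts: 99 hits, 4 misses, 12 false alarms.
p, f, c = pod(99, 4), far(99, 12), csi(99, 4, 12)
print(round(p, 4), round(f, 4), round(c, 4))
```

CSI penalizes both misses and false alarms, which is why it is a natural single score for choosing between candidate models.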

Further, the preset conditions are as follows:

when POD is less than 90%, POD increases and its increase is greater than the increase in FAR;

when POD is greater than or equal to 90%, POD increases and its increase is greater than 2/3 of the increase in FAR.
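The preset conditions can be expressed as a small acceptance test. The sketch below is an illustration under stated assumptions: the function name `update_accepted` is hypothetical, and the 90% threshold is applied to the POD measured after the candidate update, which the text does not specify.

```python
def update_accepted(pod_old: float, pod_new: float,
                    far_old: float, far_new: float) -> bool:
    """Return True if a candidate model update meets the preset conditions."""
    d_pod = pod_new - pod_old
    d_far = far_new - far_old
    if d_pod <= 0:
        # POD must increase in both branches of the rule.
        return False
    if pod_new < 0.90:  # ASSUMPTION: threshold checked on the updated POD
        return d_pod > d_far
    # At or above 90% POD, only 2/3 of the FAR increase has to be outweighed.
    return d_pod > (2.0 / 3.0) * d_far


print(update_accepted(0.80, 0.85, 0.10, 0.12))  # POD +0.05 vs FAR +0.02
print(update_accepted(0.92, 0.93, 0.10, 0.12))  # POD +0.01 vs 2/3 * 0.02
print(update_accepted(0.92, 0.94, 0.10, 0.12))  # POD +0.02 vs 2/3 * 0.02
```

The looser second branch reflects that further POD gains are harder to obtain once the hit rate is already high, so a somewhat larger rise in false alarms is tolerated.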

Compared with the prior art, the invention has the following advantages:

The invention discloses a hail identification method based on semi-supervised learning under multi-dimensional radar data. Features are extracted from radar base reflectivity images and Doppler weather radar series data, hail cloud data are clustered with the Constrained Seed K-means algorithm, and a hail identification model is built using semi-supervised clustering and self-learning. Only a small number of hail samples and a large amount of unknown data are needed to construct the training set, which addresses the low identification accuracy caused by the scarcity of original hail training data. The method also performs weight analysis on the multi-dimensional features of hail and rainstorms and compares the hail feature variables side by side to obtain well-performing feature weights. A supervision sample set supervises the training process so that the model changes in the expected direction, and an automatic parameter optimization method is combined with it so that the trained model is optimized automatically. The accuracy of hail identification is thereby effectively improved while the false alarm rate is reduced.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.

FIG. 1 is a diagram showing the variation of the classification evaluation index in the present invention;

FIG. 2 is a diagram illustrating the classification result of test data according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The examples are given to better illustrate the invention, but the invention is not limited to them. Insubstantial modifications and adaptations that those skilled in the art make to the embodiments in light of the above teachings therefore remain within the scope of the invention.

Example 1

The embodiment discloses a hail identification method based on semi-supervised learning under multi-dimensional radar data, which comprises the following steps:

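S1: acquiring a labeled sample set, extracting a supervision sample set from the labeled sample set by random sampling, dividing the labeled sample set into a rainstorm sample training set and a hail sample training set according to the rainstorm and hail labels, acquiring an unlabeled data set, and randomly dividing the unlabeled data set into q equal parts, each referred to as a first sample;

In this embodiment, the labeled sample set includes radar base reflectivity images and Doppler weather radar series data;

Specifically, m labeled samples are read to form the labeled sample set X. A number of samples are drawn at random from X to form the supervision sample set, which is used to supervise the training process. The labeled samples are divided according to their labels into the rainstorm sample training set and the hail sample training set.

In the invention, because the initial training set is limited in size, the accuracy of a model obtained from it cannot be guaranteed in application. A large unlabeled data set D is therefore used to train the initial model, improving recognition accuracy and stability; accordingly, the unlabeled sample set D is read in.

An arbitrary p-dimensional sample x in the labeled sample set X or the unlabeled sample set D is expressed as:

$$x = (x_1, x_2, \dots, x_p)^T$$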

S2: calculating the mean vectors of the rainstorm sample training set and the hail sample training set as the cluster centers;

In this step, the mean vectors are used as the k cluster centers (where k = 2, comprising the rainstorm cluster and the hail cluster). The mean vector $\mu$ of the p-dimensional random samples is calculated as follows:

$$\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad j = 1, 2, \dots, p$$

$$\mu = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)^T$$

wherein $x_{ij}$ is the value of the $j$-th parameter of the $i$-th sample, $N$ is the number of samples involved in computing the mean vector, $p$ is the number of parameters of a single sample, $i$ is the index of the sample currently being processed, $j$ is the index of the parameter currently being processed, $\mu$ is the mean vector, $T$ denotes matrix transposition, and x denotes a single sample. The mean vectors of the hail cluster and the rainstorm cluster are calculated separately and taken as the initial cluster centers of the k clusters of samples;

S3: calculating the weighted distance from a first sample to each cluster center, assigning the data in the first sample to the corresponding cluster (the rainstorm sample training set or the hail sample training set), and updating each cluster center using the method in S2;

Next, each sample x is assigned to a cluster according to its distance to each cluster center. The invention uses a multi-element sample clustering method and weights the elements involved in the calculation. In the weighted distance formula, $p$ denotes the number of elements involved in the identification, $w_i$ denotes the weight of the $i$-th element, and $\mu$ denotes the mean vector;

Further, the unlabeled data set D is randomly divided into q equal parts. An unprocessed part is selected and added to the training process; the weighted distance from each sample x in this part to each cluster center is calculated with the weighted distance formula, and x is assigned to the corresponding cluster according to these distances. The cluster center of each cluster is then updated using the mean vector formula of step S2;

S4: repeating step S3 until the cluster centers are the same as in the previous iteration, obtaining the cluster centers at that point, and taking the distance from each point of the first sample to each cluster center as the confidence of that point for the corresponding cluster;

In this step, the updated cluster centers are used to repeat the above process on the samples x, and the iteration continues until the cluster centers are the same as in the previous iteration. The weighted distance from x to each cluster center is then calculated with the weighted distance formula and used as the confidence of x for each of the k clusters. The clustering process thus alternates between assigning each x to a cluster by weighted distance and updating the cluster centers;
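Steps S3 and S4 together amount to a seeded clustering loop: the labeled training samples stay in their clusters, while the samples of the current unlabeled part are reassigned until the centers stop changing. The Python sketch below illustrates this under two assumptions that are not confirmed by the text: the weighted distance is taken to be a weighted Euclidean distance, and a sample is assigned to the cluster with the smallest distance. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Per-parameter mean of the given samples (the cluster center of S2)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def weighted_distance(x: Sequence[float], center: Sequence[float],
                      weights: Sequence[float]) -> float:
    """ASSUMED weighted Euclidean distance."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center)))


def cluster_first_sample(seeds: Dict[str, List[Vector]],
                         first_sample: List[Vector],
                         weights: Sequence[float],
                         max_iter: int = 100) -> Tuple[Dict[str, Vector], List[str]]:
    """Assign one unlabeled part to the seeded clusters until the centers converge."""
    centers = {name: mean_vector(rows) for name, rows in seeds.items()}
    assignment: List[str] = []
    for _ in range(max_iter):
        # S3: assign every sample to the nearest center by weighted distance.
        assignment = [
            min(centers, key=lambda name: weighted_distance(x, centers[name], weights))
            for x in first_sample
        ]
        # S3: recompute each center from its seeds plus the samples assigned to it.
        new_centers = {
            name: mean_vector(list(seeds[name])
                              + [x for x, a in zip(first_sample, assignment) if a == name])
            for name in seeds
        }
        # S4: stop once the centers are the same as in the previous iteration.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment


seeds = {"hail": [[10.0, 10.0]], "rainstorm": [[0.0, 0.0]]}
part = [[9.0, 9.0], [1.0, 1.0], [8.0, 10.0]]
centers, assignment = cluster_first_sample(seeds, part, [0.5, 0.5])
print(assignment, centers)
```

The weighted distances from each sample to the converged centers can then serve as the per-cluster confidences described in S4.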

S5: repeating steps S2-S4 on the supervision sample set to obtain the supervision confidence of the supervision sample set for each cluster center, and classifying the data in the supervision sample set into the corresponding rainstorm or hail cluster according to the supervision confidence;

After a group of samples has been classified, it must be ensured that the updated model changes in the expected direction, so the model is supervised using the supervision sample set from step S1:

Specifically, following the confidence calculation of step S4, the confidence of each sample in the supervision sample set is calculated with respect to each cluster center of the training set to be updated.

According to its confidence for the k clusters, each sample in the supervision sample set is then classified into the corresponding rainstorm or hail cluster.

S6: calculating, from the data labels, the classification evaluation indexes of the model on the supervision sample set; when the indexes meet the preset conditions, merging the first sample into the rainstorm or hail clusters, assigning each datum in the first sample a pseudo label according to the cluster in which it finally lies, and repeating S2-S6 until all q first samples are processed;

Because the supervision sample set consists of labeled samples, the classification evaluation indexes of the model on it, namely the hit rate (POD), the false alarm rate (FAR) and the critical success index (CSI), can be calculated from the data labels as follows:

$$\mathrm{POD} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{hit}} + N_{\mathrm{miss}}}, \qquad \mathrm{FAR} = \frac{N_{\mathrm{fa}}}{N_{\mathrm{hit}} + N_{\mathrm{fa}}}, \qquad \mathrm{CSI} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{hit}} + N_{\mathrm{miss}} + N_{\mathrm{fa}}}$$

wherein $N_{\mathrm{hit}}$ is the number of times the model identifies hail and the actual label is hail, $N_{\mathrm{miss}}$ is the number of times the model identifies no hail while the actual label is hail, and $N_{\mathrm{fa}}$ is the number of times the model identifies hail while the actual label is not hail;

The model is considered to perform well when the supervision indexes meet either of the following conditions:

when POD < 90%, POD increases and its increase is greater than the increase in FAR;

when POD ≥ 90%, POD increases and its increase is greater than 2/3 of the increase in FAR;

If the model performs well, the first sample is merged into the clusters, and the mean vectors of the hail and rainstorm clusters are calculated and retained; each x is assigned a pseudo label according to the cluster in which it finally lies;

If the model's performance does not meet expectations, the first sample is discarded; the method then waits for the next unlabeled sample group to be input, or outputs the identification model at that point;

If the identification model is to be output, each mean vector retained during training is taken as model input and used to classify the final hail and rainstorm training sets, giving a corresponding CSI value; the mean vector with the highest CSI value is output as the final model.
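The supervised self-training loop of steps S2-S6 can be sketched as follows. This is a simplified illustration, not the patented procedure itself: each unlabeled part is assigned in a single pass rather than iterated to convergence, the weighted distance is assumed to be weighted Euclidean, the 90% threshold is applied to the updated POD, and the final selection of the best retained mean vector by CSI is omitted. All names and the one-dimensional data are hypothetical.

```python
import math


def mean_vector(rows):
    """Per-parameter mean of the given samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dist(x, center, w):
    """ASSUMED weighted Euclidean distance."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, center)))


def classify(x, centers, w):
    """Name of the cluster whose center is nearest to x."""
    return min(centers, key=lambda k: dist(x, centers[k], w))


def pod_far(centers, supervision, w):
    """POD and FAR of the current centers on the labeled supervision set."""
    hits = misses = fa = 0
    for x, label in supervision:
        pred = classify(x, centers, w)
        if pred == "hail" and label == "hail":
            hits += 1
        elif label == "hail":
            misses += 1
        elif pred == "hail":
            fa += 1
    pod = hits / (hits + misses) if hits + misses else 0.0
    far = fa / (hits + fa) if hits + fa else 0.0
    return pod, far


def self_train(train, parts, supervision, w):
    """Merge each unlabeled part only if the supervision indexes improve."""
    centers = {k: mean_vector(v) for k, v in train.items()}
    pod, far = pod_far(centers, supervision, w)
    kept = []  # mean vectors retained after each accepted update
    for part in parts:
        pseudo = [classify(x, centers, w) for x in part]  # pseudo labels
        cand = {k: train[k] + [x for x, a in zip(part, pseudo) if a == k]
                for k in train}
        new_centers = {k: mean_vector(v) for k, v in cand.items()}
        new_pod, new_far = pod_far(new_centers, supervision, w)
        d_pod, d_far = new_pod - pod, new_far - far
        factor = 1.0 if new_pod < 0.90 else 2.0 / 3.0
        if d_pod > 0 and d_pod > factor * d_far:
            train, centers, pod, far = cand, new_centers, new_pod, new_far
            kept.append(centers)
        # Otherwise the part is discarded and the model is left unchanged.
    return train, kept


train = {"hail": [[10.0]], "rainstorm": [[0.0]]}
supervision = [([4.0], "hail"), ([9.0], "hail"), ([1.0], "rainstorm")]
parts = [[[6.0], [5.5]], [[2.0]]]
final_train, kept = self_train(train, parts, supervision, [1.0])
print(len(final_train["hail"]), len(final_train["rainstorm"]), len(kept))
```

In this toy run the first part raises POD on the supervision set and is merged, while the second part would lower POD and is discarded, which is the behavior the preset conditions are meant to enforce.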

S7: carrying out automatic parameter tuning on the updated identification model;

This step optimizes the weights and parameters of the identification model updated in step S6. Specifically, the trained model is used to identify the test data and obtain the model's CSI value on the test set; with this CSI value as the reference value of a Bayesian optimization method, the parameters of the weighted distance formula (the VIL feature weight, the H_R_Max feature weight and the R_Max feature weight) are optimized automatically. The resulting optimized parameters are then written into the model;
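The tuning loop can be illustrated with a deliberately simpler stand-in. The sketch below uses plain random search, not Bayesian optimization, to show the shape of the loop: propose a weight vector, score it by CSI, and keep the best. It assumes that the weights are non-negative and sum to 1, and the `toy_csi` scoring function is a hypothetical placeholder for running the identification model on a test set.

```python
import random
from typing import Callable, List, Tuple


def random_weights(rng: random.Random, n: int) -> List[float]:
    """Draw n non-negative weights that sum to 1 (ASSUMED constraint)."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def tune_weights(score: Callable[[List[float]], float], n_weights: int,
                 trials: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Random-search stand-in for the Bayesian optimization step."""
    rng = random.Random(seed)
    best_w = [1.0 / n_weights] * n_weights  # start from equal weights
    best_s = score(best_w)
    for _ in range(trials):
        w = random_weights(rng, n_weights)
        s = score(w)
        if s > best_s:  # keep the candidate with the highest score so far
            best_w, best_s = w, s
    return best_w, best_s


def toy_csi(weights: List[float]) -> float:
    """Hypothetical stand-in for 'CSI of the model on the test set'."""
    return 1.0 - abs(weights[1] - 0.7)


best_w, best_s = tune_weights(toy_csi, 3)
print(best_w, best_s)
```

A Bayesian optimizer would replace the uniform proposals with proposals guided by a surrogate model of the score, which typically needs far fewer model evaluations.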

S8: taking the mean vector of the final model as the input of the identification model, calculating the confidence of each sample to be identified for each cluster according to steps S3-S4, and classifying the samples to complete the identification.

In this step, the cluster center of each cluster's training set obtained through the above steps is used as the input of the identification model. The weighted distance formula is then used to calculate the confidence of each sample to be identified for each cluster, and the samples are classified according to the confidence to complete the identification.
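Identification with the final model reduces to computing one confidence per cluster and picking a cluster. The sketch below uses the three feature weights quoted in Example 2; the cluster centers and the sample are hypothetical values, the weighted distance is assumed to be weighted Euclidean, and the smallest distance is assumed to indicate the assigned cluster.

```python
import math
from typing import Dict, Tuple

# Feature weights as quoted in Example 2.
WEIGHTS = {"VIL": 0.1369, "H_R_Max": 0.7220, "R_Max": 0.1411}

# Hypothetical final cluster centers (illustrative values only).
CENTERS = {
    "hail": {"VIL": 45.0, "H_R_Max": 7.0, "R_Max": 62.0},
    "rainstorm": {"VIL": 20.0, "H_R_Max": 3.0, "R_Max": 50.0},
}


def confidences(sample: Dict[str, float],
                centers: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted distance (ASSUMED Euclidean form) of the sample to each center."""
    return {
        name: math.sqrt(sum(weights[f] * (sample[f] - center[f]) ** 2
                            for f in weights))
        for name, center in centers.items()
    }


def identify(sample: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the assigned cluster and the per-cluster confidences."""
    conf = confidences(sample, CENTERS, WEIGHTS)
    # ASSUMPTION: the sample belongs to the cluster with the smallest distance.
    return min(conf, key=conf.get), conf


label, conf = identify({"VIL": 40.0, "H_R_Max": 6.5, "R_Max": 60.0})
print(label, conf)
```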

Example 2

Based on the method in example 1, this example proposes a specific implementation to train and test the method in example 1:

In this embodiment, 207 hail observation records are randomly split into 104 training samples and 103 test samples, and 2196 rainstorm observation records are randomly split into 1098 training samples and 1098 test samples. The unknown sample data are obtained as follows: the times at which hail occurred in 2019 are collected from the network, and the composite reflectivity files near those times are parsed to obtain 20850 samples, which serve as the unlabeled data set for training the identification model;
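The random splits described for this example can be sketched in Python as follows. Integer indices stand in for the actual radar records, the helper names and the fixed seed are hypothetical, and the group count of 1000 is taken from the paragraph that follows.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(items: Sequence[int], n_train: int,
                     rng: random.Random) -> Tuple[List[int], List[int]]:
    """Shuffle the items and cut them into a training part and a test part."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def split_into_groups(items: Sequence[int], q: int,
                      rng: random.Random) -> List[List[int]]:
    """Shuffle the items and deal them into q groups of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::q] for i in range(q)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
hail_train, hail_test = split_train_test(range(207), 104, rng)
storm_train, storm_test = split_train_test(range(2196), 1098, rng)
groups = split_into_groups(range(20850), 1000, rng)
print(len(hail_train), len(hail_test), len(storm_train), len(storm_test), len(groups))
```

Because 20850 is not a multiple of 1000, the groups cannot be exactly equal; dealing the shuffled items round-robin keeps the group sizes within one sample of each other.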

The 20850 unknown samples are randomly divided into 1000 groups and added to training according to the method in Example 1. The weight values used in training are: VIL feature weight: 0.1369; H_R_Max feature weight: 0.7220; R_Max feature weight: 0.1411. During training, the test data are classified with the model after every model update; the resulting change in the classification evaluation indexes is shown in FIG. 1, and the classification result is shown in FIG. 2. In FIG. 2, four marker shapes distinguish correctly identified hail points, correctly identified rainstorm points ("-"), incorrectly identified rainstorm points ("x"), and incorrectly identified hail points;

Further, taking hailfall as an example and keeping all other conditions unchanged, hail identification was also carried out with a semi-supervised support vector machine method; the resulting comparison is as follows:

Table: Comparison of hailfall detection results

From the above experimental results it can be seen that the method of the invention achieves an accuracy of 96.11% on the hail test data, with a false alarm rate of 10.81% and a critical success index of 86.08%, and that its overall performance is better than that of a comparable method (the semi-supervised support vector machine) given the same training data and test data.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A hail identification method based on semi-supervised learning under multi-dimensional radar data, characterized by comprising the following steps:

S1: acquiring a labeled sample set, extracting a supervision sample set from the labeled sample set by random sampling, dividing the labeled sample set into a rainstorm sample training set and a hail sample training set according to the rainstorm and hail labels, acquiring an unlabeled data set, and randomly dividing the unlabeled data set into q equal parts, each referred to as a first sample;

S6: calculating, from the data labels, the classification evaluation indexes of the model on the supervision sample set; when the indexes meet the preset conditions, merging the first sample into the rainstorm or hail clusters, calculating and retaining the mean vector of each cluster, and assigning each datum in the first sample a pseudo label according to the cluster in which it finally lies; repeating S2-S6 until all q first samples are processed; then calculating the evaluation indexes of each retained mean vector on the final hail and rainstorm training sets, and selecting the mean vector with the best index as the final model;

S7: taking the mean vector of the final model as the input of the identification model, calculating the confidence of each sample to be identified for each cluster according to steps S3-S4, and classifying the samples to complete the identification.

2. The method of claim 1, wherein the labeled sample set and the unlabeled data set comprise radar base reflectivity images and Doppler weather radar series data.

3. The method according to claim 1, wherein the classification evaluation index in step S6 includes: hit rate, false alarm rate, and critical success index.

4. The method according to claim 3, wherein the preset condition is:

when the hit rate is less than 90%, the hit rate increases and its increase is greater than the increase in the false alarm rate;

when the hit rate is greater than or equal to 90%, the hit rate increases and its increase is greater than 2/3 of the increase in the false alarm rate.