CN110930282B

CN110930282B - Local rainfall type analysis method based on machine learning

Info

Publication number: CN110930282B
Application number: CN201911243213.6A
Authority: CN
Inventors: 王帆
Original assignee: China Institute of Water Resources and Hydropower Research
Current assignee: China Institute of Water Resources and Hydropower Research
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-10-09
Anticipated expiration: 2039-12-06
Also published as: CN110930282A

Abstract

The invention discloses a machine-learning-based method for analyzing local rainfall patterns, comprising the following steps: 1) collecting, processing, and storing data; 2) automatically extracting rainfall events; 3) generating a local rainfall sample set; 4) performing GPU-accelerated cluster analysis of the rainfall events; and 5) analyzing the generated cluster trees to obtain representative rain patterns. The method automatically extracts rainfall events from station observation data in the basin and then uses machine learning to identify the most representative rainfall processes as the representative rain patterns of local rainfall. This greatly reduces the workload of manual analysis, avoids differences caused by subjective judgment, and yields results specific to the region, providing strong support for the analysis of critical rainfall for mountain torrents and for the numerical simulation of urban waterlogging.

Description

Local rainfall type analysis method based on machine learning

Technical Field

The invention belongs to the technical field of water conservancy engineering, in particular flood control forecasting, and specifically relates to a local rainfall type analysis method based on machine learning.

Background

In recent years, extreme rainstorm events have been frequent in China. Local rainstorms are sudden and short-lived, and they are the main triggers of mountain torrents and urban waterlogging. Besides rainfall amount and rainfall intensity, the rain pattern describes the rainstorm process: it shows how rainfall intensity is distributed over time and is one of the main disaster-causing characteristics of a rainstorm event. Even rainstorms with the same rainfall amount and intensity can differ in their disaster-causing characteristics when their rain patterns differ.

Because floods caused by rainstorms in hilly areas rise and fall steeply and are difficult to forecast in real time, mountain torrent early warning currently relies mainly on the dynamic critical rainfall method. Meanwhile, China is urbanizing rapidly, the flood control and drainage standards adopted by most cities are low, waterlogging disasters are frequent, and the impact of rainstorm waterlogging is usually assessed by numerical simulation. Research has shown that the rain pattern directly influences the critical rainfall determined for mountain torrent disasters and the maximum extent and maximum depth of urban waterlogging.

At present, when determining the critical rainfall for mountain torrents or numerically simulating urban waterlogging, the local rainstorm rain pattern is mainly derived with the same-frequency analysis method or taken from a design rain pattern. Commonly used design rain patterns include the Chicago, Huff, Pilgrim & Cordery, and Yen & Chow rain patterns. The same-frequency method requires considerable human intervention, and its results are subjective because experts select different samples and interpret them differently according to their experience. Design rain patterns such as the Chicago, Huff, and Yen & Chow patterns were generalized by foreign scholars from rainstorm samples of particular regions, so they differ to some extent from the actual local rainfall process, and no rain pattern is currently accepted as the basis for design.

Disclosure of Invention

The invention aims to provide a local rainfall type analysis method that addresses the above problems.

A local rainfall type analysis method based on machine learning comprises the following steps:

1) collecting, processing and storing data: collecting rainfall data from the hydrological and meteorological stations in the basin to be analyzed and processing them into equal-period time series;

2) automatic extraction of rainfall events: sequentially reading the continuous rainfall time series of every station in the database, dividing each series into independent rainfall events, and generating the rainfall time series of a plurality of rainfall events;

3) generating a local rainfall sample set: generating a sample set from the rainfall time series of the plurality of events extracted in step 2), wherein the elements of the sample set are independent, standardized rainfall events, and the number of elements in the set equals the number of rainfall events; dividing the sample set into a plurality of subsets according to the duration of each rainfall event;

4) rainfall event cluster analysis based on GPU acceleration: performing cluster analysis on each subset of the sample set generated in step 3) to generate a plurality of cluster trees, wherein the specific steps of the cluster analysis are as follows: 4-1, generating initial clusters: treating each element in the subset as an initial cluster; 4-2, calculating a distance matrix: the matrix size is (N × N), where N is the number of rainfall events contained in the subset; element (i, j) of the matrix is the distance between cluster i and cluster j and represents the similarity of rainfall event i and rainfall event j; the DTW distance is used as the similarity measure, and the smaller the distance, the stronger the similarity; the calculation of the matrix is accelerated by GPU parallel computing; 4-3, merging clusters based on the distance matrix of step 4-2: finding the two clusters with the shortest distance and merging them, renumbering the clusters, calculating the distance between the new cluster and each other cluster, and updating the distance matrix; 4-4, repeating step 4-3 until all clusters are merged into one cluster, thereby generating a cluster tree; 4-5, repeating steps 4-2 to 4-4 so that a corresponding cluster tree is generated for each subset in the sample set;

5) analyzing the cluster trees generated in step 4): taking the root node as layer 1 of the cluster tree, the n-th layer of the cluster tree contains n nodes, each node being one cluster with one cluster center; traversing the node clusters of the given layer and calculating, for each node cluster, the distance matrix of the rainfall events it contains, the matrix size being $(m_i \times m_i)$, $i = 1, 2, \ldots, n$, where n is the number of nodes in the layer and $m_i$ is the number of rainfall events contained in the i-th node; element (i, j) of the matrix is the DTW distance between rainfall event i and rainfall event j; after the distance matrix of a node cluster is calculated, the sum of each of its rows is calculated, and the rainfall event corresponding to the row index with the smallest sum is the cluster center of that node cluster, namely a representative rain pattern of local rainfall in the basin.

Further, the time span of the rainfall data in step 1) covers at least 10 years.

Further, the method for dividing rainfall events in step 2) is as follows: a time threshold is set; when the gap within a rainfall process exceeds the threshold, the process is regarded as two rainfall processes, and when the gap is less than the threshold, it is regarded as one rainfall process; a magnitude threshold is also set, and when the total rainfall of one rainfall process is lower than the magnitude threshold, the process is regarded as trace rainfall and is not considered.

Further, the rainfall events in step 3) are standardized as follows:

$$P'_i = \frac{\sum_{j=1}^{i} P_j}{\sum_{j=1}^{n} P_j}, \qquad i = 1, 2, \ldots, n$$

where n is the length of the rainfall event, $P'_i$ is a point of the standardized rainfall sequence, $P_j$ is a point of the original rainfall sequence, and i and j are the time indices of the standardized sequence and the original sequence, respectively.

Further, the subset dividing method in step 3) is as follows: the overall range of durations is divided into a plurality of intervals according to the duration of each event in the rainfall event set, and the events whose rainfall durations fall in the same interval are extracted to form one subset.

Further, the DTW distance is calculated as follows: for time series $X = \{x_1, x_2, \ldots, x_i, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_j, \ldots, y_n\}$, where m and n are the lengths of the two time series, a warping path $W = \{w_1, w_2, \ldots, w_k, \ldots, w_K\}$ with $\max(n, m) \le K \le n + m - 1$ is found to represent the mapping between X and Y. The k-th element of W, written $w_k = (i, j)$, denotes the correspondence between the i-th element of X and the j-th element of Y. The cumulative distance of point (i, j) is

$$\gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$

Given the initial condition $\gamma(1, 1) = d(x_1, y_1)$, the cumulative distance matrix is obtained by iterative calculation, and

$$DTW(X, Y) = \gamma(m, n)$$

is the DTW distance between time series X and Y.

Further, the specific method for accelerating the calculation of the matrix with GPU parallel computing in step 4-2 of step 4) is as follows: one thread is assigned to each matrix element and is responsible for calculating its DTW distance. First, the number of threads per thread block is assigned: for a two-dimensional matrix operation, each thread block is given (tb, tb) threads, where $tb^2$ must be smaller than the maximum number of threads per thread block allowed by the GPU. Second, the thread blocks of the thread grid are assigned: for a two-dimensional matrix operation, the thread grid may be given (bg, bg) thread blocks, where bg is chosen so that $bg \times tb \ge N$, N is the number of samples in the rainfall event sample subset, and bg must be smaller than the maximum number of thread blocks per thread grid allowed by the GPU. After thread allocation is completed, the GPU calculates the DTW distance of every matrix element and returns the results to CPU memory, thereby completing the calculation of the distance matrix.

The invention has the beneficial effects that:

Rainfall events are automatically extracted from the station observation data of a basin (region), and a machine learning method then identifies the most representative rainfall processes as the representative rain patterns of local rainfall, which greatly reduces the workload of manual analysis and avoids differences caused by subjective judgment.

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention;

FIG. 2 is a schematic illustration of rainfall data interpolation;

FIG. 3 shows the dynamic warping path of the time series;

FIG. 4 shows the set of rainfall events;

FIG. 5 is rainfall sample example 1;

FIG. 6 is rainfall sample example 2;

FIG. 7 is rainfall sample example 3;

FIG. 8 is rainfall sample example 4;

FIG. 9 is rainfall sample example 5;

FIG. 10 is rainfall sample example 6;

FIG. 11 is the cluster tree of subset 1;

FIG. 12 is the cluster tree of subset 2;

FIG. 13 is the cluster tree of subset 3;

FIG. 14 is the cluster tree of subset 4;

FIG. 15 is the cluster tree of subset 5;

FIG. 16 is the cluster tree of subset 6;

FIG. 17 is cluster 1 of subset 1;

FIG. 18 is cluster 2 of subset 1;

FIG. 19 is cluster 3 of subset 1;

FIG. 20 is cluster 4 of subset 1;

FIG. 21 is cluster 5 of subset 1;

FIG. 22 is representative rain pattern 1 of subset 1;

FIG. 23 is representative rain pattern 2 of subset 1;

FIG. 24 is representative rain pattern 3 of subset 1;

FIG. 25 is representative rain pattern 4 of subset 1;

FIG. 26 is representative rain pattern 5 of subset 1.

Detailed Description

Example 1

1) data collection, processing and storage

Data collection: rainfall observation data are collected from the hydrological and meteorological stations in the basin (region) to be analyzed. Because cluster analysis requires a large amount of data, the rainfall data must span at least 10 years.

Data processing: the rainfall data are processed into an equal-period time series. If the original data have unequal periods, they are interpolated, preferably along the cumulative rainfall curve: as shown in FIG. 2, the cumulative rainfall curve is first obtained from the original sequence, and the equal-period rainfall time series $\{P'_1, P'_2, P'_3, \ldots, P'_{12}\}$ is then derived from it.

Data storage: the processed equal-period rainfall time series are stored in a database.
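The interpolation along the cumulative rainfall curve can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes linear interpolation between the points of the cumulative curve, and the function name `resample_equal_period` and its parameters are hypothetical.

```python
import math
from bisect import bisect_right


def resample_equal_period(end_times, amounts, step, t0=0.0):
    """Convert unequal-period rainfall into an equal-period series.

    end_times: end time of each observation period (hours, ascending);
               the first period is assumed to start at t0.
    amounts:   rainfall (mm) accumulated within each period.
    step:      length of the target equal period (hours).
    """
    # Build the cumulative rainfall curve, starting from zero at t0.
    cum_t, cum_p = [t0], [0.0]
    for t, p in zip(end_times, amounts):
        cum_t.append(t)
        cum_p.append(cum_p[-1] + p)

    def cumulative_at(t):
        # Linear interpolation on the cumulative curve (an assumption),
        # clamped to the first and last points.
        if t <= cum_t[0]:
            return cum_p[0]
        if t >= cum_t[-1]:
            return cum_p[-1]
        k = bisect_right(cum_t, t)  # cum_t[k - 1] <= t < cum_t[k]
        frac = (t - cum_t[k - 1]) / (cum_t[k] - cum_t[k - 1])
        return cum_p[k - 1] + frac * (cum_p[k] - cum_p[k - 1])

    # Rainfall of each equal period is the increment of the cumulative curve.
    n_steps = math.ceil((cum_t[-1] - t0) / step)
    return [
        cumulative_at(t0 + k * step) - cumulative_at(t0 + (k - 1) * step)
        for k in range(1, n_steps + 1)
    ]
```

Because every equal-period value is a difference of the same cumulative curve, the total rainfall of the original record is preserved.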

2) Automatic extraction of rainfall events

The rainfall time series of each station in the database is read in turn and divided into independent rainfall events. Taking a rainfall time series $\{P_1, P_2, P_3, \ldots, P_t\}$ and its corresponding timestamp sequence $\{T_1, T_2, T_3, \ldots, T_t\}$ as an example, the division method is as follows. A time threshold $Th_T$ is set: when the gap $T_j - T_i$ within a rainfall process exceeds $Th_T$, the process is regarded as two rainfall processes; when the gap does not exceed $Th_T$, it is regarded as one rainfall process. This allows rainfall events to be divided automatically and continuously. A magnitude threshold $Th_A$ is also set: when the total rainfall of one rainfall process is lower than $Th_A$, the process is regarded as trace rainfall and is not considered. Traversing the rainfall time series of every station in this way yields N rainfall sequences $\{P_{i1}, P_{i2}, \ldots, P_{ik}\}$ and their timestamp sequences $\{T_{i1}, T_{i2}, \ldots, T_{ik}\}$, where $i = 1, 2, \ldots, N$, N is the number of rainfall events, and k is the number of periods in the corresponding rainfall event.
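The two-threshold division can be sketched for a single station as follows. This is a minimal illustration under stated assumptions: the gap is measured between consecutive periods with non-zero rainfall, dry periods inside an event are kept in its sequence, and the name `extract_events` and its arguments are hypothetical.

```python
def extract_events(timestamps, rainfall, th_t, th_a):
    """Split one station's equal-period series into independent events.

    timestamps: period times in hours, ascending.
    rainfall:   rainfall (mm) of each period.
    th_t:       time threshold Th_T (hours) separating two events.
    th_a:       magnitude threshold Th_A (mm); smaller totals are trace rainfall.
    Returns a list of (timestamp_sequence, rainfall_sequence) pairs.
    """
    spans = []       # (first wet index, last wet index) of each event
    start = None     # index where the current event began
    last_wet = None  # index of the most recent period with rainfall
    for idx, (t, p) in enumerate(zip(timestamps, rainfall)):
        if p <= 0:
            continue
        if start is None:
            start = idx
        elif t - timestamps[last_wet] > th_t:
            # The dry gap exceeds Th_T: close the event and open a new one.
            spans.append((start, last_wet))
            start = idx
        last_wet = idx
    if start is not None:
        spans.append((start, last_wet))

    events = []
    for s, e in spans:
        series = rainfall[s:e + 1]
        # Events whose total is lower than Th_A are discarded as trace rainfall.
        if sum(series) >= th_a:
            events.append((timestamps[s:e + 1], series))
    return events
```

With hourly data and the thresholds of the embodiment, a call would look like `extract_events(hours, values, th_t=6, th_a=10)`.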

3) Generating local rainfall sample sets

Using the N rainfall sequences $\{P_{i1}, P_{i2}, \ldots, P_{ik}\}$ extracted in step 2), where $i = 1, 2, \ldots, N$, N is the number of rainfall events, and k is the number of periods in the corresponding rainfall event, a sample set containing N elements is generated, each element being an independent rainfall event. The rainfall events are standardized with the formula

$$P'_i = \frac{\sum_{j=1}^{i} P_j}{\sum_{j=1}^{n} P_j}, \qquad i = 1, 2, \ldots, n$$

where n is the length of the rainfall event, $P'_i$ is a point of the standardized rainfall sequence, and $P_j$ is a point of the original rainfall sequence. The sample set is then divided into a plurality of subsets according to the duration of each rainfall event.
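A possible reading of the standardization step is sketched below. It assumes that the standardized sequence is the cumulative rainfall divided by the event total, which matches the embodiment's description of standardizing the accumulated rainfall process; the function name `standardize_event` is hypothetical.

```python
def standardize_event(rainfall):
    """Return the cumulative rainfall of an event as a fraction of its total.

    Assumption: P'_i = (P_1 + ... + P_i) / (P_1 + ... + P_n).
    The result rises monotonically and ends at 1.0.
    """
    total = sum(rainfall)
    if total <= 0:
        raise ValueError("a rainfall event must have a positive total")
    standardized = []
    running = 0.0
    for p in rainfall:
        running += p
        standardized.append(running / total)
    return standardized
```

Under this assumption, events with different totals become comparable because each sample describes only how the rainfall is distributed over the event.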

4) Rainfall event cluster analysis based on GPU acceleration

Performing cluster analysis based on each subset of the sample set generated in step 3) to generate a cluster tree. The clustering analysis comprises the following specific steps:

4-1. Generate the initial clusters, treating each element in the subset as one initial cluster: for a subset $D = \{x_1, x_2, \ldots, x_N\}$ of N data objects, set the initial cluster set $C = \{C_1, C_2, \ldots, C_N\}$, where $C_j = \{x_j\}$;

4-2. Calculate the first distance matrix $Matrix_F$. The matrix size is (N × N), where N is the number of rainfall events contained in the subset. Element (i, j) of the matrix is the similarity between cluster i and cluster j; for the first distance matrix, element (i, j) is the similarity between rainfall event i and rainfall event j. The DTW distance is used as the similarity measure: the smaller the distance, the stronger the similarity. The DTW distance is calculated as follows:

For time series $X = \{x_1, x_2, \ldots, x_i, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_j, \ldots, y_n\}$, a warping path W is found to represent the mapping between X and Y, as shown in FIG. 3, where $W = \{w_1, w_2, \ldots, w_k, \ldots, w_K\}$ with $\max(n, m) \le K \le n + m - 1$, and the k-th element of W, written $w_k = (i, j)$, denotes the correspondence between the i-th element of X and the j-th element of Y. The choice of warping path is subject to three constraints: the path starts at the first element of the matrix and ends at the diagonally opposite element, i.e. $w_1 = (1, 1)$ and $w_K = (m, n)$; the path is continuous at every step, i.e. for $w_k = (a, b)$ and $w_{k-1} = (a', b')$, $a - a' \le 1$ and $b - b' \le 1$; and the path is monotonic along the time axis, i.e. for $w_k = (a, b)$ and $w_{k-1} = (a', b')$, $a - a' \ge 0$ and $b - b' \ge 0$.

Many paths satisfy these constraints; the one sought here is the path with the minimum warping cost, that is:

$$DTW(X, Y) = \min_{W} \sum_{k=1}^{K} d(w_k)$$

where $d(w_k)$ is the distance between the two elements that $w_k$ associates.

According to the idea of dynamic programming, if point (i, j) lies on the optimal path, then the sub-path from point (1, 1) to point (i, j) is also a locally optimal solution; that is, the optimal path from point (1, 1) to point (m, n) can be obtained by recursively searching for locally optimal solutions from the start point (1, 1) to the end point (m, n). To find the optimal path conveniently, an m × n distance matrix is constructed, whose element (i, j) is the distance between the two time series points $x_i$ and $y_j$, $d(x_i, y_j) = (x_i - y_j)^2$. The cumulative distance of point (i, j) is defined as:

$$\gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$

Given the initial condition $\gamma(1, 1) = d(x_1, y_1)$, the cumulative distance matrix can be obtained by iterative calculation, and

$$DTW(X, Y) = \gamma(m, n)$$

is the DTW distance between time series X and time series Y. The best matching path can be obtained by searching the cumulative distance matrix backward from the point $\gamma(m, n)$.
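The recursion can be written out directly. The sketch below is a plain CPU illustration of the cumulative distance matrix with the squared point distance $d(x_i, y_j) = (x_i - y_j)^2$; the function name `dtw_distance` is hypothetical, and the extra border row and column of infinities is only an implementation convenience that reproduces the initial condition $\gamma(1, 1) = d(x_1, y_1)$.

```python
def dtw_distance(x, y):
    """DTW distance between two sequences via the cumulative distance matrix."""
    m, n = len(x), len(y)
    inf = float("inf")
    # gamma[i][j] holds the cumulative distance of point (i, j), 1-based.
    # Row 0 and column 0 are a border of infinities, except gamma[0][0] = 0,
    # so that gamma[1][1] equals d(x_1, y_1).
    gamma = [[inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = d + min(
                gamma[i - 1][j - 1],  # diagonal step
                gamma[i - 1][j],      # step along X only
                gamma[i][j - 1],      # step along Y only
            )
    return gamma[m][n]
```

Two sequences that differ only by local stretching, such as `[0, 1, 2]` and `[0, 1, 1, 2]`, have a DTW distance of zero, which is why the measure suits rainfall events of different lengths.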

The time complexity of calculating the first distance matrix is

$$O(N^2)$$

where N is the number of elements in the sample set, and the time complexity of each DTW distance calculation is $O(m \cdot n)$, where m and n are the lengths of the two rainfall event time series. The calculation of the first distance matrix is therefore often very time-consuming, and conventional methods can hardly meet the needs of large-volume data analysis.

The calculations of the individual matrix elements are independent of each other and therefore suitable for parallel computing, so the calculation of the first distance matrix is accelerated with GPU parallel computing. The specific method is as follows: one thread is assigned to each matrix element and is responsible for calculating its DTW distance. First, the number of threads per thread block is assigned; the maximum number of threads in a thread block depends on the GPU. For a two-dimensional matrix operation, each thread block is given (tb, tb) threads, where $tb^2$ must be smaller than the maximum number of threads per thread block allowed by the GPU. Second, the thread blocks of the thread grid are assigned: for a two-dimensional matrix operation, the thread grid may be given (bg, bg) thread blocks, where bg is chosen so that $bg \times tb \ge N$ and is smaller than the maximum number of thread blocks per thread grid allowed by the GPU, and N is the number of samples in the subset. After thread allocation is completed, the GPU calculates the DTW distance of every matrix element and returns the results to CPU memory, thereby completing the calculation of the first distance matrix.
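The thread allocation can be illustrated without a GPU by computing the launch dimensions and the matrix element that each thread would handle. This is a sketch under assumptions: the limit of 1024 threads per block is only a typical value, the rule `bg = ceil(N / tb)` is one choice that makes the grid cover the matrix, the bound on `tb * tb` is treated as "does not exceed", and the names `launch_config` and `element_for_thread` are hypothetical.

```python
import math


def launch_config(n_samples, max_threads_per_block=1024):
    """Return ((tb, tb), (bg, bg)) for an N x N distance matrix.

    Assumptions: tb is the largest integer with tb * tb not exceeding the
    per-block thread limit, and bg = ceil(N / tb) so that bg * tb >= N.
    """
    tb = math.isqrt(max_threads_per_block)
    bg = (n_samples + tb - 1) // tb  # integer ceiling of N / tb
    return (tb, tb), (bg, bg)


def element_for_thread(block_idx, thread_idx, tb, n_samples):
    """Map a (block, thread) pair to the matrix element (i, j) it computes.

    Threads that fall outside the N x N matrix return None and do no work.
    """
    i = block_idx[0] * tb + thread_idx[0]
    j = block_idx[1] * tb + thread_idx[1]
    if i < n_samples and j < n_samples:
        return (i, j)
    return None
```

For the largest subset of the embodiment (1051 samples), this sketch yields blocks of 32 × 32 threads and a grid of 33 × 33 blocks; in a real kernel each in-range thread would evaluate one DTW distance.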

4-3. Merge clusters based on the first distance matrix: find the two closest clusters $C_{i^*}$ and $C_{j^*}$ and merge them, $C_{i^*} = C_{i^*} \cup C_{j^*}$; renumber the clusters; delete row $j^*$ and column $j^*$ of the current distance matrix M; calculate the distance between the new cluster and each other cluster; and update the distance matrix.

4-4. Repeat the previous step until all clusters are merged into one cluster, thereby generating a cluster tree.

4-5. Repeat steps 4-2 to 4-4 so that a corresponding cluster tree is generated for each subset in the sample set.
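The merging loop of steps 4-3 and 4-4 can be sketched on a precomputed distance matrix. The text does not state how the distance between a merged cluster and the other clusters is computed, so the sketch assumes average linkage; this is an assumption, not the patented rule. The names `agglomerative_tree` and `clusters_at_layer` are hypothetical.

```python
def agglomerative_tree(dist):
    """Build a cluster tree from a symmetric distance matrix.

    Returns the merges in order as (id_a, id_b, distance, members); original
    items have ids 0..N-1 and each merged cluster receives the next free id.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    # Current inter-cluster distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Average linkage (assumed): size-weighted mean of the two distances.
            updated[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        members = sorted(clusters[a] + clusters[b])
        # Drop the rows and columns of the merged clusters, then add the new one.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        del clusters[a], clusters[b]
        for c, value in updated.items():
            d[(c, next_id)] = value
        clusters[next_id] = members
        merges.append((a, b, d_ab, members))
        next_id += 1
    return merges


def clusters_at_layer(merges, n_items, layer):
    """Replay merges until `layer` clusters remain (layer 1 is the root)."""
    clusters = {i: [i] for i in range(n_items)}
    next_id = n_items
    for a, b, _, members in merges[: n_items - layer]:
        del clusters[a], clusters[b]
        clusters[next_id] = members
        next_id += 1
    return sorted(clusters.values())
```

Replaying only the first N - n merges gives the n clusters of the n-th layer, which is the partition used when cluster centers are extracted.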

5) Clustering center extraction and local rainfall pattern analysis

The cluster trees generated in step 4) are analyzed. Taking the root node as layer 1 of the cluster tree, the n-th layer contains n nodes, and each node is one cluster with one cluster center. The node clusters of each layer are traversed downward from the root node, and for each node cluster the distance matrix $Matrix_D$ of the rainfall events it contains is calculated. The matrix size is $(m_i \times m_i)$, $i = 1, 2, \ldots, n$, where n is the number of nodes in the layer, i is the node index, and $m_i$ is the number of rainfall events contained in the node. Element (i, j) of the matrix is the DTW distance between rainfall event i and rainfall event j (calculated as described for the DTW distance in step 4). The distance matrix $Matrix_D$ is calculated first, and then the sum of each row; the rainfall event corresponding to the row index with the smallest sum is the cluster center, namely a representative rain pattern of local rainfall in the basin (region).
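The row-sum rule selects the member with the smallest total DTW distance to all other members of its cluster. A minimal sketch follows, assuming the pairwise distances are already available in a full matrix indexed by event number; the function name `cluster_center` is hypothetical.

```python
def cluster_center(members, dist):
    """Return the member whose summed distance to the cluster is smallest.

    members: indices of the rainfall events in one node cluster.
    dist:    full symmetric distance matrix indexed by event number.
    Ties are resolved in favor of the member listed first.
    """
    best, best_sum = None, float("inf")
    for i in members:
        # Row sum of the cluster's own distance matrix for event i.
        row_sum = sum(dist[i][j] for j in members)
        if row_sum < best_sum:
            best, best_sum = i, row_sum
    return best
```

Because the center is an observed event rather than an average of events, the representative rain pattern is always a rainfall process that actually occurred in the basin.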

In this embodiment: local rainfall type analysis is carried out on a certain sub-basin of the Yangtze river basin, the basin area is 572 square kilometers, 18 rainfall stations are arranged in the basin, and the station rainfall data time span is 15 years.

The thresholds $Th_A$ = 10 mm, $Th_T$ = 6 h, and $Th_L$ are set. After automatic extraction, 3092 rainfall events are obtained, with durations of 3 to 120 hours, as shown in FIG. 4.

The cumulative rainfall process of each rainfall event is calculated and standardized, and the standardized cumulative rainfall time series are taken as samples. The samples are divided into 6 subsets according to the rainfall duration intervals [3, 6), [6, 12), [12, 24), [24, 48), [48, 96), and [96, 192) (in hours), containing 521, 874, 1051, 542, 97, and 7 samples, respectively. Some of the rainfall processes and samples are shown in FIGS. 5 to 10.
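The division into duration subsets can be sketched with the interval edges of this embodiment. The sketch assumes that an event's duration is its number of periods multiplied by the period length, and the names `EDGES`, `subset_index`, and `split_by_duration` are hypothetical.

```python
from bisect import bisect_right

# Interval edges in hours: [3, 6), [6, 12), [12, 24), [24, 48), [48, 96), [96, 192)
EDGES = (3, 6, 12, 24, 48, 96, 192)


def subset_index(duration_hours, edges=EDGES):
    """Return the 0-based index of the half-open interval holding the duration."""
    if duration_hours < edges[0] or duration_hours >= edges[-1]:
        return None  # outside every interval
    return bisect_right(edges, duration_hours) - 1


def split_by_duration(events, step_hours=1.0, edges=EDGES):
    """Group events (lists of per-period values) into duration subsets.

    Assumption: duration = number of periods * step_hours.
    """
    subsets = [[] for _ in range(len(edges) - 1)]
    for event in events:
        k = subset_index(len(event) * step_hours, edges)
        if k is not None:
            subsets[k].append(event)
    return subsets
```

Grouping by duration keeps sequences of similar length together before clustering, so each cluster tree compares rainfall processes on a similar time scale.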

Cluster analysis is performed on each subset, giving the cluster trees shown in FIGS. 11 to 16.

Cluster centers are extracted from layer 5 of the generated cluster trees. Taking the first subset, i.e. rainfall events lasting 3 to 5 hours, as an example: from the cluster tree generated for the first subset, 5 representative rain patterns of 3-to-5-hour rainfall in the basin are obtained. The 5 clusters and their cluster centers (representative rain patterns) are shown in FIGS. 17 to 26.

Claims

1. A local rainfall type analysis method based on machine learning is characterized in that: the method comprises the following steps:

3) generating a local rainfall sample set: generating a sample set by utilizing the rainfall time sequences of the plurality of occasions extracted in the step 2), wherein the elements of the sample set are independent rainfall events and are subjected to standardization treatment, and the number of the elements in the set is the same as that of the rainfall occasions; dividing the sample set into a plurality of subsets according to different duration of each rainfall event; the subset dividing method in the step 3) comprises the following steps: dividing the total duration into a plurality of time intervals according to the duration of each event in the rainfall event set, and extracting the events with the rainfall durations in the same interval to generate a subset;

4) rainfall event cluster analysis based on GPU acceleration: performing cluster analysis on each subset of the sample set generated in step 3) to generate a plurality of cluster trees, wherein the specific steps of the cluster analysis are: 4-1. generating initial clusters: taking each element in the subset as an initial cluster; 4-2. calculating a distance matrix: the matrix size is N × N, where N is the number of rainfall events contained in the subset; element (i, j) of the matrix is the distance between cluster i and cluster j and represents the similarity of rainfall event i and rainfall event j; the DTW distance is used as the similarity measure, and the smaller the distance, the stronger the similarity; the calculation of the matrix is accelerated by GPU parallel computing; 4-3. merging clusters based on the distance matrix of step 4-2: finding the two closest clusters and merging them, renumbering the clusters, calculating the distance between the new cluster and each other cluster, and updating the distance matrix; 4-4. repeating step 4-3 until all clusters are merged into one cluster, thereby generating a cluster tree; 4-5. repeating the above clustering steps so that a corresponding cluster tree is generated for each subset in the sample set; the DTW distance is calculated as follows: for time series $X = \{x_1, x_2, \ldots, x_i, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_j, \ldots, y_n\}$, where m and n are the lengths of the two time series, a warping path $W = \{w_1, w_2, \ldots, w_k, \ldots, w_K\}$ with $\max(n, m) \le K \le n + m - 1$ is found to represent the mapping between X and Y; the k-th element of W, written $w_k = (i, j)$, denotes the correspondence between the i-th element of X and the j-th element of Y; the cumulative distance of point (i, j) is

$$\gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$

given the initial condition $\gamma(1, 1) = d(x_1, y_1)$, the cumulative distance matrix is obtained by iterative calculation, and

$$DTW(X, Y) = \gamma(m, n)$$

is the DTW distance between time series X and time series Y;

2. The machine learning-based local rainfall pattern analysis method according to claim 1, characterized in that: the time span of the rainfall data in step 1) covers at least 10 years.

3. The machine learning-based local rainfall pattern analysis method according to claim 1, characterized in that: the method for dividing rainfall events in step 2) is as follows: a time threshold is set; when the gap within a rainfall process exceeds the threshold, the process is regarded as two rainfall processes, and when the gap is less than the threshold, it is regarded as one rainfall process; a magnitude threshold is also set, and when the total rainfall of one rainfall process is lower than the magnitude threshold, the process is regarded as trace rainfall and is not considered.

4. The machine learning-based local rainfall pattern analysis method according to claim 1, characterized in that: the rainfall events in step 3) are standardized as follows:

$$P'_i = \frac{\sum_{j=1}^{i} P_j}{\sum_{j=1}^{n} P_j}, \qquad i = 1, 2, \ldots, n$$

where n is the length of the rainfall event, $P'_i$ is a point of the standardized rainfall sequence, and $P_j$ is a point of the original rainfall sequence.

5. The machine learning-based local rainfall pattern analysis method of claim 1, characterized in that: the specific method for accelerating the calculation of the matrix with GPU parallel computing in step 4-2 of step 4) is as follows: one thread is assigned to each matrix element and is responsible for calculating its DTW distance; first, the number of threads per thread block is assigned: for a two-dimensional matrix operation, each thread block is given (tb, tb) threads, where $tb^2$ must be smaller than the maximum number of threads per thread block allowed by the GPU; second, the thread blocks of the thread grid are assigned: for a two-dimensional matrix operation, the thread grid may be given (bg, bg) thread blocks, where bg is chosen so that $bg \times tb \ge N$, N is the number of samples in the rainfall event sample subset, and bg is smaller than the maximum number of thread blocks per thread grid allowed by the GPU; after thread allocation is completed, the GPU calculates the DTW distance of every matrix element and returns the results to CPU memory, thereby completing the calculation of the distance matrix.