CN112966722A - Regional landslide susceptibility prediction method based on semi-supervised random forest model

Info

Publication number: CN112966722A
Application number: CN202110168854.0A
Authority: CN
Inventors: 黄发明; 潘李含; 李文彬; 陶思玉
Original assignee: Nanchang University
Current assignee: Nanchang University
Priority date: 2021-02-07
Filing date: 2021-02-07
Publication date: 2021-06-15

Abstract

The invention relates to a regional landslide susceptibility prediction method based on a semi-supervised random forest model, comprising the following steps: S1: screening known landslide samples by spatially analyzing the landslide records and related control factors of the research area; S2: determining the control factors that best represent the landslide development characteristics based on frequency ratio (FR) and correlation analysis, and establishing a random forest model; S3: predicting an initial landslide susceptibility value with the fully supervised machine learning model, namely the random forest model, based on the FR values of the control factors, the known landslide grid units and the randomly selected non-landslide grid units, and expressing it in the five landslide susceptibility grades of step S2; S4: expanding the known landslide samples; S5: randomly selecting grid units from the very low susceptibility region as non-landslide samples; S6: establishing a semi-supervised random forest model. The method further improves landslide susceptibility prediction modeling performance relative to fully supervised machine learning.

Description

Regional landslide susceptibility prediction method based on semi-supervised random forest model

Technical Field

The invention relates to the technical field of geological disaster prediction, in particular to a regional landslide susceptibility prediction method based on a semi-supervised random forest model.

Background

Under the influence of seasonal heavy rainfall and large-scale engineering construction, many landslides occur in the mountainous areas of China every year, often causing serious losses to the safety of local residents, buildings and facilities, and the environment. Landslide susceptibility research predicts the spatial probability that potential landslides will occur in a specific area. It is therefore necessary to strengthen research on regional landslide susceptibility prediction in order to guide disaster prevention and mitigation in landslide-prone areas.

Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory and the theory of complex algorithms. It uses the computer as a tool, aims to simulate human learning in real time, and improves learning efficiency by organizing existing content into knowledge structures.^[1]

Machine learning has several definitions:

(1) Machine learning is a science of artificial intelligence; the main research object of the field is artificial intelligence, in particular how to improve the performance of specific algorithms through learning from experience.

(2) Machine learning is a study of computer algorithms that can be automatically improved through experience.

(3) Machine learning is the use of data or past experience to optimize the performance criteria of a computer program.

At present, machine learning is widely used for landslide susceptibility prediction. Models are trained and tested mainly on landslide and non-landslide samples together with their control factors in order to calculate susceptibility. Machine learning is considered to have better nonlinear prediction capability than mathematical statistical models and can therefore predict landslide susceptibility more accurately. According to whether known sample data are used as the model output variable, machine learning models can be classified into two broad categories: unsupervised and fully supervised machine learning.

Although supervised and unsupervised machine learning have achieved a range of results in predicting landslide susceptibility, several deficiencies remain. On the one hand, unsupervised machine learning does not need known landslide and non-landslide samples as model output variables during training and testing, but its modeling accuracy is difficult to guarantee because it lacks the guidance of prior knowledge about landslides and non-landslides. On the other hand, landslide susceptibility prediction modeling based on fully supervised machine learning also has defects, mainly the following: 1) field investigation and the acquisition of landslide sample data are difficult and costly. In a large research area it is generally difficult to obtain all landslide samples, so the set of known landslide samples needs to be expanded; 2) in the modeling process, randomly selecting non-landslide samples from the whole research area introduces substantial error into the training and testing of fully supervised machine learning and reduces the accuracy of landslide susceptibility prediction.

A random forest is a classifier that uses multiple trees to train on and predict samples. It is a flexible, easy-to-use machine learning algorithm that builds multiple decision trees and merges them to obtain a more accurate and stable prediction. The invention therefore provides a regional landslide susceptibility prediction method based on a semi-supervised random forest model.

Disclosure of Invention

To overcome the deficiencies of the prior art, the invention provides a regional landslide susceptibility prediction method based on a semi-supervised random forest model.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A regional landslide susceptibility prediction method based on a semi-supervised random forest model comprises the following steps:

S1: known landslide samples are screened out by managing and spatially analyzing the landslide records and related control factors of the research area on the RS and ArcGIS platforms, wherein the control factors fall into four categories: landform, basic geology, hydrological environment and ground cover;

S2: the control factors that best represent the landslide development characteristics are determined based on frequency ratio (FR) and correlation analysis, and a random forest model is established:

in ArcGIS software, the polygon files of the known landslide records are converted into grid units, and an equal number of non-landslide grid units are randomly selected from the non-landslide area of the research area, together forming the training and testing data set of the model; the data set is then randomly divided into two parts, with 70% used for training and the remaining 30% used for testing; during training and testing of the random forest model, the known landslide grid units (positive samples) are labeled 1 and the randomly selected non-landslide grid units (negative samples) are labeled 0; the output variable of the random forest model is a probability value between 0 and 1 for each grid unit, and the distribution of these probability values reflects the distribution pattern of regional landslide susceptibility; the trained and tested random forest model is used to predict the initial landslide susceptibility value of the whole research area, and the research area is then divided into 5 landslide susceptibility grades by the natural breaks classification method in ArcGIS software in combination with the landslide susceptibility distribution pattern: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region;

S3: an initial landslide susceptibility value is predicted with the fully supervised machine learning model, namely the random forest model, based on the FR values of the control factors, the known landslide grid units and the randomly selected non-landslide grid units, and is expressed in the five landslide susceptibility grades of step S2;

S4: the high-resolution remote sensing image is overlaid on the initial landslide susceptibility map; within the initial very high susceptibility region, areas with a very high probability of landslide occurrence are delineated in ArcGIS using the shape, size, tone and structural characteristics of landslide bodies on the image together with the remote sensing interpretation marks established from field survey results; grid units are then randomly selected within these areas as "potential landslides" to expand the known landslide samples; the expanded samples and the known landslide samples of step S1 together constitute the landslide samples;

S5: grid units are randomly selected from the 1-very low susceptibility region as non-landslide samples;

S6: a semi-supervised random forest model is established: the expanded landslide samples and the accurately selected non-landslide samples obtained in steps S4 and S5 are imported into the random forest model again for training and testing, which completes the semi-supervised random forest model, and the final landslide susceptibility prediction is made, specifically as follows:

S61: the initial landslide susceptibility value obtained in step S21 is imported into ArcGIS software to generate an initial landslide susceptibility map, and the 5-very high susceptibility region of the research area is obtained by the natural breaks classification method; regional landslide remote sensing interpretation marks are then established by analyzing the shape, size, tone and structural characteristics of slope bodies on the image together with the field survey results; potential landslide points in the 5-very high susceptibility region are interpreted according to these marks; the expanded landslide samples are labeled 1 and the accurately selected non-landslide samples are labeled 0; finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model;

S62: the semi-supervised random forest model obtained from the second round of training and testing is used to predict the landslide susceptibility of the research area; the predicted landslide susceptibility values are likewise imported into ArcGIS software and divided into 5 grades by the natural breaks classification method: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region.

In step S1, the four major categories of control factors of the research area (landform, hydrological environment, stratigraphic lithology and surface cover) are acquired from the basic geological data of the research area by means of the ArcGIS platform and visual interpretation of remote sensing images.

The frequency ratio method of step S2 is an efficient quantitative analysis method. For a given class interval of a control factor, the frequency ratio is calculated as:

FR = (number of landslide grid units in the interval / total number of landslide grid units) / (number of grid units in the interval / total number of grid units in the research area)

FR > 1 indicates that the attribute interval of the control factor is favorable for landslide development, and FR < 1 indicates that it is unfavorable for landslide development.
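As an illustration of the frequency ratio idea, the following minimal Python sketch computes FR per class interval of one control factor. It assumes the standard frequency ratio definition (share of landslide cells in an interval divided by share of all cells in that interval); the function name, the input layout and the toy data are hypothetical and are not taken from the patent.

```python
from collections import Counter

def frequency_ratio(factor_classes, is_landslide):
    """Return {class_interval: FR} for one control factor.

    factor_classes: class-interval label of each grid cell (hypothetical input)
    is_landslide:   1 for a known landslide cell, 0 otherwise
    """
    total_cells = len(factor_classes)
    total_slides = sum(is_landslide)
    cells_per_class = Counter(factor_classes)
    slides_per_class = Counter(
        c for c, y in zip(factor_classes, is_landslide) if y == 1
    )
    fr = {}
    for cls, n_cells in cells_per_class.items():
        slide_share = slides_per_class[cls] / total_slides  # landslide share
        area_share = n_cells / total_cells                  # area share
        fr[cls] = slide_share / area_share
    return fr

# Toy data: 10 grid cells in two made-up slope classes.
classes = ["steep"] * 4 + ["gentle"] * 6
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
fr = frequency_ratio(classes, labels)
print(fr)
```

In this toy case the "steep" class holds 3 of the 4 landslide cells but only 4 of the 10 cells, so its FR is greater than 1, while the "gentle" class has FR less than 1.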

In step S2, the correlation coefficients between the control factors are calculated by correlation analysis in SPSS23 software, using the Pearson correlation coefficient with a significance test. First, the significance is examined: a significance value less than 0.05 indicates a linear relationship between two control factors. Then the correlation coefficient is examined: an absolute value greater than 0.8 indicates an extremely strong correlation, 0.6-0.8 a strong correlation, 0.4-0.6 a medium correlation, 0.2-0.4 a weak correlation, and less than 0.2 essentially no correlation. If the results show that the correlations among the control factors are weak, the control factors can be used as input variables of the model.
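The correlation screening described above can be sketched in plain Python as follows. This is only an illustration: the patent performs the analysis in SPSS23, the significance test is omitted here, and the handling of values that fall exactly on a threshold (0.2, 0.4, 0.6, 0.8) is an assumption, since the text does not specify it. The function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_strength(r):
    """Map |r| to the strength bands listed in the text.

    Boundary values are assigned to the higher band (an assumption).
    """
    a = abs(r)
    if a > 0.8:
        return "extremely strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "medium"
    if a >= 0.2:
        return "weak"
    return "irrelevant"

# Toy values for two made-up control factors over five grid cells.
slope = [1, 2, 3, 4, 5]
ndvi = [0.3, 0.7, 0.2, 0.6, 0.4]
print(correlation_strength(pearson_r(slope, ndvi)))
```

A pair of factors falling in the "strong" or "extremely strong" band would be a candidate for dropping one of the two before modeling.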

In step S3, the known landslide grid units and non-landslide grid units randomly selected at a 1:1 ratio from the non-landslide area of the research area are combined into the model training and testing data set, which is then randomly divided into two parts: 70% of the data set is used for training and the remaining 30% is used for testing.
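A minimal sketch of this sample assembly, assuming grid units are identified by integer IDs (a hypothetical representation) and using the count of 2598 landslide grid units reported in the embodiment, could look like this. The function name and the candidate pool are illustrative only.

```python
import random

def build_and_split(landslide_cells, candidate_non_landslide,
                    train_frac=0.7, seed=0):
    """Pair landslide cells 1:1 with random non-landslide cells, then split."""
    rng = random.Random(seed)
    # Draw as many non-landslide cells as there are landslide cells (1:1).
    negatives = rng.sample(candidate_non_landslide, len(landslide_cells))
    samples = [(c, 1) for c in landslide_cells] + [(c, 0) for c in negatives]
    rng.shuffle(samples)
    n_train = round(train_frac * len(samples))
    return samples[:n_train], samples[n_train:]

landslide_ids = list(range(2598))                # 2598 known landslide cells
non_landslide_ids = list(range(10_000, 60_000))  # made-up candidate pool
train, test = build_and_split(landslide_ids, non_landslide_ids)
print(len(train), len(test))
```

With 5196 samples in total, the 70/30 split gives 3637 training and 1559 testing samples after rounding.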

The grading uses the natural breaks classification method to divide the values between 0 and 1 into 5 grades. The classes are not equally spaced; instead, the class boundaries are identified from the natural groupings inherent in the data.

In step S4, after the initial susceptibility map is obtained, the 5-very high susceptibility region of the research area is extracted; regional landslide remote sensing interpretation marks are then established by analyzing characteristics such as the form and tone of historical landslides in the research area; finally, potential landslide points in the 5-very high susceptibility region are interpreted according to these marks.

In step S6, the expanded landslide samples obtained in steps S4 and S5 are labeled 1 and the accurately selected non-landslide samples are labeled 0; they are imported into the random forest model again, and the data are divided into training and testing sets at a ratio of 7:3 for training and testing.
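To make the natural breaks idea concrete, the sketch below finds class boundaries that minimize the total within-class sum of squared deviations, using a Fisher-Jenks style dynamic program. This is an illustrative reimplementation, not the ArcGIS routine used in the patent; it runs in quadratic time per class and is meant for small samples only. The function names and the toy probabilities are hypothetical.

```python
from bisect import bisect_left

def natural_breaks(values, n_classes=5):
    """Return the upper bound of each class (ascending)."""
    data = sorted(values)
    n = len(data)
    if n < n_classes:
        raise ValueError("need at least as many values as classes")
    # Prefix sums give the squared-deviation cost of data[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):  # sum of squared deviations of data[i:j], with j > i
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting data[:j] into k classes
    best = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    cut = [[0] * (n + 1) for _ in range(n_classes + 1)]
    best[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i
    # Walk back through the stored cut points to recover class upper bounds.
    bounds = []
    j = n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = cut[k][j]
    return bounds[::-1]

def grade(p, bounds):
    """Grade 1..len(bounds) for probability p given class upper bounds."""
    return min(bisect_left(bounds, p) + 1, len(bounds))

probs = [0.02, 0.03, 0.05, 0.21, 0.22, 0.24,
         0.48, 0.50, 0.71, 0.73, 0.95, 0.97]
bounds = natural_breaks(probs)
print(bounds, [grade(p, bounds) for p in probs])
```

Because the toy values form five well-separated groups, the boundaries fall at the top of each group rather than at equal intervals of 0.2.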

The beneficial effects of the invention are as follows. Fully supervised machine learning randomly selects non-landslide grid units as output variables for model training and testing; because these non-landslide samples are uncertain, substantial error enters the training and testing process and the modeling accuracy of fully supervised machine learning is reduced. In the semi-supervised modeling process, on the one hand, non-landslide samples with very high reliability are selected from the very low susceptibility region, which reduces the error in the training and testing data sets and improves the modeling accuracy; on the other hand, the number of known landslide records is expanded by screening "potential landslides" with a very high probability of occurrence, so that the training and testing samples of semi-supervised machine learning are more widely representative and the trained model more accurately reflects the nonlinear functional relationship between landslides and their control factors. In summary, semi-supervised machine learning makes good use of and expands the existing labeled landslide samples to guide the landslide versus non-landslide classification process, and further improves landslide susceptibility prediction modeling performance relative to fully supervised machine learning.

Description of the drawings:

FIG. 1 is a flow chart of a regional landslide susceptibility prediction method based on a semi-supervised random forest model.

Detailed Description

The invention provides a regional landslide susceptibility prediction method based on a semi-supervised random forest model, which comprises the following steps:

S1: the landslide records and related control factors of the research area are managed and spatially analyzed on the RS (remote sensing) and ArcGIS (geographic information system) platforms to obtain known landslide samples, wherein the control factors are at least one of landform, basic geology, hydrological environment and surface cover data;

The quality of the landslide record data has a very important influence on susceptibility prediction performance for a research area. Landslide records help establish the location, movement type, number of triggering events, scale and related geological environment of each landslide.

In the landslide susceptibility prediction process, representative control factors such as landform, basic geology, hydrological environment and surface cover are selected for susceptibility prediction according to the landslide development characteristics and influencing factors of the research area and its natural geographic characteristics.

S2: the control factors that best represent the landslide development characteristics are determined based on frequency ratio (FR) and correlation analysis, and a random forest model is established: in ArcGIS software, the polygon files of the known landslide records are converted into grid units, and an equal number of non-landslide grid units are randomly selected from the non-landslide area of the research area, together forming the training and testing data set of the model; the data set is then randomly divided into two parts, with 70% used for training and the remaining 30% used for testing. During training and testing of the random forest model, the known landslide grid units (positive samples) are labeled 1 and the randomly selected non-landslide grid units (negative samples) are labeled 0; the output variable of the random forest model is a probability value between 0 and 1 for each grid unit, and the distribution of these probability values reflects the distribution pattern of regional landslide susceptibility; the trained and tested random forest model is used to predict the initial landslide susceptibility value of the whole research area, and the research area is then divided into 5 landslide susceptibility grades by the natural breaks classification method in ArcGIS software in combination with the landslide susceptibility distribution pattern: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region.

S3: an initial landslide susceptibility value is predicted with the fully supervised machine learning model, namely the random forest model, based on the FR values of the control factors, the known landslide grid units and the randomly selected non-landslide grid units;

S4: the high-resolution remote sensing image is overlaid on the initial landslide susceptibility map; within the initial very high susceptibility region, areas with a very high probability of landslide occurrence are delineated in ArcGIS by manual visual interpretation, using the shape, size, tone and structural characteristics of landslide bodies on the image together with the remote sensing interpretation marks established from field survey results; grid units are then randomly selected within these areas as "potential landslides" to expand the known landslide samples; the expanded samples and the known landslide samples of step S1 together constitute the landslide samples;

S5: meanwhile, grid units are randomly selected from the 1-very low susceptibility region as non-landslide samples;

S6: the expanded landslide samples and the accurately selected non-landslide samples obtained in steps S4 and S5 are imported into the random forest model again for training and testing, which completes the semi-supervised random forest model, and the final landslide susceptibility prediction is made:

S61: the semi-supervised random forest model is established: the initial landslide susceptibility value obtained in step S21 is imported into ArcGIS software to generate an initial landslide susceptibility map, and the 5-very high susceptibility region of the research area is obtained by the natural breaks classification method; regional landslide remote sensing interpretation marks are then established by analyzing the shape, size, tone and structural characteristics of slope bodies on the image together with the field survey results; potential landslide points in the 5-very high susceptibility region are interpreted according to these marks; the expanded landslide samples are labeled 1 and the accurately selected non-landslide samples are labeled 0; finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model.

S62: the semi-supervised random forest model obtained from the second round of training and testing is used to predict the landslide susceptibility of the research area; the predicted landslide susceptibility values are likewise imported into ArcGIS software and divided into 5 grades by the natural breaks classification method: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region;

The random forest model is a classifier comprising multiple decision trees. Its main idea is to draw samples with replacement to construct different training sets and to randomly select different feature subsets, so that the generated classification trees are diverse. Because an ensemble of different classification trees reflects the actual result more comprehensively than a single tree, the prediction capability of the model is improved and overfitting is mitigated. In addition, many theoretical and applied studies in China and abroad have demonstrated the accuracy of the random forest model from different angles; it tolerates outliers and noise in data sets well and is widely regarded as one of the best machine learning models at present.

In addition, the random forest uses the out-of-bag (OOB) error to obtain an unbiased estimate of the generalization error, and the generalization error gradually converges as the number of trees increases. The OOB error also reflects the importance of each environmental factor variable: when only a single variable is changed, the magnitude of the resulting change in the OOB error determines the importance of that variable. The mean decrease in accuracy and the mean decrease in impurity are used to measure the importance of the input variables. Furthermore, the predictive performance of the model is controlled mainly by adjusting the number of trees and the number of features drawn.

One characteristic of the random forest is that it can give the relative weight of each environmental factor, derived from the Gini index. The optimal split in a random forest classification tree is measured by impurity, which is calculated with the Gini index. The decrease in the Gini index attributable to environmental factor k is calculated at each node split; these decreases are summed over all nodes in the forest and averaged over all trees, and this mean decrease is the importance of environmental factor k. The relative importance of an environmental factor is measured as the percentage of its mean decrease in the sum of the mean decreases of all environmental factors, which can be written as

P_k = \bar{D}_k / \sum_{j} \bar{D}_j \times 100\%

where \bar{D}_k denotes the mean decrease in the Gini index for environmental factor k and the sum runs over all environmental factors.
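The percentage normalization described above is simple enough to sketch directly. In the following Python snippet the function name and the mean-decrease values are made up for illustration; they are not results from the patent.

```python
def relative_importance(mean_decrease):
    """Convert per-factor mean decreases into percentage weights."""
    total = sum(mean_decrease.values())
    return {name: 100.0 * value / total
            for name, value in mean_decrease.items()}

# Hypothetical mean decreases in Gini index for three control factors.
scores = {"slope": 30.0, "distance_to_river": 20.0, "plan_curvature": 10.0}
weights = relative_importance(scores)
print(weights)
```

The resulting weights sum to 100%, so they can be compared across factors regardless of the absolute scale of the impurity decreases.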

The classification performance of a random forest depends on the correlation between any two trees in the forest and on the classification strength of each tree. Reducing the number m of features selected reduces both the correlation between trees and the classification strength of each tree; increasing m increases both. The key step in constructing a random forest is therefore choosing the optimal m, which is done mainly by calculating the out-of-bag error.

One of the most important advantages of the random forest is that neither cross-validation nor a separate test set is needed to obtain an unbiased estimate of the error; the model builds this estimate internally while the decision trees are generated.

The method mainly uses the RF function package in the R language for landslide susceptibility prediction modeling. The accuracy of the random forest model is controlled mainly by the number of factor features and the number of trees, and the optimal random forest parameters are obtained mainly by automatic screening on the number of factor features and the out-of-bag error. The invention obtains the importance of each environmental factor mainly from the mean decrease in accuracy and the mean decrease in impurity in the random forest model.

Landslide susceptibility refers to the spatial probability of landslide occurrence in a region; the spatial locations where future landslide events are likely to occur are predicted from environmental conditions similar to those under which past landslides occurred. Selecting appropriate control factors for the research area is important for accurate and reliable landslide susceptibility prediction.

In the landslide susceptibility prediction process, four categories of representative control factors (landform, basic geology, hydrological environment and surface cover) are selected for susceptibility prediction according to the landslide development characteristics and influencing factors of the research area and its natural geographic characteristics.

In a specific implementation, Nankang District of Ganzhou City, Jiangxi Province is taken as an example. According to the landslide development characteristics and influencing factors of the area, its natural geographic characteristics, the calculated frequency ratios between landslides and their control factors, and the difficulty of obtaining each control factor, a total of 11 control factors are selected: landform (elevation, slope, aspect, plan curvature, profile curvature and topographic relief), engineering geology (lithology), hydrological environment (modified normalized difference water index (MNDWI) and distance to the river system) and surface cover (normalized difference built-up index (NDBI) and normalized difference vegetation index (NDVI)).

(1) Landform and engineering geological factors

The selected topographic factors (elevation, slope, aspect, profile curvature, plan curvature and topographic relief) are all derived from the elevation data through spatial analysis in ArcGIS software. Slope is an important factor promoting landslides: it directly affects the shear stress that destabilizes a slope, and landslide occurrence is directly related to slope angle. Aspect reveals differences in soil moisture distribution and vegetation coverage in different directions. Profile curvature and plan curvature reflect the influence of the terrain on water flow velocity and flow convergence, respectively. Topographic relief is expressed as the difference between the highest and lowest elevations in the Nankang area and is commonly used to quantitatively characterize the regional landform at a macroscopic scale. Differences in lithological physical properties represent differences in the permeability, matric suction, shear strength and other properties of the rock and soil mass.

(2) Hydrological environment and surface cover factors

The hydrological environment factors influence landslide susceptibility by controlling the local movement of surface water and groundwater in the slope body. The influence of the hydrological environment on landslide development is represented by the distance from each grid unit in the area to the river system and by the modified normalized difference water index (MNDWI). The surface cover factors are characterized here mainly by the normalized difference built-up index (NDBI) and the normalized difference vegetation index (NDVI). NDBI reflects the distribution of residential building land and indirectly indicates where the activities of local residents are concentrated. NDVI characterizes the density of surface vegetation, and high vegetation coverage can inhibit landslide development.

In specific implementation, all control factors are converted into a grid format by utilizing ArcGIS software, and the resolution of the remote sensing image and the control factors is 30 m.

The frequency ratio of each class interval of each control factor is obtained according to the frequency ratio formula of step S2. FR > 1 indicates that the interval of the control factor is favorable for landslide development, and a larger FR indicates a greater effect on landslide development.

Before susceptibility prediction modeling, the independence of the control factors needs to be verified to avoid redundant information. Collinearity among the control factors is assessed by correlation analysis in SPSS23 software; the results show that the correlations among the control factors are weak, so they can be used as input variables for machine learning.

In a specific implementation, the FR values of the 11 selected control factors are normalized to the range [0, 1] and then used as the input variables of the random forest model; the same control factors are also the input variables of the semi-supervised random forest model. The polygon files of the 233 landslides that have occurred in Nankang District are converted into grid units in ArcGIS, giving 2598 landslide grid units in total. These 2598 labeled landslide grid units and 2598 non-landslide grid units randomly selected from the non-landslide area of Nankang form the model training and testing data set, which is randomly divided into two parts: 70% for training and the remaining 30% for testing. During training and testing of the random forest model, the susceptibility labels of the known landslide grid units and the randomly selected non-landslide grid units are set to 1 and 0, respectively.

In this implementation, the RF function package in the R language is used for landslide susceptibility prediction modeling. The accuracy of the random forest model is controlled mainly by the number of factor features and the number of trees, and the optimal random forest parameters are obtained mainly by automatic screening on the number of control factor features and the out-of-bag error. For the first random forest model, the number of factor features is 2 and the number of trees is 600; at around 600 trees the model error is essentially stable.
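For readers who want to experiment with a comparable setup, the sketch below trains a random forest with 600 trees and 2 features per split and reads off the out-of-bag error. It uses scikit-learn in Python as a stand-in for the R package named in the text, and the input data are entirely synthetic; only the counts of 11 factors, 600 trees and 2 features come from the embodiment. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_factors = 600, 11            # 11 control factors, as in the text
X = rng.random((n_cells, n_factors))    # synthetic stand-in for normalized FR
# Synthetic labels: the first two factors are made informative on purpose.
y = (X[:, 0] + X[:, 1] + 0.2 * rng.standard_normal(n_cells) > 1.0).astype(int)

# 70% training, 30% testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(
    n_estimators=600,   # number of trees reported for the first model
    max_features=2,     # number of factor features tried at each split
    oob_score=True,     # keep the out-of-bag accuracy estimate
    random_state=0,
)
model.fit(X_train, y_train)

oob_error = 1.0 - model.oob_score_
test_accuracy = model.score(X_test, y_test)
susceptibility = model.predict_proba(X)[:, 1]  # probability of label 1
importance = model.feature_importances_        # impurity-based importance
print(round(oob_error, 3), round(test_accuracy, 3))
```

Repeating the fit over several values of `max_features` and `n_estimators` and comparing `oob_error` is one way to mimic the out-of-bag parameter screening the text describes.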

And then, acquiring the importance of each control factor through the average reduction value of the accuracy and the average reduction value of the impurity degree in the random forest model. And the more important control factors are the gradient, the distance from the water system, the plane curvature, the topographic relief degree, the elevation and the like.

Next, the initial landslide susceptibility values predicted for the whole Nankang area by the trained random forest model are imported into ArcGIS and converted into a raster file. Using the natural breaks classification method, combined with the distribution of the susceptibility values, the Nankang area is divided into 5 landslide susceptibility grades: 1-very low susceptibility region (38.7%), 2-low susceptibility region (23.2%), 3-medium susceptibility region (16.7%), 4-high susceptibility region (12.3%) and 5-very high susceptibility region (9.1%).
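The grading step can be illustrated with a small natural breaks routine. The sketch below is an exact dynamic-programming version (Fisher-Jenks style) written in Python as a stand-in for the ArcGIS classifier, whose internal implementation is not described in the patent; the function names and sample values are hypothetical, and the quadratic running time makes it suitable only for small samples.

```python
def natural_breaks(values, k):
    """Return the upper bound of each of k classes, chosen to minimise the
    total within-class sum of squared deviations. Requires len(values) >= k."""
    data = sorted(values)
    n = len(data)
    # Prefix sums allow the within-class sum of squares in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] from its mean (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting data[:j] into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    back[c][j] = i
    # Walk the back-pointers to recover each class's upper bound.
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(data[j - 1])
        j = back[c][j]
    return bounds[::-1]


def classify(value, bounds):
    """Return the 1-based grade of `value` given class upper bounds."""
    for grade, upper in enumerate(bounds, start=1):
        if value <= upper:
            return grade
    return len(bounds)


def grade_shares(values, bounds):
    """Percentage of values falling in each grade."""
    counts = [0] * len(bounds)
    for v in values:
        counts[classify(v, bounds) - 1] += 1
    return [100.0 * c / len(values) for c in counts]


# Hypothetical susceptibility values for seven grid cells, three grades.
sample = [0.01, 0.02, 0.03, 0.5, 0.52, 0.9, 0.95]
breaks = natural_breaks(sample, 3)
print(breaks, grade_shares(sample, breaks))
```

Calling `natural_breaks` with `k=5` on the predicted susceptibility values would give five grades of the kind listed above, and `grade_shares` corresponds to the area percentages reported for each grade.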

In the modeling process of the semi-supervised random forest, 520 grid cells with a high probability of being potential landslides are first identified in the 5-very high susceptibility region through interpretation of high-resolution remote sensing images. These cells amount to 20% of the known landslide grid cells, which effectively increases the number of landslide grid cells while avoiding the loss of reliability that too many added cells would cause. The 520 potential landslide grid cells are added to the 2598 known landslide grid cells to form 3118 known landslide grid cells, whose labels are set to 1. Then, 3118 grid cells that are very likely to be non-landslide are randomly selected from the 1-very low susceptibility region of the initial susceptibility map, and their labels are set to 0. Finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%), which are used to train and test the random forest model again.
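The sample expansion described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the cell identifiers and the `grade_by_cell` mapping (cell to initial susceptibility grade 1 to 5) are hypothetical, and the remote sensing interpretation that yields the extra landslide cells is represented only by its result.

```python
import random


def expand_samples(known_landslide, interpreted_landslide, grade_by_cell, seed=0):
    """Return (positives, negatives) for the second round of training.

    Positives: known landslide cells plus cells interpreted as potential
    landslides in the very high susceptibility grade (label 1).
    Negatives: an equal number of cells drawn at random from grade 1,
    the very low susceptibility region (label 0).
    """
    positives = sorted(set(known_landslide) | set(interpreted_landslide))
    positive_set = set(positives)
    very_low = sorted(
        cell for cell, grade in grade_by_cell.items()
        if grade == 1 and cell not in positive_set
    )
    rng = random.Random(seed)
    negatives = rng.sample(very_low, len(positives))
    return positives, negatives


# Hypothetical toy grid: cells 0-11 in grade 5, 100-149 in grade 1, 200-219 in grade 3.
grades = {cell: 5 for cell in range(12)}
grades.update({cell: 1 for cell in range(100, 150)})
grades.update({cell: 3 for cell in range(200, 220)})
positives, negatives = expand_samples(list(range(10)), [10, 11], grades)
print(len(positives), len(negatives))
```

The counts in the embodiment follow the same arithmetic: 20% of 2598 rounds to 520, and 2598 + 520 = 3118 cells of each label.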

The random forest model of the second round of training and testing also uses the RF package in the R language, with the optimal parameters again obtained by automatically screening the number of control-factor features against the out-of-bag error. For the second random forest model, the number of factor features is 2 and the number of trees is 400; at around 400 trees, the model error is essentially stable. In addition, to facilitate comparison between the models, the landslide susceptibility predicted by the semi-supervised random forest model is also divided into 5 grades by the natural breaks classification method: 1-very low susceptibility region (37.5%), 2-low susceptibility region (21.3%), 3-medium susceptibility region (13.6%), 4-high susceptibility region (12.9%) and 5-very high susceptibility region (14.7%).

Finally, the area under the receiver operating characteristic (ROC) curve, denoted AUC, is used to evaluate the accuracy of each of the two models. The AUC values of the random forest model and the semi-supervised random forest model are 0.899 and 0.974, respectively. This shows that semi-supervised machine learning substantially improves the susceptibility prediction accuracy of fully supervised machine learning, and further that expanding the known landslide samples and screening non-landslide samples accurately and effectively can greatly improve the susceptibility prediction performance of the machine learning model.
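For reference, AUC can be computed directly from test labels and predicted susceptibility values as the fraction of (landslide, non-landslide) pairs in which the landslide cell receives the higher score, with ties counted as one half. The Python sketch below uses that rank-based definition; the function name and the sample data are hypothetical and unrelated to the values reported above.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for landslide and 0 for non-landslide cells;
    `scores` holds the predicted susceptibility of each cell.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test labels and predicted susceptibility values.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of samples; for large test sets a sort-based rank computation gives the same value more efficiently.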

Claims

1. A regional landslide susceptibility prediction method based on a semi-supervised random forest model is characterized by comprising the following steps:

S1: managing and spatially analyzing the landslide inventory of the study area and the related control factors on RS and ArcGIS platforms, and screening out known landslide samples, wherein the control factors comprise four categories: topography, basic geology, hydrological environment and land cover;

S2: determining the control factors that best characterize landslide development based on frequency ratio and correlation analysis, wherein the frequency ratio is denoted FR, and establishing a random forest model:

2. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein step S1 comprises acquiring, from the basic geological data of the study area and by means of the ArcGIS platform and visual interpretation of remote sensing images, four major categories of control factors of the study area: topography, hydrological environment, stratigraphic lithology and land cover.

3. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein the frequency ratio method of step S2 is an efficient quantitative analysis method, and the frequency ratio calculation formula is:

4. The method as claimed in claim 1, wherein step S2 comprises calculating the correlation coefficients between the control factors by correlation analysis in SPSS23 software, selecting the Pearson correlation coefficient and testing its significance. The significance is examined first: if it is less than 0.05, a linear relationship exists between the two control factors. The correlation coefficient is then examined: an absolute value greater than 0.8 indicates an extremely strong correlation, 0.6-0.8 a strong correlation, 0.4-0.6 a medium correlation, 0.2-0.4 a weak correlation, and less than 0.2 a negligible correlation. If the results show that the correlations among the control factors are low, the control factors can be used as input variables of the model.

5. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein step S3 comprises combining the known landslide grid cells with an equal number (1:1) of non-landslide grid cells randomly selected from the non-landslide area of the study area to form a model training and testing data set, and randomly dividing the data set into two parts: 70% for training and the remaining 30% for testing.

6. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein the susceptibility values between 0 and 1 are divided into 5 grades by the natural breaks classification method, which does not use equal intervals but identifies the class intervals from the natural groupings inherent in the data.

7. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein step S4 comprises: after the initial susceptibility map is obtained, extracting the 5-very high susceptibility region of the study area; then establishing regional remote sensing interpretation keys for landslides by analyzing characteristics of historical landslides in the study area such as shape and tone; and finally interpreting potential landslide sites in the 5-very high susceptibility region according to the interpretation keys.

8. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein step S6 comprises labeling the landslide samples expanded in steps S4 and S5 as 1 and the accurately selected non-landslide samples as 0, importing them into the random forest model again, and dividing them into training and testing sets at a ratio of 7:3 for training and testing.