CN112966722A - Regional landslide susceptibility prediction method based on semi-supervised random forest model - Google Patents


Info

Publication number
CN112966722A
Authority
CN
China
Prior art keywords
landslide
susceptibility
random forest
forest model
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110168854.0A
Other languages
Chinese (zh)
Inventor
黄发明
潘李含
李文彬
陶思玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University
Priority to CN202110168854.0A
Publication of CN112966722A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Abstract

The invention relates to a regional landslide susceptibility prediction method based on a semi-supervised random forest model, comprising the following steps: S1: managing and spatially analyzing the landslide records and related control factors of a research area to screen out known landslide samples; S2: determining the control factors that best represent landslide development characteristics on the basis of frequency ratio (FR) and correlation analysis, and establishing a random forest model; S3: using fully supervised machine learning, namely the random forest model, to predict and output initial landslide susceptibility values from the FR values of the control factors, the known landslide grid cells and the randomly selected non-landslide grid cells, graded into the five landslide susceptibility classes of step S2; S4: expanding the known landslide samples; S5: randomly selecting grid cells from the very low susceptibility area as non-landslide samples; S6: establishing a semi-supervised random forest model. The method further improves landslide susceptibility prediction modeling performance beyond that of fully supervised machine learning.

Description

Regional landslide susceptibility prediction method based on semi-supervised random forest model
Technical Field
The invention relates to the technical field of geological disaster prediction, in particular to a regional landslide susceptibility prediction method based on a semi-supervised random forest model.
Background
Under the influence of seasonal heavy rainfall and large-scale engineering construction, many landslides occur in the mountainous areas of China every year, often causing serious losses to the safety of local residents, to buildings and facilities, and to the environment. Landslide susceptibility research aims to predict the spatial probability of potential landslides occurring in a specific area. It is therefore necessary to strengthen research on regional landslide susceptibility prediction in order to guide disaster prevention and mitigation in landslide-prone areas.
Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory and the theory of complex algorithms. It uses the computer as a tool, aims to simulate human learning in real time, and improves learning efficiency by organizing existing content into knowledge structures.[1]
Machine learning has several definitions:
(1) Machine learning is a science of artificial intelligence; the main object of research in the field is artificial intelligence, in particular how to improve the performance of specific algorithms through empirical learning.
(2) Machine learning is a study of computer algorithms that can be automatically improved through experience.
(3) Machine learning is the use of data or past experience to optimize the performance criteria of a computer program.
At present, machine learning is widely used for landslide susceptibility prediction. Model training and testing are carried out mainly with landslide and non-landslide samples, their control factors and other data in order to calculate susceptibility. Machine learning is considered to have better nonlinear prediction capability than mathematical statistical models and can therefore predict landslide susceptibility more accurately. Machine learning models can be divided into two broad categories, unsupervised and fully supervised, according to whether known sample data are used as the model output variable.
Although supervised and unsupervised machine learning have achieved a series of results in landslide susceptibility prediction, several deficiencies remain. On the one hand, unsupervised machine learning does not need known landslide and non-landslide samples as model output variables during training and testing, but its modeling accuracy is difficult to guarantee because it lacks the guidance of prior knowledge such as known landslides and non-landslides. On the other hand, landslide susceptibility modeling based on fully supervised machine learning also has shortcomings, mainly the following: 1) Field investigation and the acquisition of landslide sample data are difficult and costly. In a large research area it is generally difficult to obtain all landslide samples, so the known landslide samples need to be expanded. 2) In the modeling process, randomly selecting non-landslide samples from the whole research area introduces substantial error into the training and testing of fully supervised machine learning and reduces the accuracy of landslide susceptibility prediction.
A random forest is a classifier that trains on and predicts samples using multiple trees. It is a flexible, easy-to-use machine learning algorithm that builds multiple decision trees and merges them to obtain a more accurate and stable prediction. The invention therefore provides a regional landslide susceptibility prediction method based on a semi-supervised random forest model.
Disclosure of Invention
To remedy the deficiencies of the prior art, a regional landslide susceptibility prediction method based on a semi-supervised random forest model is provided.
To achieve this purpose, the technical scheme of the invention is as follows:
A regional landslide susceptibility prediction method based on a semi-supervised random forest model comprises the following steps:
S1: Known landslide samples are screened out by managing and spatially analyzing the landslide records and related control factors of the research area on the RS (remote sensing) and ArcGIS platforms, wherein the control factors fall into four categories: landform, basic geology, hydrological environment and surface cover;
S2: The control factors that best represent landslide development characteristics are determined on the basis of frequency ratio (denoted FR) and correlation analysis, and a random forest model is established:
In ArcGIS software, the polygon files of the known landslide records are converted into grid cells, and an equal number of non-landslide grid cells is randomly selected from the non-landslide area of the research area to form the training and testing data set of the model, which is further randomly divided into two parts: 70% of the data set is used for training and the remaining 30% for testing; during training and testing of the random forest model, the known landslide grid cells (positive samples) are labeled 1 and the randomly selected non-landslide grid cells (negative samples) are labeled 0; the output variable of the random forest model is a probability value between 0 and 1 for each grid cell, and the distribution of these values reflects the distribution pattern of regional landslide susceptibility; the trained and tested random forest model is used to predict the initial landslide susceptibility values of the whole research area, and the research area is then divided into 5 landslide susceptibility grades with the natural breaks classification method in ArcGIS software, taking the susceptibility distribution pattern into account: 1-very low susceptibility area, 2-low susceptibility area, 3-medium susceptibility area, 4-high susceptibility area and 5-very high susceptibility area;
S3: Based on the FR values of the control factors, the known landslide grid cells and the randomly selected non-landslide grid cells, fully supervised machine learning, namely the random forest model, predicts and outputs the initial landslide susceptibility values, graded according to the five landslide susceptibility grades of step S2;
S4: The high-resolution remote sensing image is overlaid on the initial landslide susceptibility map. Within the initial very high susceptibility area, areas with a very high probability of landslide occurrence are delineated in ArcGIS using the shape, size, tone and structural characteristics of landslide bodies on the image together with the remote sensing interpretation marks established from field survey results, and grid cells are randomly selected within these areas as "potential landslides" to expand the known landslide samples; the expanded samples and the known landslide samples of step S1 together form the landslide samples;
S5: Grid cells are randomly selected from the 1-very low susceptibility area as non-landslide samples;
S6: A semi-supervised random forest model is established: the expanded landslide samples and the accurately selected non-landslide samples obtained in steps S4 and S5 are imported into the random forest model again for training and testing, which constructs the semi-supervised random forest model used for the final landslide susceptibility prediction, specifically as follows:
S61: The initial landslide susceptibility values obtained in step S21 are imported into ArcGIS software to generate an initial landslide susceptibility map, and the 5-very high susceptibility area of the research area is obtained with the natural breaks classification method; regional landslide remote sensing interpretation marks are then established by analyzing the shape, size, tone and structural characteristics of slope bodies on the image together with the field investigation results; potential landslide points in the 5-very high susceptibility area are then interpreted according to these marks; the expanded landslide samples are labeled 1 and the accurately selected non-landslide samples are labeled 0; finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model.
S62: The semi-supervised random forest model obtained from the second round of training and testing is used to predict the landslide susceptibility of the research area; the predicted susceptibility values are likewise imported into ArcGIS software and divided by the natural breaks classification method into 5 grades: 1-very low susceptibility area, 2-low susceptibility area, 3-medium susceptibility area, 4-high susceptibility area and 5-very high susceptibility area.
In step S1, the four categories of control factors of the research area (landform, hydrological environment, stratigraphic lithology and surface cover) are acquired from the basic geological data of the research area on the ArcGIS platform and by visual interpretation of remote sensing images.
The frequency ratio method of step S2 is an efficient quantitative analysis method; the frequency ratio is calculated by the following formula:
[Formula image BDA0002938282790000031: frequency ratio (FR) calculation formula]
FR >1 indicates that the attribute within a certain interval of the control factor is favorable for landslide development, and FR <1 indicates that the attribute within a certain interval of the control factor is unfavorable for landslide development.
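The formula itself survives only as an image reference in this text. As a hedged reconstruction, the frequency ratio is commonly defined in landslide susceptibility work as the share of landslide cells falling in an attribute interval divided by the share of all cells falling in that interval. The symbols $N_{ij}$, $N$, $A_{ij}$ and $A$ below are assumed notation, not necessarily the patent's own:

```latex
% Assumed notation (not taken from the patent's formula image):
%   N_{ij} : number of landslide grid cells in interval j of control factor i
%   N      : total number of landslide grid cells in the research area
%   A_{ij} : number of grid cells in interval j of control factor i
%   A      : total number of grid cells in the research area
FR_{ij} = \frac{N_{ij} / N}{A_{ij} / A}
```

Under this definition FR = 1 means the interval holds landslides in exact proportion to its area, which is consistent with the FR > 1 and FR < 1 interpretation given in the text.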
In step S2, the correlation coefficients between the control factors are calculated by correlation analysis in SPSS 23 software, using the Pearson correlation coefficient with a significance test. The significance is examined first: a value below 0.05 indicates a linear relationship between the two control factors. The correlation coefficient is examined next: an absolute value above 0.8 indicates an extremely strong correlation, 0.6-0.8 strong, 0.4-0.6 medium, 0.2-0.4 weak, and below 0.2 no correlation. If the results show that the correlations among the control factors are weak, the control factors can be used as input variables of the model.
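The correlation screening can be illustrated with a minimal Python sketch. It is not the SPSS procedure: it computes only the Pearson coefficient and maps its absolute value to the strength bands listed above, and it omits the significance test. The factor values and the function names `pearson` and `strength` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Map |r| to the strength bands quoted in the text."""
    a = abs(r)
    if a > 0.8:
        return "extremely strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "medium"
    if a >= 0.2:
        return "weak"
    return "irrelevant"


# Hypothetical values of two control factors sampled at the same grid cells.
slope = [5.0, 12.0, 20.0, 31.0, 40.0]
ndvi = [0.61, 0.48, 0.55, 0.30, 0.42]
r = pearson(slope, ndvi)
print(round(r, 3), strength(r))
```

The text does not say which band a value exactly on a boundary (for example 0.6) belongs to; the sketch assigns it to the stronger band, which is an assumption.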
In step S3, the known landslide grid cells and non-landslide grid cells randomly selected at a 1:1 ratio from the non-landslide area of the research area are combined into the model training and testing data set, which is further randomly divided into two parts: 70% of the data set is used for training and the remaining 30% for testing.
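A minimal sketch of this sampling step, assuming grid cells are identified by integer ids. The function name `build_dataset`, the fixed seed and the toy cell ids are illustrative assumptions, not part of the patent.

```python
import random


def build_dataset(landslide_cells, non_landslide_pool, train_frac=0.7, seed=0):
    """Pair known landslide cells with an equal number of randomly drawn
    non-landslide cells (1:1), label them 1 and 0, then split 70/30."""
    rng = random.Random(seed)
    negatives = rng.sample(non_landslide_pool, len(landslide_cells))
    samples = [(c, 1) for c in landslide_cells] + [(c, 0) for c in negatives]
    rng.shuffle(samples)
    cut = round(train_frac * len(samples))
    return samples[:cut], samples[cut:]


# Hypothetical cell ids: 100 landslide cells and a pool of non-landslide cells.
landslide_cells = list(range(100))
non_landslide_pool = list(range(1000, 5000))
train, test = build_dataset(landslide_cells, non_landslide_pool)
print(len(train), len(test))
```

The split here is a plain random split; whether the original work stratified by label is not stated in the text.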
The susceptibility values between 0 and 1 are divided into 5 grades by the natural breaks classification method. The class intervals are not equal; instead, the classes follow the natural groupings inherent in the data, from which the class boundaries are identified.
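To make the idea of natural breaks concrete, the sketch below finds class boundaries that minimize the within-class sum of squared deviations using plain dynamic programming. It illustrates the principle only and is not ArcGIS's implementation, so boundaries may differ from ArcGIS output; the function names and the toy susceptibility values are assumptions.

```python
from bisect import bisect_right


def natural_breaks(values, k):
    """Return the k-1 lower bounds of classes 2..k that minimise the
    within-class sum of squared deviations (exhaustive dynamic programming)."""
    xs = sorted(values)
    n = len(xs)
    s1, s2 = [0.0], [0.0]  # prefix sums of x and x^2
    for v in xs:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations of xs[i:j] around its mean (j > i)."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i
    starts, j = [], n
    for c in range(k, 0, -1):  # walk back to recover class start indices
        j = back[c][j]
        starts.append(j)
    return [xs[i] for i in sorted(starts)[1:]]


def grade(value, breaks):
    """Grade 1..k for a susceptibility value, given the class lower bounds."""
    return bisect_right(breaks, value) + 1


# Hypothetical susceptibility values for a handful of grid cells.
values = [0.01, 0.02, 0.03, 0.2, 0.21, 0.22, 0.5, 0.51, 0.7, 0.72, 0.95, 0.96]
breaks = natural_breaks(values, 5)
print(breaks, grade(0.6, breaks))
```

The triple loop is quadratic in the number of values per class, so for a full raster one would normally classify a sample of the cell values rather than every cell.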
In step S4, after the initial susceptibility map is obtained, the 5-very high susceptibility area of the research area is extracted; regional landslide remote sensing interpretation marks are then established by analyzing characteristics such as the form and tone of historical landslides in the research area; finally, potential landslide points in the 5-very high susceptibility area are interpreted according to these marks.
In step S6, the expanded landslide samples obtained in steps S4 and S5 are labeled 1 and the accurately selected non-landslide samples are labeled 0; the samples are imported into the random forest model again and divided into training and testing sets at a ratio of 7:3 for training and testing.
The beneficial effects of the invention are as follows. Fully supervised machine learning randomly selects non-landslide grid cells as output variables for model training and testing; because these non-landslide samples are uncertain, they introduce substantial error into training and testing and reduce the modeling accuracy of fully supervised machine learning. In semi-supervised machine learning modeling, on the one hand, highly reliable non-landslide samples are selected from the very low susceptibility area, which reduces the error in the training and testing data sets and improves modeling accuracy; on the other hand, the number of known landslide records is expanded by screening "potential landslides" with very high probability, so that the training and testing samples of semi-supervised machine learning are more broadly representative and the trained model reflects the nonlinear relationship between landslides and control factors more accurately. In summary, semi-supervised machine learning makes good use of, and expands, the existing labeled landslide samples to guide landslide/non-landslide classification, and further improves landslide susceptibility prediction modeling performance beyond that of fully supervised machine learning.
Description of the drawings:
FIG. 1 is a flow chart of a regional landslide susceptibility prediction method based on a semi-supervised random forest model.
Detailed Description
The regional landslide susceptibility prediction method based on a semi-supervised random forest model according to the invention comprises the following steps:
S1: Landslide records and related control factors in the research area are managed and spatially analyzed on the RS (remote sensing) and ArcGIS platforms to obtain known landslide samples, wherein the control factors are at least one of landform, basic geology, hydrological environment and surface cover data;
The quality of landslide record data has a very important influence on susceptibility prediction performance for a research area. Landslide records provide information such as the location, movement type, number of triggering events and scale of each landslide, and the development of the geological environment associated with it.
In landslide susceptibility prediction, representative control factors such as landform, basic geology, hydrological environment and surface cover are selected according to the landslide development characteristics and influencing factors of the research area and its natural geographic characteristics.
S2: The control factors that best represent landslide development characteristics are determined on the basis of frequency ratio (denoted FR) and correlation analysis, and a random forest model is established. In ArcGIS software, the polygon files of the known landslide records are converted into grid cells; an equal number of non-landslide grid cells is randomly selected from the non-landslide area of the research area to form the training and testing data set of the model, which is further randomly divided into two parts: 70% of the data set is used for training and the remaining 30% for testing. During training and testing of the random forest model, the known landslide grid cells (positive samples) are labeled 1 and the randomly selected non-landslide grid cells (negative samples) are labeled 0. The output variable of the random forest model is a probability value between 0 and 1 for each grid cell, and the distribution of these values reflects the distribution pattern of regional landslide susceptibility. The trained and tested random forest model is used to predict the initial landslide susceptibility values of the whole research area, and the research area is then divided into 5 landslide susceptibility grades with the natural breaks classification method in ArcGIS software, taking the susceptibility distribution pattern into account: 1-very low susceptibility area, 2-low susceptibility area, 3-medium susceptibility area, 4-high susceptibility area and 5-very high susceptibility area.
S3: Based on the FR values of the control factors, the known landslide grid cells and the randomly selected non-landslide grid cells, fully supervised machine learning, namely the random forest model, predicts the initial landslide susceptibility values;
S4: The high-resolution remote sensing image is overlaid on the initial landslide susceptibility map. Within the initial very high susceptibility area, areas with a very high probability of landslide occurrence are delineated in ArcGIS by manual visual interpretation, using the shape, size, tone and structural characteristics of landslide bodies on the image together with the remote sensing interpretation marks established from field survey results; grid cells are then randomly selected within these areas as "potential landslides" to expand the known landslide samples; the expanded samples and the known landslide samples of step S1 together form the landslide samples;
S5: Meanwhile, grid cells are randomly selected from the 1-very low susceptibility area as non-landslide samples;
S6: The expanded landslide samples and the accurately selected non-landslide samples obtained in steps S4 and S5 are imported into the random forest model again for training and testing, which constructs the semi-supervised random forest model used for the final landslide susceptibility prediction:
S61: Establishing the semi-supervised random forest model: the initial landslide susceptibility values obtained in step S21 are imported into ArcGIS software to generate an initial landslide susceptibility map, and the 5-very high susceptibility area of the research area is obtained with the natural breaks classification method; regional landslide remote sensing interpretation marks are then established by analyzing the shape, size, tone and structural characteristics of slope bodies on the image together with the field investigation results; potential landslide points in the 5-very high susceptibility area are then interpreted according to these marks; the expanded landslide samples are labeled 1 and the accurately selected non-landslide samples are labeled 0; finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model.
S62: The semi-supervised random forest model obtained from the second round of training and testing is used to predict the landslide susceptibility of the research area; the predicted susceptibility values are likewise imported into ArcGIS software and divided by the natural breaks classification method into 5 grades: 1-very low susceptibility area, 2-low susceptibility area, 3-medium susceptibility area, 4-high susceptibility area and 5-very high susceptibility area.
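The sample selection of steps S4 to S6 can be sketched as follows. This is an illustration under stated assumptions rather than the inventors' code: the visual interpretation step is represented by a ready-made set `interpreted_cells`, the number of potential landslides to add (`n_potential`) and the 1:1 ratio of negatives to positives are assumed, and all names and toy data are hypothetical.

```python
import random


def select_samples(grades, known_landslides, interpreted_cells, n_potential, seed=0):
    """Build the labeled sample list for the second (semi-supervised) round.

    grades            : dict cell_id -> initial susceptibility grade (1..5)
    known_landslides  : set of cell ids from the landslide records (step S1)
    interpreted_cells : set of cell ids delineated by visual interpretation
    n_potential       : how many potential landslide cells to add (assumed)
    """
    rng = random.Random(seed)
    # S4: potential landslides must lie in the grade-5 area and be new cells.
    candidates = sorted(
        c for c in interpreted_cells
        if grades.get(c) == 5 and c not in known_landslides
    )
    potential = rng.sample(candidates, min(n_potential, len(candidates)))
    positives = sorted(known_landslides) + potential
    # S5: non-landslide samples come only from the grade-1 area.
    very_low = sorted(
        c for c, g in grades.items() if g == 1 and c not in known_landslides
    )
    negatives = rng.sample(very_low, min(len(positives), len(very_low)))
    # S6: label 1 for landslide samples, 0 for non-landslide samples.
    return [(c, 1) for c in positives] + [(c, 0) for c in negatives]


# Hypothetical toy grid: cells 0-39 grade 1, 40-59 grade 3, 60-99 grade 5.
grades = {c: 1 for c in range(0, 40)}
grades.update({c: 3 for c in range(40, 60)})
grades.update({c: 5 for c in range(60, 100)})
known = {60, 61, 62}
interpreted = {45, 70, 71, 72, 73}  # cell 45 lies outside grade 5 and is dropped
samples = select_samples(grades, known, interpreted, n_potential=2)
print(len(samples))
```

The resulting list would then be split 70/30 and used to retrain the random forest, as described in S61.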
The random forest model is a classifier comprising multiple decision trees. Its main idea is to construct different training sets by sampling with replacement and to select different feature subsets at random, so that the generated classification trees are diverse. Because a set of different classification trees reflects the actual result more comprehensively than a single tree, it improves the predictive ability of the model and helps avoid overfitting. In addition, a large body of theoretical and applied research in China and abroad has confirmed the accuracy of the random forest model from different angles; the model is tolerant of outliers and noise in data sets and is widely regarded as one of the best machine learning models currently available.
In addition, random forests use the out-of-bag (OOB) error to obtain an unbiased estimate of the generalization error, and the generalization error gradually converges as the number of trees increases. The out-of-bag error also reflects the importance of each environmental factor variable: when only a single variable is perturbed, the magnitude of the resulting change in out-of-bag error determines the importance of that variable. The mean decrease in accuracy and the mean decrease in impurity are used to measure the importance of the input variables. Furthermore, the predictive performance of the model is controlled mainly by adjusting the number of trees and the number of features sampled.
One characteristic of random forests is that they can give the relative weight of each environmental factor, derived from the Gini index. In random forest classification trees the optimal split is measured by impurity, which is calculated with the Gini index. The decrease in the Gini index attributable to environmental factor k at each node split is calculated; these decreases are summed over all nodes in the forest and averaged over all trees, and this average is the importance of environmental factor k. The importance of an environmental factor is expressed as the percentage of its mean decrease in the sum of the mean decreases of all environmental factors, and can be calculated by the formula
[Formula image BDA0002938282790000071: relative importance of environmental factor k]
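The quantities in this paragraph can be illustrated with a short sketch for binary labels: the Gini impurity of a node, the impurity decrease produced by one split, and the conversion of mean decreases into percentage importances. The function names and the numbers are hypothetical, and the formula image itself is not reproduced.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = landslide, 0 = not)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into left and right."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def relative_importance(mean_decrease):
    """Express each factor's mean decrease as a percentage of the total."""
    total = sum(mean_decrease.values())
    return {name: 100.0 * value / total for name, value in mean_decrease.items()}


# A perfectly separating split removes all impurity from a balanced node.
print(gini_decrease([1, 1, 0, 0], [1, 1], [0, 0]))

# Hypothetical mean Gini decreases for two environmental factors.
print(relative_importance({"slope": 3.0, "ndvi": 1.0}))
```

In a real forest the per-split decreases are accumulated for each factor over all nodes and averaged over all trees before the percentage step is applied.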
The classification performance of a random forest depends on the correlation between any two trees in the forest and on the classification strength of each tree. Reducing the number m of features selected at each split reduces both the correlation between trees and the strength of each tree; increasing m increases both. The key step in constructing a random forest is therefore choosing the optimal m, which is done mainly on the basis of the out-of-bag error.
One of the most important advantages of random forests is that neither cross-validation nor a separate test set is required to obtain an unbiased estimate of the error; the estimate is built internally as the decision trees are generated.
The method mainly uses the random forest (RF) package in the R language for landslide susceptibility prediction modeling. The accuracy of the random forest model is controlled mainly by the number of factor features and the number of trees, and the optimal parameters are obtained mainly by automatic screening of the number of factor features against the out-of-bag error. The invention obtains the importance of each environmental factor mainly from the mean decrease in accuracy and the mean decrease in impurity in the random forest model.
Landslide susceptibility refers to the spatial probability of regional landslide occurrence: the spatial locations where future landslides are likely to occur are predicted from environmental conditions similar to those underlying past landslides. The selection of control factors for the research area is important for accurate and reliable landslide susceptibility prediction.
In landslide susceptibility prediction, four categories of representative control factors (landform, basic geology, hydrological environment and surface cover) are selected according to the landslide development characteristics and influencing factors of the research area and its natural geographic characteristics.
In a specific implementation, Nankang District, Ganzhou City, Jiangxi Province is taken as an example. According to the landslide development characteristics and influencing factors of the area, its natural geographic characteristics and the frequency ratios calculated between landslides and their control factors, and considering how difficult the relevant control factors are to obtain, 11 control factors are selected: landform (elevation, slope, aspect, plan curvature, profile curvature and topographic relief), engineering geology (lithology), hydrological environment (modified normalized difference water index (MNDWI) and distance to the water system) and surface cover (normalized difference built-up index (NDBI) and normalized difference vegetation index (NDVI)).
(1) Landform and engineering geological factors
The selected topographic factors, namely elevation, slope, aspect, profile curvature, plan curvature and topographic relief, are all derived from the elevation data by spatial analysis in ArcGIS software. Slope is an important factor in promoting landslides: it directly affects the shear stress that destabilizes a slope, and landslide occurrence is directly related to certain slope ranges. Aspect reveals differences in soil moisture distribution and vegetation cover in different directions. Plan curvature and profile curvature reflect the influence of the terrain on water flow velocity and flow convergence, respectively. Topographic relief is expressed as the difference between the highest and lowest elevations within the Nankang area and is commonly used to characterize regional topography quantitatively at the macroscopic scale. Differences in the physical properties of lithologies represent differences among rock and soil masses in permeability, matric suction, shear strength and so on.
(2) Hydrological environment and surface coating factor
The hydrological environmental factors influence the landslide susceptibility by controlling the partial movement of the landslide surface and the ground water. The influence of the hydrological environment on landslide development is represented by acquiring the distance between grid cells in the area and a water system and an improved normalized difference water body index (MNDWI). The surface coverage factor is characterized herein primarily using a normalized building index (NDBI) and a normalized vegetation index (NDVI). NDBI reflects the distribution of residential building sites and also laterally learns the concentrated activity areas of local residents. NDVI can characterize the density of surface vegetation and high coverage vegetation can inhibit landslide development.
In a specific implementation, all control factors are converted into raster (grid) format using ArcGIS software; the resolution of the remote sensing images and of the control factors is 30 m.
According to the formula
Figure BDA0002938282790000081
the frequency ratio (FR) of each control factor is obtained. FR > 1 indicates that the control factor favors landslide development, and a larger FR indicates a greater effect on landslide development.
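The formula image was not recovered in this text, so the following Python sketch assumes the definition of the frequency ratio commonly used in landslide susceptibility studies: the share of landslide grid cells falling in one class of a control factor, divided by the share of all grid cells falling in that class. The function name, the slope classes and the cell counts are hypothetical and serve only as an illustration.

```python
def frequency_ratio(landslide_cells, total_cells):
    """Return the FR of every class of one control factor.

    landslide_cells[c]: number of landslide grid cells in class c
    total_cells[c]:     number of all grid cells in class c
    Assumed definition: FR = (landslide share of class) / (area share of class).
    """
    n_landslide = sum(landslide_cells.values())
    n_total = sum(total_cells.values())
    fr = {}
    for c, cells in total_cells.items():
        share_of_landslides = landslide_cells.get(c, 0) / n_landslide
        share_of_area = cells / n_total
        fr[c] = share_of_landslides / share_of_area
    return fr


# Hypothetical slope classes (degrees) with made-up counts.
slope_total = {"0-10": 5000, "10-20": 3000, "20-30": 2000}
slope_landslide = {"0-10": 10, "10-20": 40, "20-30": 50}
fr_slope = frequency_ratio(slope_landslide, slope_total)
for cls, value in fr_slope.items():
    print(cls, round(value, 3))
```

With these made-up counts the steepest class has an FR above 1 (favorable to landslide development) and the flattest class has an FR below 1.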
Before susceptibility prediction modeling, the independence of the control factors must be checked to avoid redundant information. Collinearity among the control factors is assessed by correlation analysis in SPSS 23 software; the results show that the correlations among the control factors are low, so they can be used as input variables for machine learning.
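As a rough illustration of this check outside SPSS, the sketch below computes the Pearson correlation coefficient between two control factors in plain Python and labels its strength with the bands stated in claim 4 (above 0.8 extremely strong, 0.6-0.8 strong, 0.4-0.6 medium, 0.2-0.4 weak, below 0.2 uncorrelated). The function names and the sample values are hypothetical, and the significance test performed in SPSS is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def strength(r):
    """Map |r| to the strength bands listed in claim 4."""
    a = abs(r)
    if a > 0.8:
        return "extremely strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "medium"
    if a >= 0.2:
        return "weak"
    return "uncorrelated"


# Hypothetical normalized values of two control factors at five grid cells.
slope = [0.10, 0.40, 0.35, 0.80, 0.60]
ndvi = [0.55, 0.20, 0.70, 0.35, 0.50]
r = pearson(slope, ndvi)
print(round(r, 3), strength(r))
```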
In a specific implementation, the FR values of the 11 selected control factors are normalized to the range [0, 1] and then used as the input variables of the random forest model; the same control factors are also the input variables of the semi-supervised random forest model. The polygon files of the 233 landslides that have occurred in Nankang District are converted into grid cells in ArcGIS, giving 2598 landslide grid cells in total. These 2598 labeled landslide grid cells and 2598 non-landslide grid cells randomly selected from the non-landslide area of Nankang form the model training and testing data set, which is randomly divided into two parts: 70% for training and the remaining 30% for testing. During training and testing of the random forest model, the susceptibility labels of the known landslide grid cells and of the randomly selected non-landslide grid cells are set to 1 and 0, respectively.
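The data preparation described above can be sketched in Python as follows. The normalization is assumed to be min-max scaling (the text only states the target range), and the cell identifiers, the size of the non-landslide pool and the function names are hypothetical placeholders; only the counts 2598 and the 70%/30% split come from the description.

```python
import random


def min_max(values):
    """Scale values linearly to [0, 1] (assumed normalization method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_dataset(landslide_ids, non_landslide_pool, train_frac=0.7, seed=0):
    """Pair every landslide cell with one random non-landslide cell, then split."""
    rng = random.Random(seed)
    negatives = rng.sample(non_landslide_pool, len(landslide_ids))
    samples = [(i, 1) for i in landslide_ids] + [(i, 0) for i in negatives]
    rng.shuffle(samples)
    cut = round(train_frac * len(samples))
    return samples[:cut], samples[cut:]


# Placeholder cell ids: 2598 landslide cells and a larger non-landslide pool.
landslide_ids = list(range(2598))
non_landslide_pool = list(range(10_000, 60_000))
train, test = build_dataset(landslide_ids, non_landslide_pool)
print(len(train), len(test))
```

A fixed seed is used so that the split is repeatable; the description does not say how the random division was seeded.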
In implementation, the RF function package in the R language is mainly used for landslide susceptibility prediction modeling. The accuracy of the random forest model is controlled mainly by the number of factor features and the number of trees; the optimal parameters are obtained by automatically screening the number of control-factor features against the out-of-bag error. For the first random forest model, the number of factor features is 2 and the number of trees is 600: at around 600 trees, the model error is essentially stable.
Then the importance of each control factor is obtained from the mean decrease in accuracy and the mean decrease in impurity of the random forest model. The more important control factors include slope, distance to the river system, plan curvature, topographic relief and elevation.
Next, the initial landslide susceptibility values of the whole Nankang area, predicted by the trained random forest model, are imported into ArcGIS software and converted into a raster file. Using the natural breaks classification method together with the distribution of the landslide susceptibility values, the Nankang area is divided into 5 landslide susceptibility levels: 1-very low susceptibility region (38.7%), 2-low susceptibility region (23.2%), 3-medium susceptibility region (16.7%), 4-high susceptibility region (12.3%) and 5-very high susceptibility region (9.1%).
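To make the grading step concrete, the sketch below implements a natural breaks classification in plain Python, assuming the usual Fisher-Jenks criterion (class limits chosen to minimize the within-class sum of squared deviations). It is not the ArcGIS implementation, whose internals may differ, and its quadratic running time makes it suitable only for small illustrative samples rather than a full raster. The function names and the susceptibility values are hypothetical.

```python
import bisect


def natural_breaks(values, k):
    """Return the k-1 upper class limits minimizing within-class squared deviation."""
    xs = sorted(values)
    n = len(xs)
    # Prefix sums allow the squared deviation of xs[i:j] to be computed in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: lowest cost of splitting the first j values into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    back[c][j] = i
    # Walk back from the last class to recover the class boundaries.
    limits = []
    j = n
    for c in range(k, 1, -1):
        i = back[c][j]
        limits.append(xs[i - 1])  # largest value of the class below
        j = i
    return limits[::-1]


def susceptibility_level(value, limits):
    """Level 1 is the lowest class; a value equal to a limit stays in the lower class."""
    return bisect.bisect_left(limits, value) + 1


# Hypothetical susceptibility values for a handful of grid cells, split into 3 classes.
probs = [0.01, 0.02, 0.03, 0.40, 0.42, 0.90, 0.95]
limits = natural_breaks(probs, 3)
levels = [susceptibility_level(p, limits) for p in probs]
print(limits, levels)
```

For the five levels used in the description, the same function would be called with k set to 5.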
In the modeling process of the semi-supervised random forest, 520 potential landslide grid cells with a high probability of landsliding are first identified in the 5-very high susceptibility region by interpreting high-resolution remote sensing images. These cells amount to 20% of the known landslide grid cells, which effectively increases the number of landslide grid cells while avoiding the loss of reliability that too many potential landslide cells would cause. The 520 potential landslide grid cells are added to the 2598 known landslide grid cells, giving 3118 landslide grid cells whose labels are set to 1. Then 3118 grid cells that are very likely non-landslide are randomly selected from the 1-very low susceptibility region of the initial landslide susceptibility map, and their labels are set to 0. Finally, the expanded landslide and non-landslide grid data are randomly divided into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model.
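The sample expansion can be sketched in Python as below. The counts (2598 known cells, a 20% expansion giving 520 cells, 3118 cells per class) follow the description; the cell identifiers, the pool sizes and the function name are hypothetical. In the description the potential landslide cells come from image interpretation within the very high susceptibility region, which is represented here simply by a pre-made list of candidate cells.

```python
import random


def expand_samples(known_landslides, potential_landslides, very_low_cells,
                   expansion_ratio=0.2, seed=0):
    """Add interpreted potential landslides and draw matching negatives."""
    rng = random.Random(seed)
    n_extra = round(expansion_ratio * len(known_landslides))
    extra = rng.sample(potential_landslides, n_extra)
    positives = list(known_landslides) + extra            # label 1
    negatives = rng.sample(very_low_cells, len(positives))  # label 0
    return positives, negatives


# Placeholder ids: known landslide cells, interpreted candidates in level 5,
# and cells of level 1 (very low susceptibility) from the initial map.
known = list(range(2598))
candidates_level5 = list(range(100_000, 101_000))
cells_level1 = list(range(200_000, 300_000))
positives, negatives = expand_samples(known, candidates_level5, cells_level1)
print(len(positives), len(negatives))
```

The resulting positive and negative lists can then be split 70%/30% in the same way as the first data set.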
The random forest model of the second round of training and testing also uses the RF function package in the R language, and its optimal parameters are again obtained by automatically screening the number of control-factor features against the out-of-bag error. For the second random forest model, the number of factor features is 2 and the number of trees is 400: at around 400 trees, the model error is essentially stable. In addition, to facilitate comparison between the models, the landslide susceptibility predicted by the semi-supervised random forest model is also divided into 5 levels by the natural breaks classification method: 1-very low susceptibility region (37.5%), 2-low susceptibility region (21.3%), 3-medium susceptibility region (13.6%), 4-high susceptibility region (12.9%) and 5-very high susceptibility region (14.7%).
Finally, the area under the receiver operating characteristic (ROC) curve (AUC) is used to evaluate the accuracy of the two models. The AUC values of the random forest model and the semi-supervised random forest model are 0.899 and 0.974, respectively. This shows that semi-supervised machine learning substantially improves on the susceptibility prediction accuracy of fully supervised machine learning, and further that expanding the known landslide samples and accurately screening the non-landslide samples can greatly improve the susceptibility prediction performance of a machine learning model.
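For readers who want to reproduce this kind of evaluation, the sketch below computes the AUC in plain Python through its rank interpretation: the probability that a randomly chosen landslide cell receives a higher predicted value than a randomly chosen non-landslide cell, with ties counted as one half. The function name and the four example cells are hypothetical and are unrelated to the 0.899 and 0.974 reported above.

```python
def auc(labels, scores):
    """Area under the ROC curve from labels (1/0) and predicted scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Group equal scores so that tied samples share their average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical test cells: two non-landslide (0) and two landslide (1).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(example_labels, example_scores)
print(example_auc)
```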

Claims (8)

1. A regional landslide susceptibility prediction method based on a semi-supervised random forest model, characterized by comprising the following steps:
S1: landslide inventory records and related control factors in the study area are managed and spatially analyzed on the RS and ArcGIS platforms, and known landslide samples are screened out, wherein the control factors comprise four categories: topography, basic geology, hydrological environment and surface cover;
S2: determining, based on frequency ratio and correlation analysis, the control factors that best characterize landslide development, wherein the frequency ratio is denoted FR, and establishing a random forest model:
converting the polygon files of the known landslide inventory into grid cells in ArcGIS software, randomly selecting an equal number of non-landslide grid cells from the non-landslide area of the study area to form the training and testing data set of the model, and randomly dividing the data set into two parts: 70% for training and the remaining 30% for testing; during training and testing of the random forest model, the known landslide grid cells, as positive samples, are denoted by 1, and the randomly selected non-landslide grid cells, as negative samples, are denoted by 0; the output variable of the random forest model is a probability value between 0 and 1 for each grid cell, and the distribution of these probability values reflects the distribution of regional landslide susceptibility; the trained and tested random forest model is used to predict the initial landslide susceptibility values of the whole study area, and the study area is then divided into 5 landslide susceptibility levels using the natural breaks classification method in ArcGIS software together with the distribution of the landslide susceptibility values: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region;
S3: based on the FR values of the control factors, the known landslide grid cells and the randomly selected non-landslide grid cells, the fully supervised machine learning model, namely the random forest model, outputs the predicted initial landslide susceptibility values, which are graded into the five landslide susceptibility levels of step S2;
S4: superimposing the high-resolution remote sensing image on the initial landslide susceptibility map; within the initial very high susceptibility region, delineating in ArcGIS the areas with a very high probability of landsliding, using remote sensing interpretation signs established from the shape, size, tone and structural characteristics of landslide bodies on the image and from field survey results; and randomly selecting grid cells in these areas as "potential landslides" to expand the known landslide samples; the expanded landslide samples and the known landslide samples of step S1 together form the landslide samples;
S5: randomly selecting grid cells from the 1-very low susceptibility region as non-landslide samples;
S6: establishing the semi-supervised random forest model: the expanded landslide samples and the accurately selected non-landslide samples obtained in steps S4 and S5 are imported into the random forest model again for training and testing, which constitutes the semi-supervised random forest model used for the final landslide susceptibility prediction, specifically comprising:
S61: importing the initial landslide susceptibility values obtained in step S21 into ArcGIS software to generate an initial landslide susceptibility map, and obtaining the 5-very high susceptibility region of the study area by the natural breaks classification method; then establishing regional landslide remote sensing interpretation signs by analyzing the shape, size, tone and structural characteristics of slope bodies on the image together with the field survey results; then interpreting potential landslide points in the 5-very high susceptibility region according to the interpretation signs; labeling the expanded landslide samples 1 and the accurately selected non-landslide samples 0; and finally randomly dividing the expanded landslide and non-landslide grid data into training samples (70%) and testing samples (30%) for retraining and retesting the random forest model;
S62: the semi-supervised random forest model obtained from the second round of training and testing is used to predict the landslide susceptibility of the study area; the predicted landslide susceptibility values are likewise imported into ArcGIS software and divided into 5 levels by the natural breaks classification method: 1-very low susceptibility region, 2-low susceptibility region, 3-medium susceptibility region, 4-high susceptibility region and 5-very high susceptibility region.
2. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein in step S1 the four major categories of control factors of the study area (topography, hydrological environment, stratigraphic lithology and surface cover) are acquired from basic geological data of the study area, based on the ArcGIS platform and visual interpretation of remote sensing images.
3. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein the frequency ratio method of step S2 is an efficient quantitative analysis method, and the frequency ratio is calculated by the formula:
Figure FDA0002938282780000021
FR > 1 indicates that the attribute interval of the control factor is favorable to landslide development, and FR < 1 indicates that the attribute interval of the control factor is unfavorable to landslide development.
4. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein in step S2 the correlation coefficients between the control factors are calculated by correlation analysis in SPSS 23 software, with the Pearson correlation coefficient selected and the significance test option checked. The significance is examined first: a significance below 0.05 indicates a linear relationship between two control factors. The correlation coefficient is examined next: an absolute value above 0.8 indicates extremely strong correlation, 0.6-0.8 strong correlation, 0.4-0.6 medium correlation, 0.2-0.4 weak correlation, and below 0.2 no correlation. If the results show that the correlations among the control factors are low, the control factors can be used as input variables of the model.
5. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein in step S3 the known landslide grid cells and non-landslide grid cells randomly selected at a 1:1 ratio from the non-landslide area of the study area form the model training and testing data set, which is randomly divided into two parts: 70% for training and the remaining 30% for testing.
6. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein the susceptibility values between 0 and 1 are divided into 5 levels by the natural breaks classification method, which does not use equal intervals but identifies the class intervals from the natural groupings inherent in the data.
7. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein in step S4, after the initial susceptibility map is obtained, the 5-very high susceptibility region of the study area is extracted; regional landslide remote sensing interpretation signs are then established by analyzing characteristics of the historical landslides of the study area such as form and tone; and finally potential landslide points in the 5-very high susceptibility region are interpreted according to the landslide remote sensing interpretation signs.
8. The regional landslide susceptibility prediction method based on a semi-supervised random forest model as claimed in claim 1, wherein in step S6 the landslide samples expanded in steps S4 and S5 are labeled 1 and the accurately selected non-landslide samples are labeled 0, the samples are imported into the random forest model again, and the training and testing sets are divided at a ratio of 7:3 for training and testing.
Application CN202110168854.0A, priority date 2021-02-07, filing date 2021-02-07: Regional landslide susceptibility prediction method based on semi-supervised random forest model. Status: Pending. Publication: CN112966722A (en)

Publications (1)

Publication Number Publication Date
CN112966722A (en) 2021-06-15

Family

ID=76275161

Country Status (1)

Country Link
CN (1) CN112966722A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210615