CN114373544A

CN114373544A - Method, system and device for predicting membranous nephropathy based on machine learning

Info

Publication number: CN114373544A
Application number: CN202111585500.2A
Authority: CN
Inventors: 王倩; 董哲毅; 苏仕斌; 陈香美
Original assignee: First Medical Center of PLA General Hospital
Current assignee: First Medical Center of PLA General Hospital
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-04-19

Abstract

The invention provides a method, a system and a device for predicting membranous nephropathy based on machine learning, which comprises data acquisition, data preprocessing, feature screening, prediction model construction, prediction model testing and evaluation, wherein data information of a patient to be retrieved is acquired and input into the prediction model, and the prediction model predicts membranous nephropathy according to the data information of the patient to be retrieved.

Description

Method, system and device for predicting membranous nephropathy based on machine learning

Technical Field

The invention relates to the technical field of biological detection, in particular to a method, a system and a device for predicting membranous nephropathy based on machine learning.

Background

Membranous nephropathy (MN), also known as membranous glomerulonephritis, is pathologically characterized by diffuse immune complex deposition beneath the epithelial cells of the glomerular basement membrane, accompanied by diffuse thickening of the basement membrane. Clinically it presents mainly as nephrotic syndrome (NS) or asymptomatic proteinuria. Membranous nephropathy can be primary, or secondary to a variety of conditions, including infections (hepatitis B and C viruses), systemic diseases (such as lupus erythematosus), drug therapies (such as gold and penicillamine) and malignancies. The disease is characterized by a relapsing course and chronic persistence.

In recent years, with economic development and environmental change, the spectrum of renal disease has shifted: the incidence of membranous nephropathy has increased year by year, and it now accounts for up to 18.42 percent of primary glomerular diseases. Early and accurate diagnosis is the basis of treatment. Renal biopsy pathology remains the gold standard for diagnosing membranous nephropathy, but renal biopsy is an invasive procedure that carries a certain medical risk, and its application is limited by the available technical means and the conditions under which it can be performed.

If membranous nephropathy could be predicted in advance, early treatment and recovery of patients would be facilitated; at present, however, there is little research on predicting membranous nephropathy.

It should be noted that the above background description is only for the sake of clarity and complete description of the technical solutions of the present invention and for the understanding of those skilled in the art. These technical solutions must not be considered known to the person skilled in the art merely because they have been elucidated in the technical background section of the present invention.

Disclosure of Invention

The invention aims to provide a method, a system and a device for predicting membranous nephropathy based on machine learning, which can develop and verify a prediction model of membranous nephropathy by using the machine learning method, can realize high-efficiency and high-accuracy prediction, and have important research significance and use value for early treatment and prevention of membranous nephropathy.

In order to achieve the purpose, the invention provides the following technical scheme:

the invention provides a method for predicting membranous nephropathy based on machine learning, which comprises the following steps:

step one, data acquisition, namely acquiring data information of patients who have undergone renal biopsy, wherein the detection results include MN and non-MN; the data information of patients satisfying the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, wherein X is a positive integer;

step two, data preprocessing, wherein the data preprocessing comprises missing value processing, and the missing value processing comprises: through preliminary screening, deleting patient data information with a missing rate of more than 20%, and filling the remaining missing values by a random forest method, so as to obtain the patient data information of MN and non-MN and obtain Y characteristic indexes, wherein Y is a positive integer and Y is less than or equal to X;

step three, feature screening, which comprises screening out Z characteristic indexes by a mutual information method, wherein Z is a positive integer and Z is less than or equal to Y; and performing dimensionality reduction on the Z characteristic indexes by a feature elimination method to obtain M characteristic indexes, wherein M is a positive integer and M is less than or equal to Z;

step four, prediction model construction, namely using 70% of the patients for training and modeling and 30% of the patients as a verification set for verification, wherein the sample labels of the training set include MN and non-MN patients, and the prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method;

step five, prediction model testing and evaluation, namely testing and evaluating the prediction model for predicting membranous nephropathy based on machine learning;

and acquiring data information of a patient to be retrieved, inputting that data information into the prediction model, and predicting membranous nephropathy by the prediction model according to that data information.
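The 70% / 30% partition described in step four can be sketched as follows. This is a minimal illustration only: the function name `split_70_30`, the fixed random seed and the use of integer patient identifiers are assumptions made for the example, and the source does not state whether the split is stratified by MN status.

```python
import random


def split_70_30(patient_ids, seed=0):
    """Shuffle patient IDs and split them into 70% training and 30% verification.

    The seed is a hypothetical choice made only so the example is reproducible.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(0.7 * len(ids)))  # index separating the two subsets
    return ids[:cut], ids[cut:]


# Toy cohort of 100 hypothetical patient identifiers.
train_ids, valid_ids = split_70_30(range(100))
print(len(train_ids), len(valid_ids))
```

Each of the seven candidate methods would then be fitted on the training subset and compared on the verification subset.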

Optionally, in the data acquisition of step one, laboratory detection indexes of the test population and laboratory detection indexes of the patient to be retrieved are collected, and the detection results include MN and non-MN.

Optionally, in the data acquisition in step one, the inclusion condition and exclusion criterion include: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN.

Optionally, in the second step, the data preprocessing further includes abnormal value processing, and when the data is an abnormal value, the corresponding patient data information is deleted.

Optionally, the outlier processing comprises:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

Optionally, outlier processing employs a 3sigma principle.

Optionally,

step one, data acquisition, namely acquiring data information of patients who have undergone renal biopsy, wherein the detection results include MN and non-MN; the data information of patients satisfying the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, wherein X is a positive integer; the number of patients after the data acquisition process is A, wherein A is a positive integer;

step two, data preprocessing, namely performing data preprocessing on the data information of the A patients, wherein the data preprocessing comprises missing value processing, and the missing value processing comprises: through preliminary screening, deleting patient data information with a missing rate of more than 20%, and filling the remaining missing values by a random forest method, so as to obtain the patient data information of MN and non-MN and obtain Y characteristic indexes, wherein Y is a positive integer and Y is less than or equal to X;

step three, feature screening, which comprises screening out Z characteristic indexes by a mutual information method, wherein Z is a positive integer and Z is less than or equal to Y, and performing dimensionality reduction on the Z characteristic indexes by a feature elimination method to obtain M characteristic indexes, namely ALB, β2-MG, α-G, urine red blood cells, LAM, BUN and TP, wherein M is 7 and Z is greater than or equal to 7; three further characteristic indexes, namely TC, 24-hour urine protein quantification and GFR, are added for constructing a simplified prediction model;

step four, simplified prediction model construction, namely using 70% of the patients for training and modeling and 30% of the patients as a verification set for verification, wherein the sample labels of the training set include MN and non-MN patients, and the simplified prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method.

Optionally,

step one, data acquisition, namely acquiring data information of patients who have undergone renal biopsy, wherein the detection results include MN and non-MN; the data information of patients satisfying the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, wherein X is a positive integer; patient data information containing PLA2R detection is included, and the number of patients after the data acquisition process is A', wherein A' is a positive integer;

step two, data preprocessing, namely performing data preprocessing on the data information of the A' patients, wherein the data preprocessing comprises missing value processing, and the missing value processing comprises: through preliminary screening, deleting patient data information with a missing rate of more than 20%, and filling the remaining missing values by a random forest method, so as to obtain the patient data information of MN and non-MN and obtain Y' characteristic indexes, wherein Y' is a positive integer;

step three, feature screening, which comprises screening out Z' characteristic indexes by a mutual information method, wherein Z' is a positive integer and Z' is less than or equal to Y', and performing dimensionality reduction on the Z' characteristic indexes by a feature elimination method to obtain M' characteristic indexes, namely PLA2R, ALB and β2-MG, wherein M' is 3 and Z' is greater than or equal to 3; three further characteristic indexes, namely TC, 24-hour urine protein quantification and GFR, are added for constructing an optimized prediction model;

step four, optimized prediction model construction, namely using 70% of the patients for training and modeling and 30% of the patients as a verification set for verification, wherein the sample labels of the training set include MN and non-MN patients, and the optimized prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method.

Optionally, the AUC is used as an evaluation index in the fifth step, and the prediction model for predicting membranous nephropathy based on machine learning is tested and evaluated.
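As an illustration of the AUC evaluation index, the area under the ROC curve can be computed as the fraction of (MN, non-MN) patient pairs in which the MN patient receives the higher predicted score, with ties counted as one half. The sketch below is a plain-Python illustration under that definition; the function name `roc_auc` and the toy labels and scores are made up for the example and are not data from the invention.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) AUC: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = MN, 0 = non-MN, with hypothetical predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 1.0 means every MN patient is scored above every non-MN patient, while 0.5 corresponds to random ranking.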

The present invention also provides a system for predicting membranous nephropathy based on machine learning, the system comprising:

the data acquisition module 1 is used for acquiring data information of a patient who has performed renal biopsy;

the data preprocessing module 2 is used for cleaning, deleting and filling patient data information, the data preprocessing module 2 comprises a missing value processing module 201, the missing value processing module 201 is used for deleting the patient data information with the missing rate more than 20% through preliminary screening, and filling missing values by adopting a random forest method to obtain the patient data information of MN and non-MN;

the feature screening module 3 is used for screening and ranking the characteristic indexes, screening out Z characteristic indexes by a mutual information method, wherein Z is a positive integer and Z is less than or equal to Y, and performing dimensionality reduction on the Z characteristic indexes by a feature elimination method to obtain M characteristic indexes, wherein M is a positive integer and M is less than or equal to Z;

the prediction model building module 4 is used for training and modeling with 70% of the patients and verifying with 30% of the patients as a verification set, wherein the sample labels of the training set include MN and non-MN patients, and a prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method;

and an automatic prediction module 5, wherein the data acquisition module 1, the data preprocessing module 2, the feature screening module 3, the prediction model building module 4 and the automatic prediction module 5 are connected in sequence by electrical signals, so that the data information of a patient to be retrieved is acquired and input to the automatic prediction module 5 for prediction.

Optionally, the system further comprises a SHAP analysis visual output module 6, wherein the SHAP analysis visual output module 6 is in electrical signal connection with the automatic prediction module 5, and is used for obtaining a SHAP value of the feature in the automatic prediction module, and predicting the probability of the membranous nephropathy according to the obtained SHAP value of the feature in the patient data information to be retrieved.

Optionally, the prediction model building module 4 applies a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method.

Optionally, the data acquisition module 1 is configured to acquire laboratory detection indexes of a test population and laboratory detection indexes of a patient to be retrieved, where the detection results include MN and non-MN.

Optionally, in the data acquisition module 1, the inclusion condition and the exclusion criterion include: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN.

Optionally, the data preprocessing module 2 includes an abnormal value processing module, and when the data is an abnormal value, the corresponding patient data information is deleted.

Optionally,

the abnormal value processing module operates as follows:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

Optionally,

the abnormal value processing module adopts the 3-sigma (3σ) principle.

The invention provides a machine learning apparatus for predicting membranous nephropathy based on machine learning, the apparatus comprising a processor and a memory, the memory being configured to store instructions, the processor being configured to execute the instructions to implement the machine learning method according to any one of the preceding claims.

The method and the system for predicting membranous nephropathy based on machine learning comprise the following steps: step one, data acquisition, namely acquiring data information of patients who have undergone renal biopsy, wherein the detection results include MN and non-MN, the data information of patients satisfying the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, wherein X is a positive integer; step two, data preprocessing, wherein the data preprocessing comprises missing value processing, and the missing value processing comprises: through preliminary screening, deleting patient data information with a missing rate of more than 20%, and filling the remaining missing values by a random forest method, so as to obtain the patient data information of MN and non-MN and obtain Y characteristic indexes, wherein Y is a positive integer and Y is less than or equal to X; step three, feature screening, which comprises screening out Z characteristic indexes by a mutual information method, wherein Z is a positive integer and Z is less than or equal to Y, and performing dimensionality reduction on the Z characteristic indexes by a feature elimination method to obtain M characteristic indexes, wherein M is a positive integer and M is less than or equal to Z; step four, prediction model construction, namely using 70% of the patients for training and modeling and 30% of the patients as a verification set for verification, wherein the sample labels of the training set include MN and non-MN patients, and the prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method; step five, prediction model testing and evaluation, namely testing and evaluating the prediction model for predicting membranous nephropathy based on machine learning. In use, the data information of the patient to be retrieved is first acquired and input into the prediction model, and the prediction model predicts membranous nephropathy according to that data information. This addresses the problems of the prior art: with economic development and environmental change, the spectrum of renal disease has shifted, the incidence of membranous nephropathy has increased year by year, and membranous nephropathy now accounts for up to 18.42% of primary glomerular diseases. Early and accurate diagnosis is the foundation of treatment, and renal biopsy pathology remains the gold standard for diagnosing membranous nephropathy, but renal biopsy is an invasive procedure that carries a certain medical risk, and its application is limited by the available technical means and the conditions under which it can be performed. The method and the system for predicting membranous nephropathy based on machine learning can predict membranous nephropathy in advance, are helpful for early treatment and recovery of patients, and have important research significance and use value.

In a preferred embodiment of the present invention, in step two, the data preprocessing further includes abnormal value processing: when a data value is an abnormal value, the corresponding patient data information is deleted. Abnormal values, also called outliers, are unreasonable values in the data set, that is, erroneous numerical information; removing them effectively can greatly improve the prediction accuracy of the model.

In a preferred embodiment of the present invention, the method for processing the abnormal value, such as the simple identification method, includes:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained. It is known from experience that the BMI value should be positive, but if it is found that the BMI value is negative after simple recognition, it is known that there is an unreasonable value in the data set, which is an abnormal value, and it should be deleted. The age indicator is generally lower than 130, but when a simple identification is performed and the value of the age indicator is found to be 200, it can be known that an unreasonable value, which is an abnormal value, exists in the data set and should be deleted.

In a preferred embodiment of the invention, the outlier processing uses the 3-sigma (3σ) principle. When the data obey a normal distribution, it follows from the definition of the normal distribution that the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, which is a very small probability event, so by default a sample lying more than 3σ from the mean can be assumed not to occur. Therefore, when a sample lies more than 3σ from the mean, it is regarded as an outlier. In other words, if a continuous variable follows a normal distribution, the data fall within 3σ of the mean with high probability, and a sample beyond this distance is very likely an abnormal value.

In a preferred scheme of the invention, 70% of the patients are used for training and modeling and 30% of the patients are used as a verification set for verification; the sample labels of the training set include MN and non-MN patients; and the prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method. Among these, the CatBoost method shows the highest AUC, recall and precision, exceeding the other six methods. The variables are therefore ranked by feature importance (feature importances) using CatBoost, giving the following ranking: ALB, β2-MG, α-G, urine red blood cells, LAM, BUN, TC, 24-hour urine protein quantification, GFR, TP.

For the optimized model that incorporates PLA2R, RFE reduced the features to 3 indexes: PLA2R, ALB and β2-MG. GFR, TC and 24-hour urine protein quantification were then added on the basis of experience, so that 6 indexes were finally used for modeling. Seven methods, namely a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes and a traditional logistic regression method, were used to construct the MN optimized model, with 70% of the patients used for training and modeling and 30% for verification. Comparing the performance of the MN optimized models constructed by the different methods, CatBoost shows the highest AUC, 0.951, with an accuracy of 0.888, a recall of 0.869, an F1 value of 0.832 and a precision of 0.798, which is clearly higher than the other six methods. The feature ranking of the optimized model is: PLA2R, ALB, β2-MG, GFR, TC, 24-hour urine protein quantification. The optimized model incorporating PLA2R outperforms the first CatBoost prediction model.
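As a consistency check on the reported figures, the F1 value is the harmonic mean of precision and recall. The sketch below assumes that the figure of 0.798 is the precision and 0.869 is the recall; under that reading the reported F1 value of 0.832 is reproduced. The function name `f1_from` is a hypothetical helper for this example.

```python
def f1_from(precision, recall):
    """F1 value as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported for the CatBoost optimized model (0.798 read as precision).
f1 = f1_from(0.798, 0.869)
print(round(f1, 3))
```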

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without any creative effort.

FIG. 1 is a flow chart of a method of the present invention for predicting membranous nephropathy based on machine learning;

FIG. 2 is a flowchart and architecture diagram of a method for predicting membranous nephropathy based on machine learning according to the present invention;

FIG. 3 is a flowchart and an architecture diagram of a method for predicting membranous nephropathy based on machine learning according to an embodiment of the present invention;

FIG. 4 is a flowchart and architectural diagram of a method for predicting membranous nephropathy based on machine learning in accordance with a detailed embodiment of the present invention;

FIG. 5 is a graph illustrating the contribution of each feature index to the output of the prediction model in the method for predicting membranous nephropathy based on machine learning according to the embodiment of the present invention;

FIG. 6 is a comparison of the area under the ROC curve (AUC) of the prediction model constructed by seven methods of the simplified model for predicting membranous nephropathy based on machine learning in the embodiment of the present invention;

FIG. 7 is a comparison of the area under the ROC curve (AUC) of the optimization model constructed by seven methods of the simplified model for predicting membranous nephropathy based on machine learning in the embodiment of the present invention;

FIG. 8 is SHAP value of the feature of MN single sample in the method for predicting membranous nephropathy based on machine learning in the embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a system for predicting membranous nephropathy based on machine learning according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a method for predicting membranous nephropathy based on machine learning, which comprises the following steps as shown in figures 1-9:

step two, data preprocessing, wherein the data preprocessing comprises missing value processing, and the missing value processing comprises: through preliminary screening, deleting patient data information with a missing rate of more than 20%, and filling the remaining missing values by a random forest method, so as to obtain the patient data information of MN and non-MN and obtain Y characteristic indexes, wherein Y is a positive integer and Y is less than or equal to X;

acquiring data information of a patient to be retrieved, inputting that data information into the prediction model, and predicting membranous nephropathy by the prediction model according to that data information. Thus, in this specific embodiment of the invention, the data information of the patient to be retrieved is first acquired and input into the prediction model, and the prediction model predicts membranous nephropathy according to that data information. This addresses the problems of the prior art: with economic development and environmental change, the spectrum of renal disease has shifted, the incidence of membranous nephropathy has increased year by year, and membranous nephropathy now accounts for up to 18.42% of primary glomerular diseases. Early and accurate diagnosis is the basis of treatment, and renal biopsy pathology remains the gold standard for diagnosing membranous nephropathy, but renal biopsy is an invasive procedure that carries a certain medical risk, and its application is limited by the available technical means and the conditions under which it can be performed. The method for predicting membranous nephropathy based on machine learning can predict membranous nephropathy in advance, is helpful for early treatment and recovery of patients, and has important research significance and use value.

It should be noted that the method for filling the missing value is not limited to the random forest method, and may be other filling methods, which are within the protection scope of the technical solution of the present invention.
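The preliminary screening by missing rate can be sketched as follows. This is a minimal illustration under stated assumptions: the 20% threshold is applied per characteristic index (which is one reading of the text, consistent with Y being less than or equal to X), missing values are represented by `None`, the function name `drop_sparse_features` is hypothetical, and the lab values are made-up toy data. The subsequent filling step (random forest or another method) is not shown.

```python
def drop_sparse_features(rows, max_missing_rate=0.20):
    """Keep only features whose missing rate does not exceed the threshold.

    rows: list of dicts mapping feature name -> value, with None meaning missing.
    Returns the filtered rows and the list of retained feature names.
    """
    features = sorted({name for row in rows for name in row})
    kept = []
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing_rate:  # "more than 20%" is deleted
            kept.append(name)
    filtered = [{name: row.get(name) for name in kept} for row in rows]
    return filtered, kept


# Toy records with made-up values; LAM is missing in 3 of 5 patients.
rows = [
    {"ALB": 25.1, "TC": 7.2, "LAM": None},
    {"ALB": 30.4, "TC": None, "LAM": None},
    {"ALB": 28.0, "TC": 6.1, "LAM": 1.2},
    {"ALB": 22.7, "TC": 5.9, "LAM": None},
    {"ALB": 35.2, "TC": 6.8, "LAM": 0.9},
]
filtered, kept = drop_sparse_features(rows)
print(kept)
```

In this toy data TC is missing in exactly 20% of the records and is therefore retained, since only rates above 20% are deleted.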

Feature screening is a very important part of data processing. After data preprocessing and cleaning, salient features are retained and non-salient features are discarded, so that deleting redundant features reduces unnecessary computation during model building. Feature screening focuses on finding a small number of features that substantially improve model performance; done well, it can markedly improve the overall performance and stability of the model, and sometimes enables a simple model to outperform a complex one.
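To illustrate the mutual information screening, the sketch below computes the mutual information between a discrete feature and the MN / non-MN label and ranks features by it. It is a simplified illustration, not the procedure of the invention: it assumes features have already been discretized (continuous laboratory values would first need binning or a dedicated estimator), and the function name, feature names and toy data are all made up for the example.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi


# Toy data: 1 = MN, 0 = non-MN.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
feat_a = [1, 1, 1, 1, 0, 0, 0, 0]  # perfectly informative about the label
feat_b = [1, 0, 1, 0, 1, 0, 1, 0]  # statistically independent of the label

mi_scores = {
    "feat_a": mutual_information(feat_a, labels),
    "feat_b": mutual_information(feat_b, labels),
}
ranked = sorted(mi_scores, key=mi_scores.get, reverse=True)
print(ranked)
```

Features with the highest mutual information would be retained as the Z candidate indexes before the feature elimination step.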

In an embodiment of the invention, in the data acquisition of step one, laboratory detection indexes of the test population and laboratory detection indexes of the patient to be retrieved are acquired, and the detection results include MN and non-MN. The richer the laboratory detection indexes of the test population, the more accurate the prediction model obtained from them.

In an embodiment of the present invention, in the data acquisition of step one, the inclusion conditions and exclusion criteria include: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN. Data information of patients who have undergone renal biopsy is collected, the detection results including MN and non-MN; the data information of patients satisfying the inclusion conditions and exclusion criteria is included, and the corresponding characteristic indexes are extracted by an SQL method:

if the patient is under 18 years of age, the corresponding patient data are rejected,

if the patient is pregnant, the corresponding patient data are rejected,

if the patient is lactating, the corresponding patient data are rejected,

if the patient has a malignant tumor, the corresponding patient data are rejected,

if the patient is a medical emergency patient, the corresponding patient data are rejected,

if the patient has an infectious disease, the corresponding patient data are rejected,

if the patient has SMN, the corresponding patient data are rejected.

SMN refers to secondary membranous nephropathy.

If a patient has one or more of the above conditions, the corresponding patient data are rejected. This prevents index data from being distorted by these special conditions, avoids their interference with the prediction model, and improves accuracy.
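The exclusion criteria above lend themselves to a single SQL filter. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely for illustration; the table name `biopsy_patients`, every column name, and the five toy rows are hypothetical and do not come from the invention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE biopsy_patients (
        patient_id   INTEGER PRIMARY KEY,
        age          INTEGER,
        pregnant     INTEGER,  -- 1 = yes, 0 = no
        lactating    INTEGER,
        malignancy   INTEGER,
        emergency    INTEGER,
        infectious   INTEGER,
        secondary_mn INTEGER,  -- SMN flag
        diagnosis    TEXT      -- 'MN' or 'non-MN'
    )
    """
)
conn.executemany(
    "INSERT INTO biopsy_patients VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 45, 0, 0, 0, 0, 0, 0, "MN"),
        (2, 16, 0, 0, 0, 0, 0, 0, "non-MN"),  # rejected: under 18
        (3, 52, 0, 0, 1, 0, 0, 0, "MN"),      # rejected: malignant tumor
        (4, 38, 0, 0, 0, 0, 0, 1, "MN"),      # rejected: SMN
        (5, 61, 0, 0, 0, 0, 0, 0, "non-MN"),
    ],
)

# A patient is kept only if none of the exclusion criteria applies.
included = conn.execute(
    """
    SELECT patient_id, diagnosis
    FROM biopsy_patients
    WHERE age >= 18
      AND pregnant = 0 AND lactating = 0 AND malignancy = 0
      AND emergency = 0 AND infectious = 0 AND secondary_mn = 0
    ORDER BY patient_id
    """
).fetchall()
print(included)
```

In a real hospital database the flags would typically be derived from diagnosis codes or clinical records rather than stored as ready-made columns.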

In an embodiment of the present invention, in step two, the data preprocessing further includes abnormal value processing: when a data value is an abnormal value, the corresponding patient data information is deleted. Abnormal values, also called outliers, are unreasonable values in the data set, that is, erroneous numerical information; removing them effectively can greatly improve the prediction accuracy of the model.

In an embodiment of the present invention, the outlier processing comprises:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

It is known from experience that the BMI value should be positive, but if it is found that the BMI value is negative after simple recognition, it is known that there is an unreasonable value in the data set, which is an abnormal value, and it should be deleted. The age indicator is generally lower than 130, but when a simple identification is performed and the value of the age indicator is found to be 200, an unreasonable value, which is an abnormal value, in the data set can be known and should be deleted. If the value of the age index is found to be negative, it can be known that there is an unreasonable value in the data set, which is an abnormal value, and it should be deleted.
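The simple identification rules described above can be sketched as a small plausibility check. The function name `is_abnormal`, the record layout and the toy records are assumptions made for this example; the limit of 130 years follows the text.

```python
def is_abnormal(record, max_age=130):
    """Return True if BMI is negative, or age is negative or not below max_age."""
    bmi = record["bmi"]
    age = record["age"]
    return bmi < 0 or age < 0 or age >= max_age


# Toy records with made-up values.
records = [
    {"id": 1, "bmi": 24.3, "age": 52},
    {"id": 2, "bmi": -21.0, "age": 47},  # negative BMI
    {"id": 3, "bmi": 27.8, "age": 200},  # age of 200
    {"id": 4, "bmi": 22.1, "age": -3},   # negative age
]

# Delete the patient data information that contains an abnormal value.
clean = [r for r in records if not is_abnormal(r)]
print([r["id"] for r in clean])
```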

Outlier processing may also employ the 3σ principle. When the data obey a normal distribution, the probability that a value lies more than 3σ from the mean is P(|x - μ| > 3σ) ≤ 0.003, which is a very small probability event. A sample lying more than 3σ from the mean can therefore be assumed by default not to occur, so when a sample is more than 3σ away from the mean, it is determined to be an abnormal value. That is, if a continuous variable obeys a normal distribution, the data fall within 3σ of the mean with high probability, and a sample beyond this distance is very likely to be an abnormal value.

For a normal distribution, the "σ principle", "2σ principle" and "3σ principle" are respectively:

σ principle: the probability that a value falls in (μ - σ, μ + σ) is 0.6826;

2σ principle: the probability that a value falls in (μ - 2σ, μ + 2σ) is 0.9544;

3σ principle: the probability that a value falls in (μ - 3σ, μ + 3σ) is 0.9974;

where σ is the standard deviation and μ is the mean of the normal distribution, and x = μ is the axis of symmetry of the density curve.

In the basic idea of hypothesis testing, a "small probability event" generally refers to an event that occurs with a probability of less than 5%, and such an event is considered almost impossible to occur in a single trial.

Since the probability that a value falls outside (μ - 3σ, μ + 3σ) is less than three thousandths, the corresponding event is considered not to occur in practical problems, and the interval (μ - 3σ, μ + 3σ) can essentially be regarded as the range of values that the random variable X actually takes. This is called the "3σ principle" of the normal distribution.
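The following sketch illustrates the 3σ rule described above. The coverage probabilities are computed from the error function and agree with the figures quoted in the text to about three decimal places. The function names and the sample data are illustrative assumptions, and the population standard deviation is used as the estimate of σ.

```python
import math
import statistics

def coverage(k: float) -> float:
    """P(|X - mu| < k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

def three_sigma_outliers(values):
    """Return the values lying more than 3 sigma away from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mu) > 3 * sigma]

# Twenty plausible measurements plus one gross error (illustrative data only).
sample = [60.0 + (i % 5) for i in range(20)] + [200.0]
outliers = three_sigma_outliers(sample)
```

Note that with very few samples no single value can exceed 3σ, so the rule is only meaningful for reasonably large data sets.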

In an embodiment of the present invention, as shown in fig. 2-4, the method comprises the following steps:

Step four, construction of the simplified prediction model: 70% of the patients are used as the training set for training and modeling, and 30% of the patients are used as the validation set for verification. The simplified prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method, and the sample labels of the training set comprise MN and non-MN patients.

Specifically, in the embodiment of the present invention, in step one, data acquisition is performed: the number of patients is 10881, and the detection results comprise MN and non-MN. Patient data information meeting the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, X being a positive integer. The number of patients after the data acquisition process is A, A being a positive integer. The exclusion criteria are as follows:

patients aged under 18 years excluded: C cases;

pregnant or lactating women excluded: D cases;

patients with malignant tumors excluded: E cases;

medical emergencies excluded: F cases;

infectious diseases excluded: G cases; SMN excluded: H cases.

C, D, E, F, G and H are all positive integers. In a specific embodiment of the present invention, the exclusions are:

patients aged under 18 years excluded: 632 cases;

pregnant or lactating women excluded: 21 cases;

patients with malignant tumors excluded: 577 cases;

infectious diseases excluded: 24 cases; SMN excluded: 103 cases.

After the inclusion conditions and exclusion criteria are applied, the data information of 9524 patients is included;

139 clinical indexes (demographic data, laboratory and pathological indexes) are extracted by an SQL method, and pathological diagnosis is extracted according to the first page of a medical record.
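The cohort counts reported for step one can be cross-checked with simple arithmetic. The sketch below assumes that the exclusion groups do not overlap and uses only the counts stated in the text (no count is given for medical emergencies).

```python
initial_patients = 10881

# Exclusion counts as stated in the specific embodiment.
excluded = {
    "age under 18 years": 632,
    "pregnant or lactating": 21,
    "malignant tumor": 577,
    "infectious disease": 24,
    "SMN": 103,
}

# Assumes each excluded patient is counted in exactly one group.
remaining_patients = initial_patients - sum(excluded.values())
```

Under that assumption the result equals the 9524 included patients reported above.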

Step two, data preprocessing: data preprocessing is performed on the data information of the A patients and comprises missing value processing. In the missing value processing, patient data information with a missing rate of more than 20% is deleted through preliminary screening, and the remaining missing values are filled by a random forest method, giving the patient data information of the MN and non-MN groups and Y characteristic indexes, where Y is a positive integer and Y ≤ X. In the embodiment of the present invention, indexes and patients with a missing rate of more than 20% are deleted; 8840 patients (2402 in the MN group and 6438 in the non-MN group) and 83 indexes are retained, and the missing values are filled to obtain the Y characteristic indexes.
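The missing-rate screening in step two might look like the sketch below. It is an illustration only: the order of operations (indexes first, then patients) is an assumption, the column names are hypothetical, `None` is used to mark a missing value, and the subsequent random forest imputation is not shown.

```python
def missing_rate(values) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse(rows, threshold=0.20):
    """Drop indexes (columns), then patients (rows), whose missing rate exceeds threshold."""
    columns = list(rows[0].keys())
    kept_columns = [
        c for c in columns
        if missing_rate([r[c] for r in rows]) <= threshold
    ]
    return [
        {c: r[c] for c in kept_columns}
        for r in rows
        if missing_rate([r[c] for c in kept_columns]) <= threshold
    ]

rows = [
    {"ALB": 30.1, "TP": 55.0, "X": None},
    {"ALB": 41.0, "TP": 70.2, "X": 1.0},
    {"ALB": None, "TP": 60.3, "X": None},
    {"ALB": 28.4, "TP": 52.1, "X": None},
    {"ALB": 35.0, "TP": 66.0, "X": 2.0},
]
cleaned = drop_sparse(rows)  # column "X" (60% missing) and the third row are removed
```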

Step three, feature screening: Z characteristic indexes are screened out by the mutual information method, where Z is a positive integer and Z ≤ Y. Dimensionality reduction is then performed on the Z characteristic indexes by the feature elimination method to obtain M characteristic indexes, namely ALB, β2-MG, α-G, urine red blood cells, LAM, BUN and TP, where M = 7 and Z ≥ 7. Three further characteristic indexes, TC, 24-hour urine protein quantification and GFR, are added, so that 10 indexes are used for constructing the simplified prediction model;
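To illustrate the idea behind mutual information screening, the sketch below computes the mutual information between a discrete feature and the MN label. It is a simplified illustration, not the estimator used in the embodiment: continuous laboratory indexes would first need to be discretized or handled by a different estimator, which the source does not specify, and the variable names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys) -> float:
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

label = [1, 1, 0, 0]          # 1 = MN, 0 = non-MN (toy data)
informative = [1, 1, 0, 0]    # perfectly aligned with the label
uninformative = [1, 0, 1, 0]  # independent of the label

mi_informative = mutual_information(informative, label)
mi_uninformative = mutual_information(uninformative, label)
```

Features would then be ranked by this score and the highest-scoring ones kept before the feature elimination step.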

Regarding optimization model construction:

the method comprises the following steps:

step one, data acquisition: the data information of patients who have undergone renal biopsy is acquired. The number of patients is 10881, and the detection results comprise MN and non-MN. Patient data information meeting the inclusion conditions and exclusion criteria is included, and X characteristic indexes are extracted by an SQL method, X being a positive integer. Patient data information containing PLA2R detection is then included; the number of patients after the data acquisition process is A', A' being a positive integer. The exclusion criteria are:

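patients aged under 18 years excluded: C cases;

pregnant or lactating women excluded: D cases;

patients with malignant tumors excluded: E cases;

medical emergencies excluded: F cases;

infectious diseases excluded: G cases; SMN excluded: H cases.

In a specific embodiment of the present invention, the exclusions are:

patients aged under 18 years excluded: 632 cases;

pregnant or lactating women excluded: 21 cases;

patients with malignant tumors excluded: 577 cases;

infectious diseases excluded: 24 cases; SMN excluded: 103 cases;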

extracting 139 clinical indexes (demographic data, laboratory and pathological indexes) by SQL method and extracting pathological diagnosis according to the first page of the medical record;

patient data information including PLA2R test, patient number 2527;

step two, data preprocessing: data preprocessing is performed on the data information of the A' patients and comprises missing value processing. In the missing value processing, patient data information with a missing rate of more than 20% is deleted through preliminary screening, and 2457 cases (785 in the MN group and 1672 in the non-MN group) are retained. The missing values of the 93 retained indexes are filled by a random forest method, giving the patient data information of the MN and non-MN groups and Y' characteristic indexes, Y' being a positive integer.

Step three, feature screening: Z' characteristic indexes are screened out by the mutual information method, where Z' is a positive integer and Z' ≤ Y'. Dimensionality reduction is then performed on the Z' characteristic indexes by the feature elimination method to obtain M' characteristic indexes, namely PLA2R, ALB and β2-MG, where M' = 3 and Z' ≥ 3. Three further characteristic indexes, TC, 24-hour urine protein quantification and GFR, are added for constructing the optimized prediction model. Specifically, in the specific embodiment of the present invention, 47 characteristic indexes are screened out by the mutual information method, and dimensionality reduction by the feature elimination method yields 3 characteristic indexes (PLA2R, ALB, β2-MG); together with the 3 empirically selected indexes (TC, 24-hour urine protein quantification, GFR), 6 characteristic indexes are obtained.

In an embodiment of the present invention, the characteristic indexes include, in part: sex, age, body mass index, blood pressure, history of smoking and drinking, history of hypertension, history of coronary heart disease, history of diabetes, presence or absence of diabetic retinopathy, gout, hyperlipidemia, history of stroke, presence or absence of fatty liver, history of hepatitis B, hepatitis C, etc., red blood cell count (RBC), hemoglobin (Hb), hematocrit (Hct), mean corpuscular volume (MCV), white blood cell count (WBC), neutrophils, lymphocytes, monocytes, eosinophils, basophils, platelets (PLT), total serum protein (TP), serum albumin (ALB), alanine aminotransferase (ALT), aspartate aminotransferase (AST), gamma-glutamyl transferase, alkaline phosphatase, total bilirubin, direct bilirubin, glucose, creatinine, blood urea nitrogen (BUN), uric acid (UA), GFR, total cholesterol (TC), triglyceride (TG), high density lipoprotein (HDL), low density lipoprotein (LDL), potassium ion, sodium ion, calcium ion, phosphorus ion, C-reactive protein, complement C3 assay, complement C4 assay, IgA assay, IgE assay, IgG assay, IgM assay, Ig light chain KAP assay, Ig light chain LAM assay, blood β2-microglobulin assay, prealbumin assay, β1 globulin, β2 globulin, α1 globulin, α2 globulin, γ globulin, anti-double-stranded DNA antibody (A-dsDNA), anti-Sm antibody, hepatitis B surface antigen (luminometry), thrombin time assay, plasma activated partial thromboplastin time assay, plasma prothrombin activity assay, international normalized ratio, plasma fibrinogen assay, plasma D-dimer assay, plasma antithrombin III assay, plasma thrombin, thrombin inhibitor, ferritin, urine red blood cells, urine protein qualitative test, urine specific gravity, urine pH measurement, urine protein quantification, urine α1-microglobulin, urine β2-microglobulin, urine IgG, urine Ig light chain KAP, urine Ig light chain LAM, anti-phospholipase A2 receptor antibody (PLA2R), and the like, but are not limited thereto.

In the embodiment of the invention, in the fifth step, AUC is used as an evaluation index, and the prediction model for predicting membranous nephropathy based on machine learning is tested and evaluated.

The invention also provides a system for predicting membranous nephropathy based on machine learning, as shown in fig. 9, the system comprises the following modules:

the data preprocessing module 2 is used for cleaning, deleting and filling the patient data information, the data preprocessing module 2 comprises a missing value processing module 201, the missing value processing module 201 is used for deleting the patient data information with the missing rate larger than 20% through preliminary screening, and filling the missing values by adopting a random forest method to obtain the patient data information of MN and non-MN;

the characteristic screening module 3 is used for screening and ranking the characteristic indexes: Z characteristic indexes are screened out by the mutual information method, where Z is a positive integer and Z ≤ Y, and dimensionality reduction is performed on the Z characteristic indexes by the feature elimination method to obtain M characteristic indexes, where M is a positive integer and M ≤ Z;

the automatic prediction module 5: the data acquisition module 1, the data preprocessing module 2, the feature screening module 3, the prediction model building module 4 and the automatic prediction module 5 are sequentially connected through electric signals; the data information of the patient to be retrieved is acquired and input to the automatic prediction module 5 for prediction. That is, the data information of the patient to be retrieved is acquired and input into the prediction model, and the prediction model predicts membranous nephropathy according to that data information. The specific embodiment of the invention can thereby address the problems in the prior art: with the development of the economy and changes in the environment, the incidence rate of membranous nephropathy is increasing year by year, and membranous nephropathy accounts for up to 18.42% of primary glomerular diseases. Early accurate diagnosis is the basis of treatment, and renal biopsy pathology is still the gold standard for the diagnosis of membranous nephropathy, but renal biopsy is an invasive operation with certain medical risks, and limitations in technical means and in the conditions for carrying it out also limit its application. The method for predicting membranous nephropathy based on machine learning can predict membranous nephropathy in advance, is helpful for the early treatment and recovery of patients, and has important research significance and use value.

It should be noted that the system for predicting membranous nephropathy based on machine learning includes a data acquisition module 1, a data preprocessing module 2, a feature screening module 3, a prediction model construction module 4 and an automatic prediction module 5, but is not limited to these modules. They are only a specific embodiment of the technical solution of the present invention, and further expanding the system or adding modules falls within the protection scope of the present application.

It should be noted that the prediction model building module 4 uses 70% of the patients for training and modeling and 30% of the patients as the validation set for verification. This is a specific embodiment of the technical solution of the present invention; the split is not limited to 70% and 30%, and other proportions may also be used.

In an embodiment of the present invention, the system for predicting membranous nephropathy based on machine learning further includes a SHAP analysis visual output module 6, which is in electrical signal connection with the automatic prediction module 5 and is configured to obtain the SHAP values of the features in the automatic prediction module and to predict the probability of membranous nephropathy according to the SHAP values of the features in the data information of the patient to be retrieved. The SHAP values of the features in the prediction model are based on the Shapley value. The Shapley value method means that what each participant receives equals what that participant contributes; it is a distribution method that is widely used for the reasonable distribution of benefits in economic activities and the like. It was originally proposed by Professor Lloyd Shapley of the University of California, Los Angeles, USA. The Shapley value method was an important theoretical breakthrough for cooperative game theory and greatly influenced its later development. In the embodiment of the invention, SHAP values are used to describe the contribution and influence of the features. As shown in fig. 5, the probability of MN is predicted according to the SHAP values of the features in the data to be predicted.
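The Shapley value underlying SHAP can be illustrated on a toy cooperative game. The sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings; the two players "A" and "B" and the payoff table are invented for illustration and do not come from the embodiment, where the players would be features and the payoff the model output.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for p in ordering:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical payoff of each coalition in a two-player game.
WORTH = {
    frozenset(): 0,
    frozenset({"A"}): 10,
    frozenset({"B"}): 20,
    frozenset({"A", "B"}): 50,
}

def toy_value(coalition):
    return WORTH[frozenset(coalition)]

phi = shapley_values(["A", "B"], toy_value)
```

The values sum to the payoff of the full coalition, which is the property that lets SHAP split a single prediction into per-feature contributions. Exact enumeration grows factorially with the number of players, so practical SHAP tooling relies on model-specific or approximate algorithms.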

Negative values on the left of the abscissa indicate a low probability of being predicted as MN, and positive values on the right indicate a high probability. The ordinate lists the features (variables), ordered from top to bottom by decreasing contribution. Dark colors (red, R) indicate high variable values, and light colors (blue, B) indicate low variable values. It can be seen from the figure that ALB contributes most to the prediction model, and the lower the ALB value, the higher the risk of MN; the lower the β2-MG value, the higher the risk of MN; the higher the α-G value, the higher the risk of MN; the lower the urinary red blood cell value, the higher the risk of MN; the lower the LAM value, the higher the risk of MN; the lower the BUN value, the higher the risk of MN; the higher the TC value, the higher the risk of MN; the higher the 24-hour urine protein quantitative value, the higher the risk of MN; the higher the GFR value, the higher the risk of MN; the lower the TP value, the higher the risk of MN.

The contributions of ALB and β2-MG rank in the top three in both models. In the optimized model, PLA2R contributes most to the model, and a patient with a higher PLA2R value has a higher probability of being diagnosed as MN; ALB ranks second, and the lower the ALB value, the higher the probability that the patient is judged to be MN, i.e. diagnosed as MN.

It should be noted that embodiments of the present invention are predictive rather than diagnostic, and serve as a reference for improving clinical and testing strategies. For example, when the predicted probability of MN is greater than the threshold, the patient may be advised to undergo a renal biopsy for a definitive decision.

In the embodiment of the present invention, the prediction model construction module 4 applies a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method. CatBoost, whose name combines Categorical and Boosting, is a GBDT (gradient boosting decision tree) framework based on symmetric decision trees (oblivious trees). It has few parameters, supports categorical variables and achieves high accuracy. The main pain point it addresses is the efficient and reasonable handling of categorical features; it also addresses gradient bias and prediction shift, which reduces overfitting and improves the accuracy and generalization capability of the algorithm.

AdaBoost (adaptive boosting) is a boosting method that combines a plurality of weak classifiers into a strong classifier. It is adaptive in the following sense: the weights of the samples misclassified by the previous weak classifier are increased, and the samples with updated weights are used to train the next weak classifier. In each round of training, a new weak classifier is trained on the whole sample set, producing new sample weights and the voting weight of that weak classifier, and the iteration continues until a preset error rate or a specified maximum number of iterations is reached.
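One round of the weight update described above can be sketched as follows. This uses the classic discrete AdaBoost formulas with labels in {-1, +1}; the source does not state which AdaBoost variant was used, so this is an assumption, and the toy predictions are illustrative.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One discrete AdaBoost round.

    Returns (alpha, new_weights), where alpha is the voting weight of the
    weak classifier. Requires a weighted error strictly between 0 and 1.
    """
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (y * p == -1) are scaled up, correct ones scaled down.
    raw = [w * math.exp(-alpha * y * p)
           for w, p, y in zip(weights, predictions, labels)]
    normalizer = sum(raw)
    return alpha, [r / normalizer for r in raw]

labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]        # the last sample is misclassified
initial_weights = [0.25] * 4

alpha, new_weights = adaboost_round(initial_weights, predictions, labels)
```

After the update the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.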

Extreme gradient boosting (XGBoost) is a tree ensemble method, sometimes described as an upgrade of Random Forest, in which the models (decision trees in our experiments) are built sequentially so as to minimize errors and maximize the impact of the best models. To minimize the error, a gradient descent algorithm is applied. In addition, mechanisms for avoiding overfitting are applied, such as tree pruning and regularization.

The Support Vector Machine (SVM) algorithm is a binary classifier that maps input data into a very high dimensional feature space through a non-linear transformation (also known as the kernel trick) and applies a linear decision surface in the feature space to distinguish between the two classes.

Gaussian Naive Bayes applies Bayes' theorem under the assumption that, given the class value, every pair of features is conditionally independent.

The neural network model is a (significant) enhancement of the logistic regression method. Like logistic regression, it forms a linear combination of the features and applies a (non-linear) transformation to the result. The enhancement consists of stacking several such transformations into layers, giving several hidden layers (in addition to the input layer, i.e. the feature layer, and the output layer, i.e. the class layer), each representing a different level of abstraction. In addition, several mechanisms may be applied to improve performance and prevent overfitting, such as regularization, dropout, batch training, and the like.

In the embodiment of the invention, the data acquisition module 1 is used for acquiring laboratory detection indexes of the tested population and of the patient to be retrieved, and the detection results comprise MN and non-MN.

In an embodiment of the present invention, in the data acquisition module 1, the inclusion condition and the exclusion criterion include: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN.

According to the inclusion condition and the exclusion standard, the patient data information meeting the inclusion condition and the exclusion standard is included, and the corresponding characteristic indexes are extracted by an SQL method:

if the age is below 18 years, its corresponding patient data will be culled,

if the pregnant woman is pregnant, the corresponding patient data is rejected,

if the woman is nursing, the corresponding patient data will be eliminated,

if the patient has an infectious disease, the corresponding patient data is rejected,

if the patient has SMN, the corresponding patient data is rejected.

If a patient has one or more of the above conditions, the corresponding patient data is rejected. This prevents index data that are distorted by these special conditions from interfering with the prediction model, and thereby improves its accuracy.

In a specific embodiment of the present invention, the exclusion criteria are:

patients aged under 18 years excluded: C cases;

pregnant or lactating women excluded: D cases;

patients with malignant tumors excluded: E cases;

medical emergencies excluded: F cases;

infectious diseases excluded: G cases; SMN excluded: H cases.

Specifically:

patients aged under 18 years excluded: 632 cases;

pregnant or lactating women excluded: 21 cases;

patients with malignant tumors excluded: 577 cases;

infectious diseases excluded: 24 cases; SMN excluded: 103 cases;

patient data information including PLA2R test, patient number 2527;

The data preprocessing module performs data preprocessing on the data information of the A' patients, and the data preprocessing comprises missing value processing. In the missing value processing, patient data information with a missing rate of more than 20% is deleted through preliminary screening, and 2457 cases (785 in the MN group and 1672 in the non-MN group) are retained. The missing values of the 93 retained indexes are filled by a random forest method, giving the patient data information of the MN and non-MN groups and Y' characteristic indexes, Y' being a positive integer.

The characteristic screening module screens out Z' characteristic indexes by the mutual information method, where Z' is a positive integer and Z' ≤ Y'. Dimensionality reduction is then performed on the Z' characteristic indexes by the feature elimination method to obtain M' characteristic indexes, namely PLA2R, ALB and β2-MG, where M' = 3 and Z' ≥ 3. Three further characteristic indexes, TC, 24-hour urine protein quantification and GFR, are added for constructing the optimized prediction model. Specifically, in the embodiment of the present invention, 47 characteristic indexes are screened out by the mutual information method, and dimensionality reduction by the feature elimination method yields 3 characteristic indexes (PLA2R, ALB, β2-MG); together with the 3 empirically selected indexes (TC, 24-hour urine protein quantification, GFR), 6 characteristic indexes are obtained.

The prediction model construction module uses 70% of the patients for training and modeling and 30% of the patients as the validation set for verification. The sample labels of the training set comprise MN and non-MN patients, and the optimized prediction model is obtained by a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method.

In an embodiment of the present invention, the data preprocessing module 2 includes an abnormal value processing module, which deletes the corresponding patient data information when a data value is abnormal. Abnormal values, also called outliers, are unreasonable values in the data set, that is, erroneous numerical information. Removing them effectively can greatly improve the prediction accuracy of the model.

In an embodiment of the present invention, the abnormal value processing module performs the following:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

Experience shows that the BMI value should be positive. If simple inspection finds a negative BMI value, the data set contains an unreasonable value, which is an abnormal value and should be deleted. Likewise, the age index is generally lower than 130; if simple inspection finds an age value of 200, or a negative age value, the data set contains an unreasonable value, which is an abnormal value and should be deleted.

In an embodiment of the invention, the abnormal value processing module may also employ the 3σ principle.

3σ principle: when the data obey a normal distribution, the probability that a value lies more than 3σ from the mean is P(|x - μ| > 3σ) ≤ 0.003, which is a very small probability event. A sample lying more than 3σ from the mean can therefore be assumed by default not to occur, so when a sample is more than 3σ away from the mean, it is determined to be an abnormal value. That is, if a continuous variable obeys a normal distribution, the data fall within 3σ of the mean with high probability, and a sample beyond this distance is very likely to be an abnormal value.

The embodiment of the invention adopts AUC as the model evaluation index. AUC is an evaluation index for measuring the quality of a binary classification model and represents the probability that a positive example is ranked before a negative example by the predicted score. The confusion matrix is commonly used for measuring the prediction effect of a classification model; the accuracy, recall, F1 score and precision derived from it, together with AUC, are common model evaluation indexes in machine learning. The structure of the confusion matrix is shown in table 1 below:

TABLE 1

Accuracy is the proportion of correctly classified samples among all samples. Precision is the ratio of correctly classified positive samples to the number of samples predicted as positive, and is suitable when the correctness of positive predictions is important. Recall is the proportion of correctly classified positive samples among all (true) positive samples. The F1 score (F1-score) is the harmonic mean of precision and recall; when it is unclear which of precision and recall matters more, the F1 score can be used in their place, and its value lies closer to the smaller of the two. The area under the receiver operating characteristic curve (AUC) is the area under the ROC curve and is an index for measuring the performance of a learner. The ROC curve describes the relation between the true positive rate and the false positive rate of the classification model; the AUC generally ranges from 0.5 to 1, and the larger the AUC, the better the prediction effect of the model. The evaluation comparison results of the seven machine learning models in the embodiment of the present invention are shown in table 2 below.

TABLE 2 Performance comparison of seven methods to construct a prediction model

As can be seen from table 2, for the seven methods CatBoost, XGBoost, AdaBoost, ANN, Naive Bayes, SVM and Logistic Regression, the accuracy values are respectively 0.845, 0.832, 0.801, 0.772, 0.798, 0.767 and 0.816, and the AUC values are respectively 0.916 (95% CI: [0.904, 0.928]), 0.899 (95% CI: [0.886, 0.913]), 0.879 (95% CI: [0.865, 0.893]), 0.894 (95% CI: [0.881, 0.908]), 0.853 (95% CI: [0.838, 0.869]), 0.822 (95% CI: [0.804, 0.839]) and 0.86 (95% CI: [0.846, 0.875]). Among them, the accuracy of the CatBoost method, 0.845, is the highest. The CatBoost method also shows the highest AUC (0.916), recall (0.869), F1 value (0.757) and precision (0.671) of the seven methods.
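The evaluation indexes defined above can be computed from the four confusion matrix counts, as in the sketch below. The counts in the example are invented for illustration. As a consistency check, the F1 value follows from the reported precision and recall of the CatBoost model.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only; the confusion matrix of the embodiment is not reproduced here.
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# Check against the precision (0.671) and recall (0.869) stated in the text.
reported_f1 = round(f1_from(0.671, 0.869), 3)
```

The recomputed value matches the F1 value of 0.757 reported for the CatBoost model.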

Therefore, in the embodiment of the invention, the variables are ranked by feature importance (feature importances). The importance of a feature represents its weight in the overall model: the higher the importance, the greater the role the feature plays in predicting MN. The feature importance ranking is: ALB, β2-MG, α-G, urine red blood cells, LAM, BUN, TC, 24-hour urine protein quantification, GFR, TP.

As shown in fig. 2-4, when the 2527 patients with PLA2R detection are included, the indexes and patients with a missing rate > 20% are deleted, and 2457 patients (785 in the MN group, 1672 in the non-MN group) and 92 indexes are retained for data filling. 47 indexes are screened out by MI, and RFE reduces the dimensionality to 3 indexes: PLA2R, ALB and β2-MG. GFR, TC and 24-hour urinary protein quantification are then included according to experience, so that 6 indexes are finally included to construct the model. The MN optimization model was constructed using seven methods, with 1719 (70%) patients used for training and modeling and 738 (30%) patients used for verification; the evaluation comparison results of the seven machine learning optimization models are shown in fig. 7. The performance comparison of the seven methods for constructing the optimization model is shown in table 3.
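A 70/30 split of the 2457 retained patients can be sketched as follows. The shuffling, the seed and the function name are assumptions made for illustration; the source does not say how patients were assigned to the two sets (for example, whether the split was stratified by MN status).

```python
import random

def split_patients(patient_ids, train_percent=70, seed=0):
    """Shuffle the patient ids and split them into training and validation sets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    n_train = len(ids) * train_percent // 100
    return ids[:n_train], ids[n_train:]

train_ids, validation_ids = split_patients(range(2457))
```

With 2457 patients this yields 1719 training and 738 validation patients, the same sizes as reported above.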

TABLE 3 Performance comparison of seven methods to construct an optimization model

Feature ranking of the optimization model: PLA2R, ALB, β2-MG, GFR, TC, 24-hour urinary protein quantification. Comparing the performance of the MN optimization models constructed by the different methods, CatBoost shows the highest AUC, 0.951, and its accuracy of 0.888, recall of 0.869, F1 value of 0.832 and precision of 0.798 are markedly higher than those of the other six methods. The optimized CatBoost model incorporating PLA2R shows better prediction performance than the first (simplified) CatBoost prediction model.
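Since AUC is the probability that a positive example is ranked before a negative one, it can be computed directly by comparing every positive/negative pair, as in the sketch below. The scores and labels are toy values, and counting ties as one half is a common convention assumed here rather than something stated in the source.

```python
def auc_by_ranking(scores, labels) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.4, 0.3]  # predicted probability of MN (toy values)
labels = [1, 0, 1, 0]          # 1 = MN, 0 = non-MN
auc = auc_by_ranking(scores, labels)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. The pairwise loop is quadratic in the number of samples, so library implementations use a sorting-based computation instead.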

The SHAP (SHapley Additive exPlanations) summary plot describes the magnitude of the feature values in the model of the embodiment of the invention and the contribution of each feature to the model; the SHAP value reflects the contribution of each feature in the CatBoost model. The study used SHAP values for the full sample and for single samples, which present the MN diagnosis risk for all patients and for an individual patient, respectively. The SHAP values of the features of a single sample are shown in FIG. 8.

The baseline is the base value. Dark-colored (red, R) features increase the probability that the sample is predicted to be a positive sample, i.e. predicted as MN, while light-colored (blue, B) features decrease that probability; the length of each bar represents the size of the feature's influence. As is evident from the figure, the main dark-colored contribution for this patient is PLA2R and the main light-colored contribution is eGFR; the dark contribution is much larger than the light one, so the probability that this individual is predicted to be MN is high. The larger the SHAP value of a feature, the more it raises the predicted likelihood of MN. Disease diagnosis is thus supported by inputting the parameters and visualizing them with SHAP. ALB and beta 2-MG rank among the top three contributors in both models. In the optimized model PLA2R contributes most: the higher the PLA2R value, the higher the probability that the patient is diagnosed as MN. ALB is second: the lower the ALB, the higher the probability that the patient is judged to be a positive sample, i.e. diagnosed as MN.
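The relationship between the base value and the per-feature contributions can be written out as follows. This is the standard additive form of SHAP explanations; the symbols are notation introduced here rather than taken from the embodiment, and the mapping to a probability assumes that the explained model output is in log-odds, which the text does not state.

```latex
% f(x): model output for patient x; \phi_0: base value;
% \phi_i(x): SHAP value of feature i; M: number of features (6 in the optimized model)
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
\qquad
P(\mathrm{MN} \mid x) = \frac{1}{1 + e^{-f(x)}}
```

Features with \phi_i(x) > 0 push the output above the base value and correspond to the dark (red) bars, while features with \phi_i(x) < 0 pull it below and correspond to the light (blue) bars; the bar length is |\phi_i(x)|.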

The invention also provides a machine learning device for predicting membranous nephropathy based on machine learning, which comprises a memory, a processor, and a prediction model and/or an optimized prediction model. The memory is used for storing instructions; the prediction model and the optimized prediction model are obtained by the membranous nephropathy prediction method; the instructions are used for predicting the data to be predicted through the prediction model or the optimized prediction model to obtain a label value; and the processor is used for executing the instructions. The device is not limited thereto, and may further include an input unit for obtaining the data to be predicted and an output unit for outputting the label value. Specifically, the prediction device may be a computer.

The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope of the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the appended claims.

Claims

1. A method for predicting membranous nephropathy based on machine learning, comprising:

2. The method for predicting membranous nephropathy based on machine learning of claim 1, wherein the data acquisition in step one collects laboratory tests of the subject population and laboratory test indexes of the patient to be retrieved, and the test results include MN and non-MN.

3. The machine learning-based method of predicting membranous nephropathy, according to claim 1, wherein in step one data acquisition, said inclusion and exclusion criteria comprise: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN.

4. The method for predicting membranous nephropathy based on machine learning of claim 1, wherein in step two, said data preprocessing further comprises abnormal value processing, and when the data is an abnormal value, the corresponding patient data information is deleted.

5. The method for predicting membranous nephropathy based on machine learning of claim 4, wherein the outlier processing comprises:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

6. The method for predicting membranous nephropathy based on machine learning of claim 4, wherein the outlier processing employs the 3-sigma principle.

7. The method for predicting membranous nephropathy based on machine learning of claim 1,

wherein in step four, a simplified prediction model is constructed: 70% of the patients are used as a training set for training and modeling, and 30% of the patients are used as a verification set for verification; a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method is adopted to obtain the simplified prediction model, wherein the sample labels of the training set comprise MN and non-MN patients.

8. The method for predicting membranous nephropathy based on machine learning of claim 1,

wherein in step four, an optimized prediction model is constructed: 70% of the patients are used as a training set for training and modeling, and 30% of the patients are used as a verification set for verification; a support vector machine, CatBoost, XGBoost, AdaBoost, an artificial neural network, Naive Bayes or a traditional logistic regression method is adopted to obtain the optimized prediction model, wherein the sample labels of the training set comprise MN and non-MN patients.

9. The method for predicting membranous nephropathy based on machine learning of claim 1, wherein AUC is used as evaluation index in step five, and the prediction model for predicting membranous nephropathy based on machine learning is tested and evaluated.

10. A system for predicting membranous nephropathy based on machine learning, the system comprising:

the data preprocessing module 2 is used for cleaning, deleting and filling patient data information, the data preprocessing module 2 comprises a missing value processing module 201, the missing value processing module 201 is used for deleting the patient data information with the missing rate larger than 20% through preliminary screening, and filling missing values by adopting a random forest method to obtain the patient data information of MN and non-MN;

the feature screening module 3 is used for screening and ranking the feature indexes: Z feature indexes are screened using a mutual information method, wherein Z is a positive integer not more than Y; dimensionality reduction is performed on the Z feature indexes using a feature elimination method to obtain M feature indexes, wherein M is a positive integer and M is less than or equal to Z;

11. The system for predicting membranous nephropathy based on machine learning of claim 10, further comprising a SHAP analysis visualization output module 6, wherein said SHAP analysis visualization output module 6 is in electrical signal connection with said automatic prediction module 5 for obtaining SHAP values of features in the automatic prediction module, and predicting the probability of membranous nephropathy based on said obtained SHAP values of features in the patient data information to be retrieved.

12. The system for predicting membranous nephropathy based on machine learning of claim 11, wherein said prediction model construction module 4 applies support vector machine, catboost, XGboost, AdaBoost, artificial neural network, Naive Bayes or traditional logistic regression methods.

13. The system for predicting membranous nephropathy based on machine learning of claim 10, wherein said data collection module 1 is configured to collect laboratory tests of the subject population and laboratory test indexes of the patient to be retrieved, and the test results include MN and non-MN.

14. The system for machine learning-based prediction of membranous nephropathy according to claim 10, wherein, in said data collection module 1, the inclusion and exclusion criteria include: excluding data from patients with an age below 18 years and/or excluding data from pregnant women and/or excluding data from lactating women and/or excluding data from patients with malignant tumours and/or excluding medical emergencies and/or excluding infectious diseases and/or excluding SMN.

15. The system for predicting membranous nephropathy based on machine learning of claim 10, wherein the data preprocessing module 2 comprises an abnormal value processing module for deleting the corresponding patient data information when the data is an abnormal value.

16. The system for machine learning-based prediction of membranous nephropathy according to claim 15,

wherein the abnormal value processing module operates as follows:

judging whether the BMI value is negative or not,

if yes, the data is an abnormal value;

if not, the data is retained.

17. The system for machine learning-based prediction of membranous nephropathy according to claim 15, wherein the outlier processing module employs the 3-sigma principle.

18. A machine learning apparatus for predicting membranous nephropathy based on machine learning, the apparatus comprising a processor and a memory, the memory for storing instructions, the processor for executing the instructions to implement the machine learning method of any one of claims 1 to 9.