CN116738297A - Diabetes typing method and system based on depth self-coding - Google Patents

Diabetes typing method and system based on depth self-coding

Info

Publication number
CN116738297A
Authority
CN
China
Prior art keywords
data
diabetes
model
clinical
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311022792.8A
Other languages
Chinese (zh)
Other versions
CN116738297B (en)
Inventor
王伟好
肖佩
潘琦
陈子豪
李影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qs Medical Technology Co ltd
Original Assignee
Beijing Qs Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qs Medical Technology Co ltd filed Critical Beijing Qs Medical Technology Co ltd
Priority to CN202311022792.8A priority Critical patent/CN116738297B/en
Publication of CN116738297A publication Critical patent/CN116738297A/en
Application granted granted Critical
Publication of CN116738297B publication Critical patent/CN116738297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a diabetes typing method and system based on depth self-coding. The diabetes typing method comprises the following steps: extracting clinical data samples from a diabetes clinical database as training data and verification data; constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model, wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model; and verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain a final diabetes typing model. The diabetes typing system comprises modules corresponding to the steps of the diabetes typing method.

Description

Diabetes typing method and system based on depth self-coding
Technical Field
The invention provides a diabetes typing method and system based on depth self-coding, and belongs to the technical field of deep learning model establishment.
Background
The current general method of classifying diabetes mellitus divides it into type 1 and type 2 diabetes, with roughly 90% of cases belonging to type 2. Type 2 diabetes, however, presents differently across individuals in etiology, clinical manifestations, prognosis and other respects; it is highly heterogeneous and leads to different clinical outcomes. The current typing method therefore cannot meet the requirements of clinical work and does not support individualized, precise treatment of diabetic patients. Against this background, a disease typing model designed for the diabetic population is needed.
Traditional machine learning clustering methods have difficulty accurately evaluating the similarity between samples, and struggle to effectively cluster high-dimensional data with sparse distribution and unclear cluster structure. Moreover, if a neural network is used only as a feature extractor, the clustering objective is not explicitly incorporated into the learning process, so the learned deep neural network does not necessarily output dimension-reduced data that is suitable for clustering.
Disclosure of Invention
The invention provides a diabetes typing method and a diabetes typing system based on depth self-coding, which solve the problem that existing diabetes typing models cannot effectively cluster high-dimensional data with sparse distribution and unclear cluster structure. The following technical scheme is adopted:
a depth self-encoding based diabetes typing method, the diabetes typing method comprising:
extracting clinical data samples from a diabetes clinical database as training data and verification data;
constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model;
and verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain a final diabetes typing model.
Further, extracting clinical data samples from the diabetes clinical database as training data and validation data, comprising:
extracting clinical data samples from the diabetes clinical database;
performing data preprocessing on the clinical data sample to obtain a preprocessed clinical data sample;
dividing the preprocessed clinical data samples according to the data proportion of preset training data and verification data, and obtaining the training data and the verification data corresponding to the data proportion.
Further, performing data preprocessing on the clinical data sample to obtain a preprocessed clinical data sample, including:
removing null values from the clinical data samples to obtain clinical sample data without null values;
removing outliers that fall outside N standard deviations from the null-free clinical sample data, to obtain outlier-free clinical sample data;
and performing continuous variable normalization and categorical variable encoding on the outlier-free clinical sample data to obtain the preprocessed clinical data sample.
Further, removing outliers outside N standard deviations from the null-free clinical sample data to obtain outlier-free clinical sample data comprises:
carrying out average value calculation and standard deviation calculation on the clinical sample data to obtain an average value and a standard deviation corresponding to the clinical sample data;
determining a threshold coefficient N for outliers using the mean and standard deviation corresponding to the clinical sample data, and determining the range of outliers through the threshold coefficient N, wherein the threshold coefficient N and the range of outliers are obtained by the following formula:
where N represents the threshold coefficient; X_p represents the mean of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point and takes a value in the range 0.71 to 0.74; λ represents the adjustment coefficient, with λ = -(1-P) when X_c - (1+P)^2·X_p > 0 and λ = 1 when X_c - (1+P)^2·X_p < 0; ΔP represents the first adjustment factor; X_ymax and X_ymin represent the upper and lower limits of the range of outliers;
traversing each data point in the data set and judging whether it falls outside the range of outliers;
when a clinical sample data point falls outside the range of outliers, treating that data point as an outlier;
and acquiring a substitute value for the outlier according to the relation between the outlier and the range of outliers, replacing the outlier at its corresponding position with the substitute value, and deleting the outlier.
Further, the substitute value is obtained by the following formula:
where X_t represents the substitute value corresponding to the outlier; X_p represents the mean of the clinical sample data; X represents the value of the original data point of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point and takes a value in the range 0.71 to 0.74; X_ymax and X_ymin represent the upper and lower limits of the range of outliers.
Further, performing continuous variable normalization and categorical variable encoding on the outlier-free clinical sample data to obtain the preprocessed clinical data sample comprises:
setting a scaling strategy for continuous variables, wherein the scaling strategy corresponds to the following formula:
where X_s represents the value of a data point of the scaled clinical sample data; X represents the value of the original data point of the clinical sample data; X_min represents the minimum data value in the raw data set of the clinical sample data; X_max represents the maximum data value in the raw data set of the clinical sample data; X_rmin and X_rmax represent the preset lower and upper limits of the scaled data used in variable scaling of the clinical sample data;
scaling and normalizing continuous variables to be normalized in the clinical sample data according to the scaling strategy of the continuous variables to generate continuous variable normalized data information;
and determining the categorical variables that require encoding in the continuous-variable-normalized data information, and performing categorical variable encoding on that data information according to the characteristics of the categorical variables, to obtain sample data after categorical encoding conversion, wherein the sample data after categorical encoding conversion is the preprocessed clinical data sample.
Further, constructing a diabetes typing model based on depth self-coding and training the diabetes typing model with the training data to obtain a trained diabetes typing model comprises the following steps:
constructing a diabetes typing model based on depth self-coding;
training the depth self-encoder by using training data to obtain a trained depth self-encoder;
performing joint loss optimization, by way of KL divergence, between M of the trained depth self-encoders and the Kmeans clustering module, to form depth self-encoders with Kmeans clustering; wherein the diabetes typing model with the Kmeans-clustered depth self-encoders is the trained diabetes typing model, and the specific value of the number M of depth self-encoders is obtained by the following formula:
where M represents the number of depth self-encoders combined with the Kmeans clustering module and is rounded down; when the formula yields M = 0, M is set to 1, and when the formula yields M > M_0, M is set to M_0 - 1; A_0 represents the number of outlier data points in the clinical sample data; A represents the total number of clinical sample data points; M_0 represents the total number of depth self-encoders in the depth self-coding diabetes typing model; ΔM represents the second adjustment factor.
Further, verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result to obtain the final diabetes typing model, comprises the following steps:
inputting the verification data into the trained diabetes typing model to obtain a clustering index radar chart after diabetes typing;
comparing the index data represented in the clustering index radar chart with the characteristics of each type of diabetes in the verification data;
when the comparison result shows that the diabetes typing model conforms to the characteristic distribution rule range of the verification data, judging that the currently trained diabetes typing model is the final diabetes typing model;
when the comparison result shows that the diabetes typing model does not conform to the characteristic distribution rule range of the verification data, adjusting the outlier threshold coefficient N and the number of encoders M with the first adjustment factor and the second adjustment factor respectively, and re-acquiring the trained diabetes typing model with the adjusted outlier threshold coefficient N and number of encoders M, until the verification result of the trained diabetes typing model conforms to the characteristic distribution rule range of the verification data.
Further, the first adjustment factor and the second adjustment factor are obtained by the following formula:
where ΔP represents the first adjustment factor; ΔM represents the second adjustment factor; K represents the number of data points that do not conform to the characteristic distribution rule range of the verification data; X_mi represents the i-th data value that does not conform to the characteristic distribution rule range of the verification data; X_si represents the scaled data value corresponding to the i-th data point that does not conform to the characteristic distribution rule range of the verification data; X_h represents the data value of the data point within the characteristic distribution rule range nearest to the i-th non-conforming data point; X_p represents the mean of the clinical sample data; X_c represents the standard deviation of the clinical sample data; X_c1 represents the standard deviation corresponding to the verification data.
A depth self-encoding based diabetes typing system, the diabetes typing system comprising:
the data extraction module is used for extracting clinical data samples from the diabetes clinical database as training data and verification data;
the model construction and training module is used for constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model;
and the verification adjustment module is used for verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain the final diabetes typing model.
The invention has the beneficial effects that:
According to the diabetes typing method and system based on depth self-coding, a clustering objective is added to the optimization process: the pre-trained encoder part of the self-encoder is taken out and jointly optimized with the Kmeans clustering module through a KL-divergence loss, so that high-dimensional data with sparse distribution and unclear cluster structure can be clustered effectively.
Drawings
FIG. 1 is a flow chart of a method for typing diabetes mellitus according to the present invention;
FIG. 2 is a system block diagram of a diabetes typing system according to the present invention;
FIG. 3 is a schematic diagram of the addition of the Kmeans clustering module to the diabetes typing model according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a diabetes typing method based on depth self-coding, as shown in figure 1, comprising the following steps:
S1, extracting clinical data samples from a diabetes clinical database as training data and verification data;
S2, constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model. The principle of the Kmeans clustering module is shown in FIG. 3, where DEC denotes the deep self-encoding clustering algorithm, encoder denotes the encoder, and decoder denotes the decoder;
and S3, verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain the final diabetes typing model.
The working principle of the technical scheme is as follows: s1, extracting clinical data samples from a diabetes clinical database as training data and verification data: in this step, a number of clinical data samples are obtained from the diabetes clinical database. These data samples contain clinical features and signatures associated with diabetes. Training data is used to build the model and validation data is used to evaluate the performance of the model.
S2, constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model:
in this step, a diabetes typing model is constructed using a depth self-encoding model. Depth self-coding is an unsupervised learning method that encodes and decodes input data through a multi-layer neural network to extract high-level feature representations of the data. By training the model on the training data, an optimized diabetes typing model can be obtained that is able to extract and represent meaningful features of the input data.
Wherein a neural network model with multiple depth self-encoders may be employed, the structure of which may be, but is not limited to, the following network model structure:
The following is the constituent structure of a stacked self-encoder neural network model for diabetes typing:
input Layer (Input Layer): clinical characteristics of a diabetic patient are received as input.
Encoder (Encoder): a stack of multiple self-encoders, each responsible for learning a different level of abstract feature representation of the input data. Each self-encoder comprises the following two parts:
a. An encoder section: comprising one or more hidden layers and an activation function, compresses the input data into a lower dimensional coded representation.
b. A decoder section: symmetrical to the encoder section, containing one or more hidden layers and activation functions, maps the encoded representation back to the original input dimension.
Decoder (Decoder): the output of the decoder section of the last self-encoder serves as the output of the entire model.
Output Layer (Output Layer): a layer consisting of one or more neurons outputs probability distributions belonging to different diabetes types.
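As an illustration of the stacked structure described above, the sketch below builds a chain of self-encoder blocks followed by a softmax output head in PyTorch. The layer widths, activation functions, and the number of diabetes subtypes are not specified by the patent and are assumed here purely for demonstration.

```python
import torch.nn as nn

class SelfEncoderBlock(nn.Module):
    """One self-encoder: the encoder section compresses the clinical features,
    the decoder section maps the code back to the original input dimension."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return code, self.decoder(code)

class DiabetesTypingNet(nn.Module):
    """Stacked self-encoders plus an output layer that emits a probability
    distribution over diabetes subtypes (widths and subtype count are assumed)."""
    def __init__(self, in_dim: int, code_dims=(32, 16, 8), n_types: int = 4):
        super().__init__()
        dims = (in_dim, *code_dims)
        self.blocks = nn.ModuleList(
            SelfEncoderBlock(dims[i], dims[i + 1]) for i in range(len(code_dims)))
        self.output = nn.Sequential(nn.Linear(code_dims[-1], n_types),
                                    nn.Softmax(dim=-1))

    def forward(self, x):
        for block in self.blocks:
            x, _ = block(x)          # feed each block's code into the next block
        return self.output(x)        # probabilities over the assumed subtypes
```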
S3, verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain the final diabetes typing model:
in this step, the trained diabetes typing model is evaluated and validated using the verification data set. By inputting the verification data into the trained model, the model's predictions for new samples can be obtained, and the performance and accuracy of the model can be evaluated based on the verification results. If the verification results do not meet the requirements, the diabetes typing model can be adjusted and optimized, for example by adjusting the hyperparameters of the model or increasing the amount of training data, to obtain the final diabetes typing model.
The technical scheme has the effects that: according to the diabetes typing method based on depth self-coding, a clustering objective is added to the optimization process: the pre-trained encoder part of the self-encoder is taken out and jointly optimized with the Kmeans clustering module through a KL-divergence loss, so that high-dimensional data with sparse distribution and unclear cluster structure can be clustered effectively. Meanwhile, the final diabetes typing model obtained by this method ensures that the data output by the depth self-encoders is dimension-reduced data suitable for clustering.
In one embodiment of the invention, extracting clinical data samples from a diabetes clinical database as training data and validation data includes:
s101, extracting clinical data samples from the diabetes clinical database;
s102, carrying out data preprocessing on the clinical data sample to obtain a preprocessed clinical data sample;
s103, dividing the preprocessed clinical data sample according to the data proportion of preset training data and verification data, and obtaining the training data and the verification data corresponding to the data proportion.
Wherein, carry on the data preprocessing to the said clinical data sample, obtain the clinical data sample after preprocessing, including:
s1021, removing null values from the clinical data samples to obtain clinical sample data without null values;
s1022, removing outliers outside N standard deviations from the null-free clinical sample data to obtain outlier-free clinical sample data;
s1023, performing continuous variable normalization and categorical variable encoding on the outlier-free clinical sample data to obtain the preprocessed clinical data sample.
The working principle of the technical scheme is as follows: clinical data samples of diabetics are obtained from the database. Data preprocessing is performed on the data samples. Data preprocessing is to clean and prepare the data for subsequent analysis and modeling. In this step, the following sub-steps are performed:
Null values are removed from the clinical data samples in order to handle missing values in the clinical data; missing values may affect the results of subsequent analysis and modeling, so samples containing null values need to be processed or removed.
The clinical sample data without null values is stripped of outliers outside of N standard deviations, with the aim of detecting and removing outliers in the data. Outliers may be due to measurement errors or other anomalies, which if left untreated, may adversely affect modeling and analysis. Where N represents a threshold, which may be a multiple of the standard deviation, as the case may be.
Continuous variable normalization and categorical variable encoding are performed on the outlier-removed clinical sample data in order to properly process different types of features so that they are comparable and usable. Continuous variable normalization scales continuous variables with different ranges into the same range; common methods include min-max scaling and Z-score normalization. Categorical variable encoding converts categorical variables into numerical representations; common methods are one-hot encoding and label encoding.
The preprocessed clinical data samples are then divided into training and verification data sets according to the preset proportion of training data to verification data. The training data is used for training the model, and the verification data is used for evaluating model performance and making adjustments.
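A minimal sketch of this preprocessing-and-split step is given below, assuming pandas/NumPy, plain min-max normalization, one-hot encoding, and an 80/20 train/verification split; these concrete choices are illustrative rather than values prescribed by the patent.

```python
import numpy as np
import pandas as pd

def preprocess_and_split(df: pd.DataFrame, continuous_cols, categorical_cols,
                         train_ratio: float = 0.8, seed: int = 42):
    """Drop null values, min-max scale continuous columns, one-hot encode
    categorical columns, then split by a preset train/verification ratio."""
    df = df.dropna().copy()                               # remove null values

    # Continuous-variable normalization (plain min-max to [0, 1] here)
    for col in continuous_cols:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0

    # Categorical-variable encoding (one-hot)
    df = pd.get_dummies(df, columns=list(categorical_cols))

    # Split according to the preset training/verification proportion
    idx = np.random.default_rng(seed).permutation(len(df))
    n_train = int(train_ratio * len(df))
    return df.iloc[idx[:n_train]], df.iloc[idx[n_train:]]
```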
The technical scheme has the effects that: the above-described technical solution of the present embodiment provides clean data samples and data sets for training and verification by performing data preprocessing and partitioning on diabetes clinical data. The technical scheme of the embodiment is beneficial to reducing noise and abnormal values in the data and converting the data into a form suitable for modeling, so that the accuracy and stability of the model are improved.
In one embodiment of the invention, removing outliers outside N standard deviations from the null-free clinical sample data to obtain outlier-free clinical sample data comprises:
step 1, carrying out average value calculation and standard deviation calculation on the clinical sample data to obtain an average value and a standard deviation corresponding to the clinical sample data;
step 2, determining a threshold coefficient N for outliers using the mean and standard deviation corresponding to the clinical sample data, and determining the range of outliers through the threshold coefficient N, wherein the threshold coefficient N and the range of outliers are obtained by the following formula:
where N represents the threshold coefficient; X_p represents the mean of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point and takes a value in the range 0.71 to 0.74; λ represents the adjustment coefficient, with λ = -(1-P) when X_c - (1+P)^2·X_p > 0 and λ = 1 when X_c - (1+P)^2·X_p < 0; ΔP represents the first adjustment factor; X_ymax and X_ymin represent the upper and lower limits of the range of outliers;
step 3, traversing each data point in the data set and judging whether it falls outside the range of outliers;
step 4, when a clinical sample data point falls outside the range of outliers, treating that data point as an outlier;
and step 5, acquiring a substitute value for the outlier according to the relation between the outlier and the range of outliers, replacing the outlier at its corresponding position with the substitute value, and deleting the outlier.
Wherein the substitute value is obtained by the following formula:
where X_t represents the substitute value corresponding to the outlier; X_p represents the mean of the clinical sample data; X represents the value of the original data point of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point and takes a value in the range 0.71 to 0.74; X_ymax and X_ymin represent the upper and lower limits of the range of outliers.
The working principle of the technical scheme is as follows: the mean and standard deviation of the clinical sample data are calculated to obtain the mean and standard deviation corresponding to the clinical sample data. The mean calculation averages the data samples, and the standard deviation calculation measures the degree of dispersion of the data samples.
And determining a threshold coefficient N of the abnormal value by using the average value and the standard deviation corresponding to the clinical sample data, and determining the range of the abnormal value through the threshold coefficient N. The threshold coefficient N is a critical range for determining outliers from the mean and standard deviation, typically by multiplying N by the standard deviation.
Each data point in the dataset is traversed and a determination is made as to whether it is outside the range of outliers. For each data point, a determination is made as to whether it belongs to an outlier by comparison to a threshold range of outliers.
When the clinical sample data exceeds the range of outliers, the clinical sample data that exceeds the range of outliers is marked as outliers. This step marks data points that exceed the outlier range as outliers for subsequent processing.
A substitute value for the outlier is acquired according to the relation between the outlier and the range of outliers, and the outlier at its corresponding position is replaced with the substitute value. In this step, different strategies may be adopted to replace outliers, such as using the mean, the median or other statistics as replacement values, as the case requires. The outlier is then deleted from the data set to ensure the accuracy and consistency of the data.
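The detect-and-replace step can be sketched as below with a plain mean ± N·std rule and boundary substitution; it does not reproduce the patent's own threshold-coefficient and substitute-value formulas, which additionally involve the percentile point P and the adjustment factors.

```python
import numpy as np

def replace_outliers(x: np.ndarray, n: float = 3.0) -> np.ndarray:
    """Flag points outside mean +/- n*std and substitute the nearest range
    boundary for them (a simple stand-in for the patent's substitute-value formula)."""
    mean, std = x.mean(), x.std()
    lower, upper = mean - n * std, mean + n * std    # range of non-outlier values
    cleaned = x.copy()
    cleaned[x < lower] = lower                       # substitute low outliers
    cleaned[x > upper] = upper                       # substitute high outliers
    return cleaned
```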
The technical scheme has the effects that: according to the technical scheme, the range of abnormal values is determined by calculating the average value and the standard deviation of clinical sample data, and data points out of the range are marked as the abnormal values. Then, a substitute value for the outlier is obtained from the relationship between the outlier and the range, and the substitute value is substituted for the outlier. This effectively handles outliers in the clinical sample data to ensure the quality and reliability of the data.
Meanwhile, the threshold coefficient N obtained by the formula effectively improves the accuracy of setting the range of outliers and the match between the threshold coefficient N and the clinical sample data. This prevents reduced outlier-screening sensitivity and reduced outlier accuracy, which would in turn reduce the precision and accuracy of the subsequently trained typing model; it also prevents the threshold coefficient N from being too small, which would make outlier screening overly sensitive and cause valid training data to be removed by mistake. With the above formula, the threshold coefficient N of outliers can be determined from the mean and standard deviation of the clinical sample data together with the other parameters and adjustment factors, and the range of outliers can then be determined. This helps identify and process outliers in clinical data and improves the accuracy and reliability of data screening.
On the other hand, the substitute value obtained through the formula is set by combining the mean and standard deviation of the clinical sample data, the percentile point, and the upper and lower limits of the outlier range, together with the actual distribution of each outlier's data value. This effectively improves the rationality and accuracy of setting the substitute value, reduces the risk that the substitute value is itself anomalous, and thereby improves the accuracy of subsequent model training.
In one embodiment of the present invention, performing continuous variable normalization and categorical variable encoding on the outlier-free clinical sample data to obtain the preprocessed clinical data sample comprises:
setting a scaling strategy for continuous variables, wherein the scaling strategy corresponds to the following formula:
where X_s represents the value of a data point of the scaled clinical sample data; X represents the value of the original data point of the clinical sample data; X_min represents the minimum data value in the raw data set of the clinical sample data; X_max represents the maximum data value in the raw data set of the clinical sample data; X_rmin and X_rmax represent the preset lower and upper limits of the scaled data used in variable scaling of the clinical sample data;
secondly, scaling and normalizing continuous variables to be normalized in the clinical sample data according to the scaling strategy of the continuous variables to generate continuous variable normalized data information;
and thirdly, determining the categorical variables that require encoding in the continuous-variable-normalized data information, and performing categorical variable encoding on that data information according to the characteristics of the categorical variables, to obtain sample data after categorical encoding conversion, wherein the sample data after categorical encoding conversion is the preprocessed clinical data sample.
The working principle of the technical scheme is as follows: the scaling strategy for the continuous variable is set, in which step the scaling strategy to be adopted for the continuous variable needs to be determined, for example using, but not limited to, min-max scaling, normalization, etc.
And scaling and normalizing the continuous variable which needs to be normalized in the clinical sample data according to the scaling strategy of the continuous variable, and generating data information after continuous variable normalization. And according to the selected scaling strategy, carrying out corresponding processing on the continuous variable to ensure that the value of the continuous variable is within a certain range or meets specific distribution characteristics.
The categorical variables that require encoding in the continuous-variable-normalized data information are then determined. According to the characteristics of the categorical variables, it is decided which variables need to be encoded, for example converting categorical variables into numerical representations with methods such as one-hot encoding or label encoding.
Finally, pre-processed clinical data samples can be obtained by continuous variable normalization and classification variable encoding processes, wherein the continuous variable has been subjected to a scaled normalization process and the classification variable has been converted into a numerical representation.
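A small sketch of scaling a continuous variable into a preset range [X_rmin, X_rmax], in the spirit of the variables named above, is shown below; the exact scaling formula used by the patent is not reproduced here.

```python
import numpy as np

def scale_to_range(x: np.ndarray, r_min: float, r_max: float) -> np.ndarray:
    """Min-max scale raw values into the preset target range [r_min, r_max]."""
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                    # degenerate column: map to the lower bound
        return np.full_like(x, r_min, dtype=float)
    return r_min + (x - x_min) / (x_max - x_min) * (r_max - r_min)
```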
The technical scheme has the effects that: by the preprocessing, dimensional differences among data can be eliminated, the training effect of the model is improved, and different types of characteristics can be ensured to be correctly input into the model. Meanwhile, the clinical sample data is preprocessed by setting a scaling strategy of the continuous variable, scaling and normalizing the continuous variable and encoding the classified variable. Therefore, the comparability of data and the training effect of the model can be improved, and a better data base is provided for the subsequent establishment of the diabetes type parting model.
On the other hand, the scaled values obtained through the scaling strategy keep the data scaling reasonable and preserve, to the greatest extent, the distribution characteristics of the data within a given range. This further improves the quality of the subsequent training data and the effectiveness of verification with the verification data, and prevents unreasonable scaling from degrading the quality of the training and verification data, which would otherwise lower the accuracy of early model training and of subsequent model verification.
In one embodiment of the invention, constructing a diabetes typing model based on depth self-coding and training the diabetes typing model with the training data to obtain a trained diabetes typing model comprises the following steps:
s201, constructing a diabetes typing model based on depth self-coding;
s202, training a depth self-encoder by using training data to obtain a trained depth self-encoder;
s203, performing joint loss optimization, by way of KL divergence, between M of the trained depth self-encoders and the Kmeans clustering module, to form depth self-encoders with Kmeans clustering; wherein the diabetes typing model with the Kmeans-clustered depth self-encoders is the trained diabetes typing model, and the specific value of the number M of depth self-encoders is obtained by the following formula:
where M represents the number of depth self-encoders combined with the Kmeans clustering module and is rounded down; when the formula yields M = 0, M is set to 1, and when the formula yields M > M_0, M is set to M_0 - 1; A_0 represents the number of outlier data points in the clinical sample data; A represents the total number of clinical sample data points; M_0 represents the total number of depth self-encoders in the depth self-coding diabetes typing model; ΔM represents the second adjustment factor.
The working principle of the technical scheme is as follows: and constructing a diabetes typing model based on depth self-coding. The depth self-encoder is a neural network model, and consists of an encoder and a decoder, and is used for learning the compact representation and reconstruction capability of input data.
Training the depth self-encoder by using training data to obtain the trained depth self-encoder. In this step, the training data is used to train the depth self-encoder, optimizing the model parameters by minimizing the reconstruction error, enabling it to reconstruct the input data better.
And carrying out joint loss optimization on M depth self-encoders in the trained depth self-encoders and a Kmeans clustering module in a KL divergence mode to form the depth self-encoder with Kmeans clustering. In this step, the trained depth self-encoder is combined with the Kmeans clustering module, and the model is optimized by minimizing the KL divergence, so that the encoded representation can be better matched with the Kmeans clustering result.
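This joint optimization can be sketched in the style of deep embedded clustering (DEC): encoded points are softly assigned to cluster centers (typically a learnable parameter initialized from a Kmeans run on the pretrained encoder output), a sharpened target distribution is formed, and the KL divergence between the two is minimized together with the reconstruction loss. The Student-t assignment, the target distribution, and the loss weight gamma below are standard DEC choices assumed for illustration, not details taken from the patent.

```python
import torch
import torch.nn.functional as F

def soft_assign(z: torch.Tensor, centers: torch.Tensor, alpha: float = 1.0):
    """Student-t soft assignment of encoded points z (B x d) to cluster centers (K x d)."""
    dist_sq = torch.cdist(z, centers) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q: torch.Tensor):
    """Sharpened target distribution P used as the KL target in DEC."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def joint_loss(x, recon, z, centers, gamma: float = 0.1):
    """Reconstruction loss plus KL(P || Q) clustering loss, optimized jointly."""
    q = soft_assign(z, centers)
    p = target_distribution(q).detach()               # target held fixed within the step
    kl = F.kl_div(q.log(), p, reduction="batchmean")  # KL(P || Q)
    mse = F.mse_loss(recon, x)
    return mse + gamma * kl
```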
The technical scheme has the effects that: by the technical scheme, the depth self-encoder with Kmeans clustering is constructed, the model can encode and decode input data, and the data are grouped and classified by a clustering method. The depth self-encoder learns the feature representation of the data, and the Kmeans clustering module classifies the data by a clustering algorithm. Through joint optimization, the model can better type the diabetes data, so that the purpose of diabetes typing is achieved.
Meanwhile, according to this technical scheme, the clustering objective is added to the optimization process: the pre-trained encoder part of the self-encoder is taken out and jointly optimized with the Kmeans clustering module through a KL-divergence loss, so that high-dimensional data with sparse distribution and unclear cluster structure can be clustered effectively. Moreover, the final diabetes typing model obtained with the technical scheme of this embodiment ensures that the data output by the depth self-encoders is dimension-reduced data suitable for clustering.
The technical scheme provided by the embodiment constructs a model for diabetes typing by combining a depth self-encoder and a Kmeans clustering module and adopting a combined loss optimization mode. The model can automatically learn the characteristic representation of the data and cluster, and provides an effective method for diabetes typing.
On the other hand, the adjustment factor in the above formula for the number M of depth self-encoders is used to determine how many depth self-encoders the diabetes typing model contains; it adjusts the number of depth self-encoders according to the number of outliers and the total number of samples, so as to accommodate the characteristics and complexity of the data. The number of clustering-equipped depth self-encoders obtained by the formula also reflects the actual situation of the sample data, which effectively improves the rationality of that number: it prevents too many clustering-equipped depth self-encoders, which would reduce response speed, and too few, which would degrade the clustering effect.
In one embodiment of the present invention, verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result to obtain the final diabetes typing model, comprises:
s301, inputting the verification data into a trained diabetes typing model to obtain a clustering index radar chart after diabetes typing;
S302, comparing index data represented in the clustering index radar graph with characteristics of each type of diabetes in verification data;
s303, when the comparison result shows that the diabetes typing model conforms to the characteristic distribution rule range of the verification data, judging that the currently trained diabetes typing model is the final diabetes typing model;
s304, when the comparison result shows that the diabetes typing model does not conform to the characteristic distribution rule range of the verification data, adjusting the outlier threshold coefficient N and the number of encoders M with the first adjustment factor and the second adjustment factor respectively, and re-acquiring the trained diabetes typing model with the adjusted outlier threshold coefficient N and number of encoders M, until the verification result of the trained diabetes typing model conforms to the characteristic distribution rule range of the verification data.
The first adjustment factor and the second adjustment factor are obtained through the following formula:
where ΔP represents the first adjustment factor; ΔM represents the second adjustment factor; K represents the number of data points that do not conform to the characteristic distribution rule range of the verification data; X_mi represents the i-th data value that does not conform to the characteristic distribution rule range of the verification data; X_si represents the scaled data value corresponding to the i-th data point that does not conform to the characteristic distribution rule range of the verification data; X_h represents the data value of the data point within the characteristic distribution rule range nearest to the i-th non-conforming data point; X_p represents the mean of the clinical sample data; X_c represents the standard deviation of the clinical sample data; X_c1 represents the standard deviation corresponding to the verification data.
The working principle of the technical scheme is as follows: inputting the verification data into the trained diabetes typing model to obtain a clustering index radar chart after diabetes typing. The cluster index radar chart is used for representing the distribution situation of different indexes on different diabetes types.
The index data in the clustered index radar map is compared with the characteristics of each type of diabetes in the validation data. By comparison, whether the trained diabetes parting model accords with the characteristic distribution rule range of the verification data can be evaluated.
If the comparison result shows that the diabetes parting model accords with the characteristic distribution rule range of the verification data, the current trained diabetes parting model is judged to be the final diabetes parting model.
If the comparison result shows that the diabetes typing model does not conform to the characteristic distribution rule range of the verification data, adjustment is needed. The outlier threshold coefficient N and the number of encoders M are adjusted with the first adjustment factor and the second adjustment factor respectively, and the diabetes typing model is retrained with the adjusted threshold coefficient N and number of encoders M, until the verification result conforms to the characteristic distribution rule range of the verification data.
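The validate-adjust-retrain loop can be sketched as follows. The helpers train_typing_model and validate_typing_model and the report fields are hypothetical placeholders; the actual adjustment factors ΔP and ΔM come from the patent's formulas, which are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ValidationReport:
    within_expected_distribution: bool   # does the radar chart match the verification data?
    delta_p: float                       # first adjustment factor (ΔP)
    delta_m: int                         # second adjustment factor (ΔM)

def fit_until_valid(train_data, val_data, n: float = 3.0, m: int = 2,
                    max_rounds: int = 10):
    """Train, validate against the expected feature distribution, and nudge the
    outlier threshold coefficient N and encoder count M until validation passes."""
    model = None
    for _ in range(max_rounds):
        model = train_typing_model(train_data, threshold_n=n, n_encoders=m)  # hypothetical
        report = validate_typing_model(model, val_data)                      # hypothetical
        if report.within_expected_distribution:
            break
        n += report.delta_p   # first adjustment factor nudges the threshold coefficient N
        m += report.delta_m   # second adjustment factor nudges the number of encoders M
    return model
```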
The technical scheme has the effects that: by continuously adjusting the outlier threshold coefficient N and the number of encoders M, the trained diabetes typing model gradually approaches the characteristic distribution rule range of the verification data. Through iterative adjustment, a diabetes typing model conforming to the characteristics of the verification data is finally obtained, improving the accuracy and adaptability of the model.
Meanwhile, the verification data is compared with the trained model, and the threshold coefficient of the abnormal value and the number of encoders are continuously adjusted, so that the diabetes typing model conforming to the characteristic distribution rule of the verification data is finally obtained. The fitting capacity and accuracy of the model are effectively improved, so that the model can be better applied to actual diabetes typing tasks.
The embodiment of the invention provides a diabetes typing system based on depth self-coding, as shown in fig. 2, comprising:
the data extraction module is used for extracting clinical data samples from the diabetes clinical database as training data and verification data;
the model construction and training module is used for constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model;
and the verification adjustment module is used for verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain the final diabetes typing model.
The working principle of the technical scheme is as follows: firstly, extracting clinical data samples from a diabetes clinical database as training data and verification data through a data extraction module;
then, a diabetes typing model based on depth self-coding is constructed by the model construction and training module, and the diabetes typing model is trained with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model;
and finally, the trained diabetes typing model is verified with the verification data by the verification adjustment module, and whether the diabetes typing model needs to be adjusted is determined based on the verification result, to obtain the final diabetes typing model.
The technical scheme has the effects that: according to the diabetes typing system based on depth self-coding, a clustering objective is added to the optimization process: the pre-trained encoder part of the self-encoder is taken out and jointly optimized with the Kmeans clustering module through a KL-divergence loss, so that high-dimensional data with sparse distribution and unclear cluster structure can be clustered effectively. Meanwhile, the final diabetes typing model obtained by the diabetes typing system ensures that the data output by the depth self-encoders is dimension-reduced data suitable for clustering.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A diabetes typing method based on depth self-coding, the diabetes typing method comprising:
extracting clinical data samples from a diabetes clinical database as training data and verification data;
constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model with the training data to obtain a trained diabetes typing model; wherein a Kmeans clustering module is embedded in some of the depth self-encoders of the diabetes typing model;
and verifying the trained diabetes typing model with the verification data, and determining whether the diabetes typing model needs to be adjusted based on the verification result, to obtain a final diabetes typing model.
2. The method of claim 1, wherein extracting clinical data samples from a diabetes clinical database as training data and validation data comprises:
Extracting clinical data samples from the diabetes clinical database;
performing data preprocessing on the clinical data sample to obtain a preprocessed clinical data sample;
dividing the preprocessed clinical data samples according to the data proportion of preset training data and verification data, and obtaining the training data and the verification data corresponding to the data proportion.
3. The method of claim 2, wherein the data preprocessing is performed on the clinical data samples to obtain preprocessed clinical data samples, comprising:
removing null values from the clinical data samples to obtain clinical sample data without null values;
removing outliers that fall outside N standard deviations from the null-free clinical sample data, to obtain outlier-free clinical sample data;
and performing continuous variable normalization and categorical variable encoding on the outlier-free clinical sample data to obtain the preprocessed clinical data sample.
4. The method of claim 3, wherein removing outliers outside N standard deviations from the null-free clinical sample data to obtain outlier-free clinical sample data comprises:
Carrying out average value calculation and standard deviation calculation on the clinical sample data to obtain an average value and a standard deviation corresponding to the clinical sample data;
determining a threshold coefficient N for outliers using the mean and standard deviation corresponding to the clinical sample data, and determining the range of outliers through the threshold coefficient N, wherein the threshold coefficient N and the range of outliers are obtained by the following formula:
where N represents the threshold coefficient; X_p represents the mean of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point and takes a value in the range 0.71 to 0.74; λ represents the adjustment coefficient, with λ = -(1-P) when X_c - (1+P)^2·X_p > 0 and λ = 1 when X_c - (1+P)^2·X_p < 0; ΔP represents the first adjustment factor; X_ymax and X_ymin represent the upper and lower limits of the range of outliers;
traversing each data point in the data set and judging whether it falls outside the range of outliers;
when a clinical sample data point falls outside the range of outliers, treating that data point as an outlier;
and acquiring a substitute value for the outlier according to the relation between the outlier and the range of outliers, replacing the outlier at its corresponding position with the substitute value, and deleting the outlier.
5. The method of claim 4, wherein the substitute value is obtained by the following formula:
wherein X_t represents the substitute value corresponding to the abnormal value; X_p represents the mean value of the clinical sample data; X represents the numerical value corresponding to the original data point of the clinical sample data; X_c represents the standard deviation of the clinical sample data; P represents the percentile point, and the value range of P is 0.71 to 0.74; X_ymax and X_ymin represent the upper limit value and the lower limit value of the range of abnormal values.
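Claims 4 and 5 determine a range of abnormal values from the mean, standard deviation, and threshold coefficient N, then substitute out-of-range values; the exact formulas appear only in the original drawings and are not reproduced here. The sketch below assumes the familiar mean ± N·σ range and uses the violated bound as the substitute value (winsorisation), which is an assumption standing in for the claimed formulas.

import numpy as np
import pandas as pd

def replace_abnormal_values(samples: pd.DataFrame, n: float = 3.0) -> pd.DataFrame:
    """Replace values outside [mean - n*std, mean + n*std] in each numeric column.

    Assumption: the substitute value is the nearest bound of the abnormal-value
    range; the patent defines its own substitute-value formula in the drawings.
    """
    cleaned = samples.copy()
    for col in cleaned.select_dtypes(include=[np.number]).columns:
        x_p = cleaned[col].mean()                      # mean of the clinical sample data
        x_c = cleaned[col].std()                       # standard deviation of the clinical sample data
        x_ymin, x_ymax = x_p - n * x_c, x_p + n * x_c  # assumed range of abnormal values
        cleaned[col] = cleaned[col].clip(lower=x_ymin, upper=x_ymax)
    return cleaned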
6. A method of typing diabetes according to claim 3, wherein carrying out continuous variable normalization and categorical variable encoding processing on the clinical sample data without abnormal values to obtain the preprocessed clinical data samples comprises:
setting a scaling strategy for continuous variables, wherein the scaling strategy corresponds to the following formula:
wherein X_s represents the numerical value corresponding to a data point of the scaled clinical sample data; X represents the numerical value corresponding to the original data point of the clinical sample data; X_min represents the minimum data value in the raw data set of the clinical sample data; X_max represents the maximum data value in the raw data set of the clinical sample data; X_rmin and X_rmax represent the preset lower limit value and upper limit value of the scaled data used in the variable scaling of the clinical sample data;
scaling and normalizing the continuous variables to be normalized in the clinical sample data according to the scaling strategy for continuous variables, to generate continuous variable normalized data information;
and determining the categorical variables that need encoding processing in the continuous variable normalized data information, and carrying out categorical variable encoding processing on the continuous variable normalized data information according to the characteristics of the categorical variables, to obtain sample data after categorical encoding conversion, wherein the sample data after categorical encoding conversion are the preprocessed clinical data samples.
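The scaling strategy of claim 6 maps each continuous variable from its observed [X_min, X_max] onto a preset [X_rmin, X_rmax]; the sketch below implements that standard min-max rescaling and one-hot encodes the categorical variables. The target range of [0, 1] and the choice of one-hot encoding are illustrative assumptions.

import pandas as pd

def scale_and_encode(samples: pd.DataFrame, continuous_cols, categorical_cols,
                     x_rmin: float = 0.0, x_rmax: float = 1.0) -> pd.DataFrame:
    """Min-max scale continuous variables to [x_rmin, x_rmax] and one-hot encode categorical ones."""
    out = samples.copy()
    for col in continuous_cols:
        x_min, x_max = out[col].min(), out[col].max()
        # X_s = X_rmin + (X - X_min) / (X_max - X_min) * (X_rmax - X_rmin)
        out[col] = x_rmin + (out[col] - x_min) / (x_max - x_min) * (x_rmax - x_rmin)
    return pd.get_dummies(out, columns=list(categorical_cols))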
7. The method of claim 1, wherein constructing a depth self-coding based diabetes typing model and training the diabetes typing model with the training data to obtain a trained diabetes typing model comprises:
constructing a diabetes typing model based on depth self-coding;
training the depth self-encoder by using training data to obtain a trained depth self-encoder;
embedding a K-means clustering module into M of the trained depth self-encoders, and performing joint loss optimization by means of KL divergence, to form depth self-encoders with K-means clustering; wherein the diabetes typing model with the K-means clustered depth self-encoders is the trained diabetes typing model, and the specific value of M is obtained by the following formula:
wherein M represents the number of depth self-encoders joined with the K-means clustering module, and M is rounded down; when the calculated value of M is 0, M is set to 1; when the calculated value of M is greater than M_0, M is set to M_0 - 1; A_0 represents the number of abnormal-value data in the clinical sample data; A represents the total number of sample data in the clinical sample data; M_0 represents the total number of depth self-encoders in the depth self-coding diabetes typing model; ΔM represents the second adjustment factor.
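Claim 7 embeds a K-means clustering module into M of the trained depth self-encoders and optimises a joint loss through KL divergence; the formula for M is given only in the original drawings. Below is a minimal PyTorch sketch in the style of deep embedded clustering (DEC): K-means on the latent codes initialises cluster centres, which are then refined together with the encoder using the KL divergence between soft assignments and a sharpened target distribution, plus a reconstruction term. Network sizes, the cluster count, and the loss weighting gamma are illustrative assumptions, not the claimed configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class DeepSelfEncoder(nn.Module):
    """Plain fully connected autoencoder used as one depth self-encoder."""
    def __init__(self, in_dim: int, latent_dim: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))
    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def soft_assign(z, centers, alpha: float = 1.0):
    """Student's t soft assignment of latent codes to cluster centres (DEC style)."""
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution used in the KL term."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def train_clustered_autoencoder(x, n_clusters: int = 4, epochs: int = 100, gamma: float = 0.1):
    """Jointly optimise reconstruction and KL clustering losses on data tensor x."""
    model = DeepSelfEncoder(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # K-means on the initial latent codes initialises the embedded clustering module.
    with torch.no_grad():
        z0, _ = model(x)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z0.numpy())
    centers = nn.Parameter(torch.tensor(km.cluster_centers_, dtype=torch.float32))
    opt.add_param_group({"params": [centers]})
    for _ in range(epochs):
        z, x_hat = model(x)
        q = soft_assign(z, centers)
        p = target_distribution(q).detach()
        loss = F.mse_loss(x_hat, x) + gamma * F.kl_div(q.log(), p, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model, centers

In the claimed method only M of the M_0 self-encoders carry this clustering module, with M derived from the proportion of abnormal values in the clinical sample data; that selection logic is omitted from the sketch.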
8. The method of claim 1, wherein verifying the trained diabetes typing model using the verification data and determining whether the diabetes typing model needs to be adjusted based on the verification result to obtain a final diabetes typing model comprises:
inputting the verification data into the trained diabetes typing model to obtain a clustering index radar chart after diabetes typing;
comparing the index data represented in the clustering index radar chart with the characteristics of each type of diabetes in the verification data;
when the comparison result shows that the diabetes typing model accords with the characteristic distribution rule range of the verification data, judging that the currently trained diabetes typing model is the final diabetes typing model;
when the comparison result shows that the diabetes typing model does not accord with the characteristic distribution rule range of the verification data, adjusting the threshold coefficient N of the abnormal values and the number M of encoders by using the first adjustment factor and the second adjustment factor, respectively; and re-acquiring the trained diabetes typing model by using the adjusted threshold coefficient N of the abnormal values and the adjusted number M of encoders, until the verification result of the trained diabetes typing model accords with the characteristic distribution rule range of the verification data.
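Claim 8 turns the clustering result on the verification data into a clustering index radar chart and compares it with the known characteristics of each diabetes type. A rough sketch of producing such a chart is given below, assuming the indices are per-cluster mean values of a handful of clinical features; the feature handling and the matplotlib layout are assumptions, and the comparison against the characteristic distribution rule range is not implemented here.

import numpy as np
import matplotlib.pyplot as plt

def clustering_index_radar(features: np.ndarray, labels: np.ndarray, feature_names):
    """Draw one radar polygon per cluster from the per-cluster mean of each clinical index."""
    angles = np.linspace(0, 2 * np.pi, len(feature_names), endpoint=False).tolist()
    angles += angles[:1]                                   # repeat first angle to close the polygon
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for cluster in np.unique(labels):
        means = features[labels == cluster].mean(axis=0).tolist()
        means += means[:1]
        ax.plot(angles, means, label=f"cluster {cluster}")
        ax.fill(angles, means, alpha=0.1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(feature_names)
    ax.legend(loc="upper right")
    return fig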
9. The method of claim 8, wherein the first and second adjustment factors are obtained by the following formula:
wherein ΔP represents the first adjustment factor; ΔM represents the second adjustment factor; K represents the number of data that do not accord with the characteristic distribution rule range of the verification data; X_mi represents the i-th data value that does not accord with the characteristic distribution rule range of the verification data; X_si represents the scaled data value corresponding to the i-th data value that does not accord with the characteristic distribution rule range of the verification data; X_h represents the data value of the data point within the characteristic distribution rule range nearest to the i-th data value that does not accord with the characteristic distribution rule range of the verification data; X_p represents the mean value of the clinical sample data; X_c represents the standard deviation of the clinical sample data; X_c1 represents the standard deviation corresponding to the verification data.
10. A depth self-encoding based diabetes typing system, the diabetes typing system comprising:
the data extraction module is used for extracting clinical data samples from the diabetes clinical database as training data and verification data;
the model construction and training module is used for constructing a diabetes typing model based on depth self-coding, and training the diabetes typing model by using the training data to obtain a trained diabetes typing model; wherein a K-means clustering module is embedded in part of the depth self-encoders of the diabetes typing model;
and the verification adjustment module is used for verifying the trained diabetes typing model by using the verification data, determining whether the diabetes typing model needs to be adjusted based on the verification result, and obtaining a final diabetes typing model.
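A minimal skeleton of the three modules named in claim 10; the class and method names are illustrative assumptions, and the bodies delegate to the kind of routines sketched under the method claims above (split_clinical_samples and train_clustered_autoencoder), so this is a wiring sketch rather than the claimed system.

import pandas as pd
import torch

class DataExtractionModule:
    """Extracts clinical data samples from the diabetes clinical database and splits them."""
    def __init__(self, database):
        self.database = database  # assumed to expose load_clinical_samples() -> pd.DataFrame
    def extract(self):
        samples = self.database.load_clinical_samples()
        return split_clinical_samples(samples)  # sketch shown after claim 2

class ModelConstructionTrainingModule:
    """Constructs the depth self-coding diabetes typing model and trains it on the training data."""
    def train(self, training_data: pd.DataFrame):
        x = torch.tensor(training_data.to_numpy(dtype="float32"))
        return train_clustered_autoencoder(x)  # sketch shown after claim 7

class ValidationAdjustmentModule:
    """Verifies the trained model on the verification data and flags whether adjustment is needed."""
    def validate(self, model, centers, verification_data: pd.DataFrame):
        # Comparison of the clustering index radar chart against each type's
        # characteristics (claim 8) would be implemented here; omitted in this sketch.
        raise NotImplementedError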
CN202311022792.8A 2023-08-15 2023-08-15 Diabetes typing method and system based on depth self-coding Active CN116738297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311022792.8A CN116738297B (en) 2023-08-15 2023-08-15 Diabetes typing method and system based on depth self-coding

Publications (2)

Publication Number Publication Date
CN116738297A true CN116738297A (en) 2023-09-12
CN116738297B CN116738297B (en) 2023-11-21

Family

ID=87904777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311022792.8A Active CN116738297B (en) 2023-08-15 2023-08-15 Diabetes typing method and system based on depth self-coding

Country Status (1)

Country Link
CN (1) CN116738297B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020082732A1 (en) * 2018-10-26 2020-04-30 平安科技(深圳)有限公司 Automatic picture classification method, device, and computer readable storage medium
CN111178427A (en) * 2019-12-27 2020-05-19 杭州电子科技大学 Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
CN111696660A (en) * 2020-05-13 2020-09-22 平安科技(深圳)有限公司 Artificial intelligence-based patient grouping method, device, equipment and storage medium
CN112884010A (en) * 2021-01-25 2021-06-01 浙江师范大学 Multi-mode self-adaptive fusion depth clustering model and method based on self-encoder
CN114023449A (en) * 2021-11-05 2022-02-08 中山大学 Diabetes risk early warning method and system based on depth self-encoder
CN116563587A (en) * 2023-04-25 2023-08-08 杭州电子科技大学 Method and system for embedded clustering of depth of graph convolution structure based on slimed-Wasserstein distance

Also Published As

Publication number Publication date
CN116738297B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN109639739B (en) Abnormal flow detection method based on automatic encoder network
CN109086805B (en) Clustering method based on deep neural network and pairwise constraints
CN111785329A (en) Single-cell RNA sequencing clustering method based on confrontation automatic encoder
CN113052271B (en) Biological fermentation data prediction method based on deep neural network
CN111161814A (en) DRGs automatic grouping method based on convolutional neural network
CN109740254B (en) Ship diesel engine abrasive particle type identification method based on information fusion
CN117349782B (en) Intelligent data early warning decision tree analysis method and system
CN114548199A (en) Multi-sensor data fusion method based on deep migration network
CN117131022B (en) Heterogeneous data migration method of electric power information system
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN114139624A (en) Method for mining time series data similarity information based on integrated model
CN117095754B (en) Method for classifying proteins by machine learning
CN112464281B (en) Network information analysis method based on privacy grouping and emotion recognition
CN116738297B (en) Diabetes typing method and system based on depth self-coding
CN111863153A (en) Method for predicting total amount of suspended solids in wastewater based on data mining
CN112712855A (en) Joint training-based clustering method for gene microarray containing deletion value
CN116776270A (en) Method and system for detecting micro-service performance abnormality based on transducer
CN116318773A (en) Countermeasure training type unsupervised intrusion detection system and method based on AE model optimization
CN114864004A (en) Deletion mark filling method based on sliding window sparse convolution denoising self-encoder
CN114792026A (en) Method and system for predicting residual life of aircraft engine equipment
CN111882441A (en) User prediction interpretation Treeshap method based on financial product recommendation scene
CN115831339B (en) Medical system risk management and control pre-prediction method and system based on deep learning
CN112070023B (en) Neighborhood prior embedded type collaborative representation mode identification method
CN113485863B (en) Method for generating heterogeneous imbalance fault samples based on improved generation of countermeasure network
CN116777292A (en) Defect rate index correction method based on multi-batch small sample space product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant