CN113160969A - Soft tissue sarcoma recurrence probability prediction method based on machine learning - Google Patents
- Publication number
- CN113160969A CN202110399327.0A
- Authority
- CN
- China
- Prior art keywords
- recurrence
- probability
- gray level
- year
- soft tissue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10088—Magnetic resonance imaging [MRI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Public Health (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Radiology & Medical Imaging (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Quality & Reliability (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma, belonging to the technical field of medical image processing. The method mainly comprises the following steps: S1: calculating the recurrence probability from the base soft tissue sarcoma sample data; S2: screening conventional features and image features in the sample data set; S3: performing conventional feature processing, image feature processing and data set partitioning on the sample data set; S4: combining a BP neural network model and a random forest to construct a recurrence probability prediction model. Based on soft tissue sarcoma patient samples collected by hospitals, the invention uses a sampling approach to calculate three-year and five-year recurrence probability values for individual samples, converts these values using the recurrence time data to obtain an accurate and reliable recurrence probability for each individual patient, and determines the final recurrence probability prediction model according to the difference between predicted and true values.
Description
Technical Field
The invention relates to a soft tissue sarcoma recurrence probability prediction method based on machine learning, belonging to the technical field of medical image processing.
Background
Existing methods for predicting the recurrence probability of soft tissue sarcoma have two main problems. First, doctors judge attributes of the sarcoma such as size, histological type and pathological grade by inspecting medical images on the basis of experience; differences in doctors' ability and experience lead to large discrepancies in these judgments, which can delay treatment. Second, a mathematical model for recurrence risk prediction can be built from certain specific features in soft tissue sarcoma data, but existing models depend excessively on the specific features they use; because soft tissue sarcomas differ widely in morphology and have many, complex features, prediction accuracy is low and reliability is poor.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma. It extracts typical features from magnetic resonance imaging (MRI) images, which are easy to collect at hospitals in various cities, and combines a BP neural network with a random forest algorithm to establish a recurrence probability prediction model, so that the recurrence risk of soft tissue sarcoma can be predicted.
The invention relates to a soft tissue sarcoma recurrence probability prediction method based on machine learning, which comprises the following steps:
S1: calculating the recurrence probability from sample data, namely collecting information on soft tissue sarcoma patients and converting it to obtain the recurrence probability of each individual patient, comprising the following steps:
S11: collecting samples {D_1, D_2, D_3, ..., D_n} of soft tissue sarcoma patients, the suggested number of samples being n ≥ 100;
S12: calculating the recurrence probability of each sample, comprising the following specific steps:
S121: for sample i, forming all subsamples that contain sample i, each subsample containing a specified number of samples;
S122: for each subsample, calculating the 3-year recurrence probability and the 5-year recurrence probability of sample i within that subsample, namely:
in the formula: n_{3-r} and n_{5-r} are, respectively, the number of patients in the subsample who relapsed within 3 years and the number who relapsed within 5 years;
S123: calculating the recurrence probability of sample i, namely:
S124: the three-year recurrence probability and the five-year recurrence probability are thereby known for all samples {D_1, D_2, D_3, ..., D_n};
S125: converting the three-year recurrence probability and the five-year recurrence probability, respectively, using the recurrence time t, namely:
in the formula: the recurrence time t is the number of months after surgery at which recurrence occurred, with t in the range [1, 60];
S2: feature screening for soft tissue sarcoma recurrence: screening conventional features and image features in the sample data set;
S3: feature-based sample data processing: steps S1 and S2 yield the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability of every sample in the collected set {D_1, D_2, D_3, ..., D_n}; the conventional features and image features are then processed, comprising the following sub-steps:
S31: conventional feature processing;
S32: image feature processing: all image features of the samples {D_1, D_2, D_3, ..., D_n} are standardized, the feature values of each image feature being normalized, namely:
S32: data set partitioning: the data are divided into a test set and a training set, wherein the training set is used to train the machine learning algorithm and the test set is used to test its quality; the data set is sorted from high to low by 3-year recurrence probability or 5-year recurrence probability, samples are selected as the test set by sequence number according to a fixed rule, and the remaining data form the training set;
S4: recurrence probability prediction based on machine learning models: steps S1, S2 and S3 yield a complete data set for all samples, and a BP neural network and a random forest are used to map the sample features to the recurrence probability, comprising the following steps:
S41: model training, comprising a BP neural network and a random forest, wherein:
S411: the BP neural network;
S412: the random forest;
S42: model evaluation and determination: the test samples, with their corresponding three-year and five-year recurrence probabilities, are input into the trained neural network and random forest respectively to obtain the predicted three-year recurrence probability and the predicted five-year recurrence probability;
the differences v_3 and v_5 between the predicted and true values for three years and five years are calculated, namely:
the larger the parameters v_3 and v_5, the larger the difference between the predicted and true values, i.e. the larger the error of the corresponding model and the worse its performance;
among the parameters v_ANN and v_RF of all models, the minimum min{v_ANN, v_RF} is selected, and the corresponding model is the soft tissue sarcoma recurrence probability prediction model.
Preferably, in step S11, the sample information collected for each soft tissue sarcoma patient includes: personal information, pathological features, image features, whether the patient relapsed within 3 years after surgery, and whether the patient relapsed within 5 years after surgery.
Preferably, in step S2, the features for soft tissue sarcoma recurrence include:
S21: conventional features, including sex, age and postoperative time;
S22: image features, extracted from MRI images obtained by magnetic resonance equipment.
Preferably, in step S22, the MRI images obtained by the magnetic resonance equipment are divided into T1-weighted imaging and T2-weighted imaging according to the imaging mode.
Preferably, in step S22, T1-weighted imaging includes the following cases:
case one: in the wavelet low-frequency sub-band imaging mode:
(a) the large area high gray level emphasis feature of the gray level size zone matrix;
(b) the small area high gray level emphasis feature of the gray level size zone matrix;
case two: in the wavelet low-high-frequency sub-band imaging mode:
(a) the coarseness feature of the neighbouring gray tone difference matrix;
(b) the total energy feature of the first-order statistics;
case three: in the wavelet high-low-frequency sub-band imaging mode:
(a) the small dependence low gray level emphasis feature of the gray level dependence matrix;
case four: in the wavelet high-low-high-frequency sub-band imaging mode:
(a) the large area high gray level emphasis feature of the gray level size zone matrix;
(b) the small area high gray level emphasis feature of the gray level size zone matrix;
case five: in the 0.5 mm Laplacian of Gaussian three-dimensional imaging mode:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix;
(b) the maximal correlation coefficient feature of the gray level co-occurrence matrix;
(c) the kurtosis feature of the first-order statistics;
case six: in the 1.5 mm Laplacian of Gaussian three-dimensional imaging mode:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix;
(b) the kurtosis feature of the first-order statistics;
case seven: in the original imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix;
(b) the large dependence high gray level emphasis feature of the gray level dependence matrix;
(c) the large area high gray level emphasis feature of the gray level size zone matrix.
Preferably, in step S22, T2-weighted imaging includes the following cases:
case one: in the original imaging mode:
(a) the elongation feature of the shape;
(b) the inverse variance feature of the gray level co-occurrence matrix;
(c) the large dependence high gray level emphasis feature of the gray level dependence matrix;
case two: in the wavelet high-frequency sub-band imaging mode:
(a) the contrast feature of the neighbouring gray tone difference matrix;
(b) the gray level non-uniformity normalized feature of the gray level size zone matrix;
(c) the long run high gray level emphasis feature of the gray level run length matrix;
(d) the mean feature of the first-order statistics;
case three: in the 1.5 mm Laplacian of Gaussian three-dimensional imaging mode:
(a) the 90th percentile feature of the first-order statistics;
(b) the kurtosis feature of the first-order statistics;
case four: in the 0.5 mm Laplacian of Gaussian three-dimensional imaging mode:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix;
(b) the maximal correlation coefficient feature of the gray level co-occurrence matrix;
case five: in the wavelet high-low-high-frequency sub-band imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix;
(b) the cluster shade feature of the gray level co-occurrence matrix;
case six: in the wavelet low-frequency sub-band imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix;
(b) the small area high gray level emphasis feature of the gray level size zone matrix.
Preferably, in step S31, the conventional feature processing comprises:
a) sex: male is 1 and female is 0;
b) age: 0 to 10 years is 0.1, 10 to 20 years is 0.2, 20 to 30 years is 0.3, 30 to 40 years is 0.4, 40 to 50 years is 0.5, 50 to 60 years is 0.6, 60 to 70 years is 0.7, 70 to 80 years is 0.8, 80 to 90 years is 0.9, and over 90 years is 1;
c) postoperative time: the actual number of months m divided by 60.
Preferably, in step S32, the data set partition selects samples by sequence number in an arithmetic progression, i.e. the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, 24th, 27th, 30th, ... samples, as the test set, and the remaining data form the training set.
Preferably, in step S411, the BP neural network is configured as follows:
a) a 5-layer network structure is selected: an input layer, hidden layer 1, hidden layer 2, hidden layer 3 and an output layer, denoted L_in, L_y1, L_y2, L_y3, L_out;
b) the numbers of neurons in the 5 layers are s_in, s_y1, s_y2, s_y3, s_out respectively, where s_y1 takes values in [16, 30], s_y2 in [8, 12] and s_y3 in [3, 5];
c) initial network weights: random values;
d) activation function: the sigmoid function, f(x) = 1 / (1 + e^{-x});
e) error function: the sum of squared errors (SSE);
f) learning rate: in the range [0.1, 0.5].
Preferably, in step S412, the key parameters of the random forest are set as follows:
the number of variables sampled at each iteration is set to 10;
the number of decision trees in the random forest is set to 3000.
The invention has the beneficial effects that:
(1) based on soft tissue sarcoma patient samples collected by hospitals, three-year and five-year recurrence probability values are calculated for individual samples using a sampling approach, and these values are converted using the recurrence time data to obtain an accurate and reliable recurrence probability for each individual soft tissue sarcoma patient;
(2) using the data set of soft tissue sarcoma patients, 33 typical features are extracted, covering age, sex and magnetic resonance imaging (MRI) image features; a BP neural network and a random forest model are established to map the features to the recurrence probability values, and the final soft tissue sarcoma recurrence probability prediction model is determined according to the difference between predicted and true values.
Drawings
FIG. 1 is a flow diagram of the present invention.
FIG. 2 is a flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
As shown in FIG. 1 and FIG. 2, the machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma proceeds as follows. First, the recurrence probability is calculated from the base soft tissue sarcoma sample data; second, conventional features and image features in the sample data set are screened; third, conventional feature processing, image feature processing and data set partitioning are performed on the sample data set; finally, a BP neural network model and a random forest are combined to construct a recurrence probability prediction model.
The invention specifically comprises the following steps:
step S1: calculating the recurrence probability based on sample data:
For a single soft tissue sarcoma patient, the exact recurrence probability is difficult to know directly, so information on a sufficient number of soft tissue sarcoma patients is collected and converted to obtain the recurrence probability of each individual patient.
First, samples {D_1, D_2, D_3, ..., D_n} of soft tissue sarcoma patients are collected, with a suggested sample number of at least 100 (n ≥ 100). The information in each data sample includes: personal information, pathological features, image features, whether the patient relapsed within 3 years after surgery, whether the patient relapsed within 5 years after surgery, and so on.
Second, the recurrence probability is calculated for each sample as follows:
(1) For sample i, all subsamples containing sample i are formed, each subsample containing a specified number of samples.
(2) For each subsample, the 3-year recurrence probability and the 5-year recurrence probability of sample i within that subsample can be calculated, where n_{3-r} and n_{5-r} are, respectively, the number of patients in the subsample who relapsed within 3 years and the number who relapsed within 5 years;
(3) The recurrence probability of sample i is calculated as follows:
(4) The three-year recurrence probability and the five-year recurrence probability are then known for all samples {D_1, D_2, D_3, ..., D_n}.
(5) Using the recurrence time t (t_1, t_2, ..., t_n ∈ [1, 60]; for example, t = 10 denotes recurrence 10 months after surgery), the three-year recurrence probability and the five-year recurrence probability are each converted, with the formula as follows:
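Steps (1) to (4) above, though not the time conversion in step (5), can be illustrated with a minimal Python sketch. The formulas themselves are not legible in this text, so the sketch rests on stated assumptions: the probability within one subsample is taken as the relapse count divided by the subsample size, the probability of sample i is taken as the mean over its subsamples, and random subsamples stand in for "all subsamples". The names `recurrence_probability`, `subsample_size` and `n_subsamples` are hypothetical.

```python
import random
from statistics import mean


def recurrence_probability(relapsed, i, subsample_size, n_subsamples=200, seed=0):
    """Estimate the recurrence probability of sample i from subsamples.

    relapsed: list of 0/1 flags, one per sample (1 = relapsed within the period).
    Assumption: each subsample contributes (relapse count) / (subsample size),
    and the estimate for sample i is the mean over random subsamples containing i.
    """
    rng = random.Random(seed)
    others = [j for j in range(len(relapsed)) if j != i]
    estimates = []
    for _ in range(n_subsamples):
        # Every subsample contains sample i plus randomly drawn other samples.
        members = [i] + rng.sample(others, subsample_size - 1)
        n_r = sum(relapsed[j] for j in members)
        estimates.append(n_r / subsample_size)
    return mean(estimates)
```

Calling the function once with the 3-year relapse flags and once with the 5-year flags would give the two probabilities for a sample under these assumptions.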
step S2: characteristic screening for soft tissue sarcoma recurrence:
The features relevant to soft tissue sarcoma recurrence fall into two main categories: conventional features and medical imaging features. The features selected by the invention as the basis for calculating the recurrence probability are as follows:
(I) Conventional features
(1) Gender, (2) age, (3) postoperative time (month)
(II) medical imaging features
The invention extracts 30 image features from MRI images obtained by magnetic resonance equipment, specifically:
In T1-weighted imaging, wavelet low-frequency sub-band (Wavelet-LLL) imaging mode:
(1) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
(2) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T1-weighted imaging, wavelet low-high-frequency sub-band (Wavelet-LLH) imaging mode:
(3) Coarseness feature of the neighbouring gray tone difference matrix (NGTDM);
(4) Total Energy feature of the first-order statistics (First Order);
In T1-weighted imaging, wavelet high-low-frequency sub-band (Wavelet-HLL) imaging mode:
(5) Small Dependence Low Gray Level Emphasis feature of the gray level dependence matrix (GLDM);
In T1-weighted imaging, wavelet high-low-high-frequency sub-band (Wavelet-HLH) imaging mode:
(6) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
(7) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T1-weighted imaging, 0.5 mm Laplacian of Gaussian three-dimensional (log-sigma-0-5-mm-3D) imaging mode:
(8) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(9) Maximal Correlation Coefficient (MCC) feature of the gray level co-occurrence matrix (glcm);
(10) Kurtosis feature of the first-order statistics (firstorder);
In T1-weighted imaging, 1.5 mm Laplacian of Gaussian three-dimensional (log-sigma-1-5-mm-3D) imaging mode:
(11) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(12) Kurtosis feature of the first-order statistics (firstorder);
In T1-weighted imaging, original (original) imaging mode:
(13) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(14) Large Dependence High Gray Level Emphasis feature of the gray level dependence matrix (gldm);
(15) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T2-weighted imaging, original (original) imaging mode:
(16) Elongation feature of the shape (shape);
(17) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(18) Large Dependence High Gray Level Emphasis feature of the gray level dependence matrix (gldm);
In T2-weighted imaging, wavelet high-frequency sub-band (Wavelet-HHH) imaging mode:
(19) Contrast feature of the neighbouring gray tone difference matrix (NGTDM);
(20) Gray Level Non-Uniformity Normalized feature of the gray level size zone matrix (GLSZM);
(21) Long Run High Gray Level Emphasis feature of the gray level run length matrix (glrlm);
(22) Mean feature of the first-order statistics (firstorder);
In T2-weighted imaging, 1.5 mm Laplacian of Gaussian three-dimensional (log-sigma-1-5-mm-3D) imaging mode:
(23) 90th percentile (90Percentile) feature of the first-order statistics (firstorder);
(24) Kurtosis feature of the first-order statistics (firstorder);
In T2-weighted imaging, 0.5 mm Laplacian of Gaussian three-dimensional (log-sigma-0-5-mm-3D) imaging mode:
(25) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(26) Maximal Correlation Coefficient (MCC) feature of the gray level co-occurrence matrix (glcm);
In T2-weighted imaging, wavelet high-low-high-frequency sub-band (Wavelet-HLH) imaging mode:
(27) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(28) Cluster Shade feature of the gray level co-occurrence matrix (glcm);
In T2-weighted imaging, wavelet low-frequency sub-band (Wavelet-LLL) imaging mode:
(29) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(30) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
step S3: sample data processing based on features:
From steps S1 and S2, the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability of every sample in the collected set {D_1, D_2, D_3, ..., D_n} are obtained. The conventional features and the image features are processed as follows:
(1) Conventional feature processing
a) Sex: male is 1 and female is 0
b) Age: 0 to 10 years is 0.1, 10 to 20 years is 0.2, 20 to 30 years is 0.3, 30 to 40 years is 0.4, 40 to 50 years is 0.5, 50 to 60 years is 0.6, 60 to 70 years is 0.7, 70 to 80 years is 0.8, 80 to 90 years is 0.9, over 90 years is 1
c) Postoperative time: the actual number of months m divided by 60 (m/60)
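The three conventional-feature encodings can be written as a short Python sketch. The function name `encode_conventional` and the string labels for sex are hypothetical. The text does not say which band an exact boundary age such as 10 belongs to, so the sketch assumes it falls in the higher band.

```python
def encode_conventional(sex, age_years, months_after_surgery):
    """Encode sex, age and postoperative time as values in [0, 1]."""
    # Sex: male is 1, female is 0.
    sex_code = 1.0 if sex == "male" else 0.0
    # Age: one step of 0.1 per decade, capped at 1 for ages over 90.
    # Assumption: an exact boundary age (e.g. 10) is placed in the higher band.
    age_code = min(int(age_years // 10) + 1, 10) / 10
    # Postoperative time: months divided by 60 (five years).
    time_code = months_after_surgery / 60
    return [sex_code, age_code, time_code]
```

For example, a 45-year-old male patient 30 months after surgery would be encoded as `[1.0, 0.5, 0.5]`.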
(2) Image feature processing
All image features of the samples {D_1, D_2, D_3, ..., D_n} are standardized: for each image feature, its feature values are normalized, with the formula as follows:
(3) data set partitioning
The data are divided into a test set and a training set; the training set is used to train the machine learning algorithm, and the test set is used to check its quality.
The data set is sorted from large to small by 3-year recurrence probability or 5-year recurrence probability; by sequence number, samples No. 3, No. 6, No. 9, No. 12, No. 15, No. 18, No. 21, No. 24, No. 27, No. 30, ... (an arithmetic progression) are selected as the test set, and the remaining data are used as the training set.
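The partition rule can be sketched in Python as follows. The function name `split_by_rank` and its `step` parameter are hypothetical; the sketch simply sorts by the chosen recurrence probability in descending order and sends every third rank to the test set.

```python
def split_by_rank(samples, probability, step=3):
    """Split samples into (train, test) by rank in descending probability.

    samples: list of sample records.
    probability: function returning the 3-year or 5-year recurrence
                 probability of a sample.
    """
    ranked = sorted(samples, key=probability, reverse=True)
    train, test = [], []
    for position, sample in enumerate(ranked, start=1):
        # Ranks 3, 6, 9, ... form the test set; all others are training data.
        (test if position % step == 0 else train).append(sample)
    return train, test
```

With this rule roughly one third of the samples are held out, and the held-out samples are spread across the whole range of recurrence probabilities.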
Step S4: recurrence probability prediction based on machine learning model:
From steps S1, S2 and S3, a complete data set of all samples is obtained. The method uses a BP neural network (Back Propagation Neural Network) and a random forest (Random Forest) to map the sample features (conventional features and image features) to the 3-year recurrence probability (or the 5-year recurrence probability).
(1) Model training
1) BP neural network
a) A 5-layer network structure is selected: an input layer, hidden layer 1, hidden layer 2, hidden layer 3 and an output layer, denoted L_in, L_y1, L_y2, L_y3, L_out;
b) The numbers of neurons in the layers are s_in, s_y1, s_y2, s_y3, s_out respectively, where s_in = 33 and s_out = 1 correspond to the 33 feature values and the 1 output (3-year or 5-year recurrence probability), s_y1 takes values in [16, 30], s_y2 in [8, 12] and s_y3 in [3, 5];
c) Initial network weights: random values;
d) Activation function: the sigmoid function, f(x) = 1 / (1 + e^{-x});
e) Error function: the sum of squared errors (SSE);
f) Learning rate: in the range [0.1, 0.5].
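The network configuration in items a) to f) can be sketched in plain Python. This is an illustration under assumptions, not the patent's implementation: the class name `BPNetwork`, the bias terms, the initial weight range (-0.5, 0.5), the specific hidden sizes 16, 8 and 3 (picked from the stated ranges), the factor 1/2 in the error used for the gradient, and per-sample updates are all choices made here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    def __init__(self, sizes=(33, 16, 8, 3, 1), learning_rate=0.1, seed=0):
        rng = random.Random(seed)
        self.lr = learning_rate
        # weights[l][j][k]: weight from neuron k in layer l to neuron j in layer l+1.
        self.weights = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                         for _ in range(n_out)]
                        for n_in, n_out in zip(sizes, sizes[1:])]
        self.biases = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
                       for n_out in sizes[1:]]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        activations = [list(x)]
        for w, b in zip(self.weights, self.biases):
            prev = activations[-1]
            activations.append(
                [sigmoid(sum(wi * a for wi, a in zip(row, prev)) + bj)
                 for row, bj in zip(w, b)])
        return activations

    def predict(self, x):
        return self.forward(x)[-1][0]

    def train_step(self, x, target):
        """One gradient-descent update on one sample; returns the SSE before it."""
        acts = self.forward(x)
        out = acts[-1]
        error = sum((o - target) ** 2 for o in out)
        # Output delta for E = SSE / 2, using the sigmoid derivative o * (1 - o).
        deltas = [(o - target) * o * (1 - o) for o in out]
        for l in range(len(self.weights) - 1, -1, -1):
            prev = acts[l]
            if l > 0:
                # Back-propagate through the current weights before updating them.
                prev_deltas = [
                    sum(self.weights[l][j][k] * deltas[j] for j in range(len(deltas)))
                    * prev[k] * (1 - prev[k]) for k in range(len(prev))]
            for j, d in enumerate(deltas):
                for k, a in enumerate(prev):
                    self.weights[l][j][k] -= self.lr * d * a
                self.biases[l][j] -= self.lr * d
            if l > 0:
                deltas = prev_deltas
        return error
```

Because the output neuron is also a sigmoid, predictions stay in (0, 1), which matches a probability target.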
2) Random forest
The key parameters of the algorithm are set as follows:
the number of variables sampled at each iteration is set to 10;
the number of decision trees in the random forest is set to 3000;
(2) model evaluation and determination
All test samples (with their corresponding three-year and five-year recurrence probabilities) are input into the trained neural network and random forest respectively, yielding the predicted three-year recurrence probability and the predicted five-year recurrence probability.
The differences v_3 and v_5 between the predicted and true values for three years and five years are calculated according to the following formula:
The larger the parameters v_3 and v_5, the larger the difference between the predicted and true values, i.e. the larger the error of the corresponding model (neural network or random forest) and the worse its performance.
Among the parameters v_ANN and v_RF of all models, the minimum min{v_ANN, v_RF} is selected; the corresponding model is the soft tissue sarcoma recurrence probability prediction model of the invention. The method can be extended to other fields, regions and samples.
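The selection rule can be sketched in Python. The exact formula for v is not legible in this text, so the sketch assumes a mean absolute difference between predicted and true probabilities; `prediction_gap` and `select_model` are hypothetical names.

```python
def prediction_gap(predicted, actual):
    """Assumed form of v: mean absolute difference between prediction and truth."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def select_model(predictions_by_model, actual):
    """Return the name of the model with the smallest gap, plus all gaps."""
    gaps = {name: prediction_gap(pred, actual)
            for name, pred in predictions_by_model.items()}
    best = min(gaps, key=gaps.get)
    return best, gaps
```

For instance, `select_model({"ANN": ann_predictions, "RF": rf_predictions}, true_values)` would return whichever of the two models has the smaller gap on the test set.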
The invention has the following effects: (1) Based on soft tissue sarcoma patient samples collected by hospitals, three-year and five-year recurrence probability values are calculated for individual samples using a sampling approach, and these values are converted using the recurrence time data to obtain an accurate and reliable recurrence probability for each individual soft tissue sarcoma patient. (2) Using the data set of soft tissue sarcoma patients, 33 typical features are extracted, covering age, sex and magnetic resonance imaging (MRI) image features; a BP neural network and a random forest model are established to map the features to the recurrence probability values, and the final soft tissue sarcoma recurrence probability prediction model is determined according to the difference between predicted and true values.
The invention can be widely applied to medical image processing occasions.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A soft tissue sarcoma recurrence probability prediction method based on machine learning is characterized by comprising the following steps:
S1: calculating the recurrence probability based on sample data, namely obtaining the recurrence probability of a single patient by collecting and converting information of soft tissue sarcoma patients, comprising the following steps:
S11: collecting samples $\{D_1, D_2, D_3, \ldots, D_n\}$ of soft tissue sarcoma patients, the suggested number of samples being $n \geq 100$;
S12: calculating the recurrence probability of each sample, comprising the following specific steps:
S121: for sample i, forming all subsamples that contain sample i, each subsample containing a specified number of samples;
S122: for each subsample, calculating the 3-year recurrence probability and the 5-year recurrence probability of sample i in that subsample, namely:
in the formula: $n_{3-r}$ and $n_{5-r}$ are, respectively, the 3-year recurrence rate and the 5-year recurrence rate in the subsample;
S123: calculating the recurrence probability of sample i, namely:
S124: the three-year recurrence probability and the five-year recurrence probability are then known for all samples $\{D_1, D_2, D_3, \ldots, D_n\}$;
S125: converting the three-year recurrence probability and the five-year recurrence probability using the recurrence time t, respectively, namely:
in the formula: the recurrence time t denotes the post-operative month in which recurrence occurs, with t in the range [1, 60];
S2: feature screening for soft tissue sarcoma recurrence: screening conventional features and image features in the sample data set;
S3: feature-based sample data processing: according to step S1 and step S2, the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability corresponding to the collected samples $\{D_1, D_2, D_3, \ldots, D_n\}$ are obtained, and the conventional features and image features of all samples are processed, comprising the following steps:
S31: conventional feature processing;
S32: image feature processing: all image features of the samples $\{D_1, D_2, D_3, \ldots, D_n\}$ are standardized, the feature values of each image feature being normalized, namely:
S32: data set partitioning: dividing the data into a test set and a training set, wherein the training set is used to train the machine learning algorithm and the test set is used to test the quality of the machine learning algorithm; the data set is sorted from high to low by the 3-year recurrence probability or the 5-year recurrence probability, samples are selected as the test set by sequence number according to a fixed rule, and the remaining data are used as the training set;
S4: recurrence probability prediction based on machine learning models: according to steps S1, S2 and S3, a complete data set of all samples is obtained, and the mapping between the sample features and the recurrence probability is realized using a BP neural network and a random forest, comprising the following steps:
S41: model training, covering a BP neural network and a random forest, wherein:
S411: BP neural network;
S412: random forest;
S42: model evaluation and determination: the test samples, together with their corresponding three-year and five-year recurrence probabilities, are input into the trained neural network and the trained random forest, respectively, to obtain the predicted three-year recurrence probability and the predicted five-year recurrence probability;
the differences $v_3$ and $v_5$ between the predicted and true values for the three-year and five-year periods are calculated, namely:
the larger the values of the parameters $v_3$ and $v_5$, the larger the difference between the predicted value and the true value, that is, the larger the error of the corresponding model and the worse its performance;
among the parameters $v_{ANN}$ and $v_{RF}$ of all models, the minimum value $\min\{v_{ANN}, v_{RF}\}$ is selected, and the corresponding model is the soft tissue sarcoma recurrence probability prediction model.
2. The method according to claim 1, wherein the sample information of soft tissue sarcoma patients collected in step S11 includes: personal information, pathological characteristics, image features, whether recurrence occurred within 3 years after surgery, and whether recurrence occurred within 5 years after surgery.
3. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein the features of soft tissue sarcoma recurrence in step S2 include:
S21: conventional features, including sex, age and post-operative time;
S22: image features, extracted from MRI images obtained by magnetic resonance equipment.
4. The method as claimed in claim 3, wherein the MRI images obtained by MRI apparatus in step S22 are divided into T1 weighted imaging and T2 weighted imaging according to different imaging modes.
5. The method of claim 3, wherein in step S22, the image features for T1 weighted imaging include the following cases:
case one: in wavelet-low frequency subband imaging mode:
(a) large-area high-gray-level factor characteristics of the gray-level area matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix;
case two: in wavelet-low high frequency sub-band imaging mode:
(a) roughness characteristics of adjacent gray level difference matrices;
(b) total energy characteristics of the first order statistics;
case three: in wavelet-high-low-frequency subband imaging mode:
(a) a small dependence low gray level factor characteristic of the gray level correlation matrix;
case four: in wavelet-high-low-high-frequency sub-band imaging mode:
(a) large-area high-gray-level factor characteristics of the gray-level area matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix;
case five: under the three-dimensional imaging mode of the 5mm Laplacian:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) the maximal correlation coefficient characteristics of the gray level co-occurrence matrix;
(c) a kurtosis characteristic of the first order statistic;
case six: under a 15mm Laplacian three-dimensional imaging mode:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) a kurtosis characteristic of the first order statistic;
case seven: in the original imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) the large dependence high gray level factor characteristic of the gray difference matrix;
(c) large area high gray level factor characteristics of the gray area matrix.
6. The method of claim 3, wherein in step S22, the image features for T2 weighted imaging include the following cases:
case one: in the original imaging mode:
(a) elongation characteristics of the shape;
(b) the inverse variance characteristic of the gray level co-occurrence matrix;
(c) the large dependence high gray level factor characteristic of the gray difference matrix;
case two: in wavelet-high frequency sub-band imaging mode:
(a) contrast characteristics of adjacent gray level difference matrices;
(b) non-uniform normalization of gray levels of the gray level area matrix;
(c) long-run high-gray-scale factor characteristics of the gray-scale run matrix;
(d) mean feature of the first order statistics;
case three: under a 15mm Laplacian three-dimensional imaging mode:
(a) a 90th percentile feature of the first order statistic;
(b) a kurtosis characteristic of the first order statistic;
case four: under the three-dimensional imaging mode of the 5mm Laplacian:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) the maximal correlation coefficient characteristics of the gray level co-occurrence matrix;
case five: in wavelet-high-low-high-frequency sub-band imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) cluster shade features of the gray level co-occurrence matrix;
case six: in wavelet-low frequency subband imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix.
7. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein the conventional feature processing in step S31 comprises the following:
a) sex: male is 1 and female is 0;
b) age: 0 to 10 years is 0.1, 10 to 20 years is 0.2, 20 to 30 years is 0.3, 30 to 40 years is 0.4, 40 to 50 years is 0.5, 50 to 60 years is 0.6, 60 to 70 years is 0.7, 70 to 80 years is 0.8, 80 to 90 years is 0.9, and 90 years or older is 1;
c) post-operative time: the actual number of months m divided by 60.
8. The method of claim 7, wherein in step S32, the data set division selects samples whose sequence numbers form an arithmetic progression, i.e. the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, 24th, 27th, 30th, … samples, as the test set, and the remaining data as the training set.
9. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein in step S411, the BP neural network comprises the following contents:
a) selecting a 5-layer network structure: namely an input layer, a hidden layer 1, a hidden layer 2, a hidden layer 3 and an output layer, denoted $L_{in}, L_{y1}, L_{y2}, L_{y3}, L_{out}$;
b) number of neurons in the 5 layers: $s_{in}, s_{y1}, s_{y2}, s_{y3}, s_{out}$, respectively, wherein $s_{y1}$ takes values in [16, 30], $s_{y2}$ takes values in [8, 12], and $s_{y3}$ takes values in [3, 5];
c) network initial weights: random values;
d) activation function: the sigmoid function, calculated as $f(x) = \frac{1}{1 + e^{-x}}$;
e) error function: the sum of squared errors (SSE);
f) learning rate: the value range is [0.1,0.5 ].
10. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein in step S412, the key parameters involved in random forest are set as follows:
the variable sampling value of each iteration is set to be 10;
the number of decision trees contained in the random forest was set to 3000.
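For illustration only, the conventional feature coding of claim 7 and the every-third-sample split of claim 8 can be sketched in Python as below. This is a minimal sketch under stated assumptions: the age coding is read as ten-year bins mapped to 0.1 through 1.0, each bin is assumed to include its lower bound, and the samples are assumed to be already sorted from high to low by recurrence probability. All function names are hypothetical and do not come from the patent.

```python
def encode_sex(sex):
    """Male -> 1, female -> 0, as stated in claim 7."""
    return {"male": 1, "female": 0}[sex.lower()]


def encode_age(age_years):
    """Map age to 0.1 ... 1.0 in ten-year bins (assumed reading of claim 7).

    Assumption: each bin includes its lower bound, so age 10 falls in the
    10-20 bin; ages of 90 and above map to 1.0.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    bin_index = min(int(age_years // 10), 9)
    return round((bin_index + 1) / 10, 1)


def encode_postop_months(months):
    """Scale the post-operative time in months by 60 (five years)."""
    return months / 60


def split_every_third(samples):
    """Use the 3rd, 6th, 9th, ... samples as the test set, the rest for training.

    Assumption: `samples` is already sorted from high to low by the 3-year
    or 5-year recurrence probability.
    """
    test = [s for pos, s in enumerate(samples, start=1) if pos % 3 == 0]
    train = [s for pos, s in enumerate(samples, start=1) if pos % 3 != 0]
    return train, test


# Hypothetical patient: female, 47 years old, 18 months after surgery.
features = [encode_sex("Female"), encode_age(47), encode_postop_months(18)]

# Ten placeholder sample identifiers standing in for sorted patient records.
train, test = split_every_third(list(range(1, 11)))
print(features, train, test)
```

Selecting test samples at a fixed stride from the sorted list keeps the range of recurrence probabilities in the test set similar to that of the training set, which is one plausible motivation for the rule in claim 8.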
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110399327.0A CN113160969A (en) | 2021-04-14 | 2021-04-14 | Soft tissue sarcoma recurrence probability prediction method based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113160969A (en) | 2021-07-23 |
Family
ID=76890329
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202110399327.0A CN113160969A (en) | Pending | 2021-04-14 | 2021-04-14 | Soft tissue sarcoma recurrence probability prediction method based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160969A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105368925A (en) * | 2014-08-23 | 2016-03-02 | 陈铁华 | Biomarker used for prognosis of lung cancer |
CN109599181A (en) * | 2019-01-09 | 2019-04-09 | 中国医学科学院肿瘤医院 | A kind of Prediction of survival system and prediction technique being directed to T3-LARC patient before the treatment |
CN110660481A (en) * | 2019-09-27 | 2020-01-07 | 颐保医疗科技(上海)有限公司 | Artificial intelligence technology-based primary liver cancer recurrence prediction method |
CN112489035A (en) * | 2020-12-14 | 2021-03-12 | 青岛大学附属医院 | Soft tissue sarcoma grade judgment method based on machine learning |
CN112561869A (en) * | 2020-12-09 | 2021-03-26 | 深圳大学 | Pancreatic neuroendocrine tumor postoperative recurrence risk prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210723 |