CN113160969A - Soft tissue sarcoma recurrence probability prediction method based on machine learning - Google Patents

Info

Publication number
CN113160969A
Authority
CN
China
Prior art keywords
recurrence
probability
gray level
year
soft tissue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110399327.0A
Other languages
Chinese (zh)
Inventor
王鹤翔
杨海强
郝大鹏
刘银华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University
Affiliated Hospital of University of Qingdao
Original Assignee
Qingdao University
Affiliated Hospital of University of Qingdao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University and Affiliated Hospital of University of Qingdao
Priority to CN202110399327.0A
Publication of CN113160969A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10072 Tomographic images
    • G06T2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma, belonging to the technical field of medical image processing. The method mainly comprises the following steps: S1: calculating the recurrence probability from the basic soft tissue sarcoma sample data; S2: screening the conventional features and image features in the sample data set; S3: performing conventional feature processing, image feature processing and data set division on the sample data set; S4: combining a BP neural network model and a random forest to construct a recurrence probability prediction model. Based on soft tissue sarcoma patient samples collected by hospitals, the invention uses the idea of subsampling to calculate three-year and five-year recurrence probability values for individual samples, converts these values using the recurrence time data to obtain an accurate and reliable recurrence probability for each individual soft tissue sarcoma patient, and determines the final recurrence probability prediction model according to the difference between predicted and true values.

Description

Soft tissue sarcoma recurrence probability prediction method based on machine learning
Technical Field
The invention relates to a soft tissue sarcoma recurrence probability prediction method based on machine learning, belonging to the technical field of medical image processing.
Background
Existing methods for predicting the recurrence probability of soft tissue sarcoma have two main problems. First, doctors judge properties of the sarcoma, such as size, histological type and pathological grade, by observing medical images and relying on experience; differences in doctors' ability and experience lead to large differences in judgment, which can delay treatment. Second, a mathematical model for recurrence risk prediction can be built from specific feature information in soft tissue sarcoma data, but existing models depend excessively on the specific features they use; because soft tissue sarcomas vary widely in morphology and have many complex features, prediction accuracy is low and reliability is poor.
Disclosure of Invention
To address the defects of the prior art, the invention provides a machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma. The method extracts typical features from magnetic resonance imaging (MRI) images, which are easy to collect in hospitals in most cities, and combines a BP neural network with a random forest algorithm to establish a recurrence probability prediction model, so that the recurrence risk of soft tissue sarcoma can be predicted.
The invention relates to a soft tissue sarcoma recurrence probability prediction method based on machine learning, which comprises the following steps:
S1: calculating the recurrence probability based on sample data, namely collecting information on soft tissue sarcoma patients and converting it to obtain the recurrence probability of a single patient, comprising the following steps:
S11: collecting samples $\{D_1, D_2, D_3, \ldots, D_n\}$ of soft tissue sarcoma patients, where the suggested number of samples n is at least 100;
S12: calculating the recurrence probability of each sample, with the following specific steps:
S121: for sample i, forming all subsamples that contain sample i
Figure BDA0003019802250000011
each subsample containing
Figure BDA0003019802250000012
samples;
S122: for the subsample
Figure BDA0003019802250000013
calculating the 3-year recurrence probability of sample i within the subsample
Figure BDA0003019802250000014
and the 5-year recurrence probability
Figure BDA0003019802250000015
namely:
Figure BDA0003019802250000016
Figure BDA0003019802250000017
where $n_{3-r}$ and $n_{5-r}$ are, respectively, the number of patients in the subsample
Figure BDA0003019802250000021
who relapsed within 3 years and the number who relapsed within 5 years;
S123: calculating the recurrence probability of sample i, namely:
Figure BDA0003019802250000022
Figure BDA0003019802250000023
S124: the three-year recurrence probability of all samples $\{D_1, D_2, D_3, \ldots, D_n\}$
Figure BDA0003019802250000024
and the five-year recurrence probability
Figure BDA0003019802250000025
are then known;
S125: converting the three-year and five-year recurrence probabilities, respectively, using the recurrence time t, namely:
Figure BDA0003019802250000026
Figure BDA0003019802250000027
where the recurrence time t is the number of months after surgery at which recurrence occurred, with t in the range [1, 60];
S2: feature screening for soft tissue sarcoma recurrence: screening the conventional features and image features in the sample data set;
S3: feature-based sample data processing: from steps S1 and S2, the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability of all acquired samples $\{D_1, D_2, D_3, \ldots, D_n\}$ are obtained, and the conventional features and image features are processed in the following steps:
S31: conventional feature processing;
S32: image feature processing: all image features of the samples $\{D_1, D_2, D_3, \ldots, D_n\}$
Figure BDA0003019802250000028
are standardized: for each image feature
Figure BDA0003019802250000029
its feature values
Figure BDA00030198022500000210
are normalized, namely:
Figure BDA00030198022500000211
S33: data set partitioning: dividing the data into a test set and a training set, wherein the training set is used to train the machine learning algorithm and the test set is used to test its quality; the data set is sorted from high to low by the 3-year or 5-year recurrence probability, samples are selected as the test set according to a fixed rule on their sequence numbers, and the remaining data are used as the training set;
S4: recurrence probability prediction based on machine learning models: according to steps S1, S2 and S3, a complete data set of all samples is obtained, and a BP neural network and a random forest are used to map the sample features to the recurrence probability, comprising the following steps:
S41: model training, comprising a BP neural network and a random forest, wherein:
S411: BP neural network;
S412: random forest;
S42: model evaluation and determination: the test samples, with their corresponding three-year recurrence probability
Figure BDA0003019802250000031
and five-year recurrence probability
Figure BDA0003019802250000032
are input into the trained neural network and the trained random forest, respectively, to obtain the predicted three-year recurrence probability
Figure BDA0003019802250000033
and the predicted five-year recurrence probability
Figure BDA0003019802250000034
and the differences $v_3$ and $v_5$ between the predicted and true values for three years and five years are calculated, namely:
Figure BDA0003019802250000035
Figure BDA0003019802250000036
the larger the values of $v_3$ and $v_5$, the larger the difference between the predicted and true values, i.e., the larger the error of the corresponding model and the worse its performance;
among the parameters $v_{ANN}$ and $v_{RF}$ of all models, the minimum value $\min\{v_{ANN}, v_{RF}\}$ is selected, and the corresponding model is the soft tissue sarcoma recurrence probability prediction model.
Preferably, in step S11, the collecting sample information of the soft tissue sarcoma patient includes: personal information, pathological characteristics, image characteristics, whether the patient relapses in 3 years after the operation and whether the patient relapses in 5 years after the operation.
Preferably, in step S2, the characteristics of soft tissue sarcoma recurrence include:
S21: conventional features, including gender, age and postoperative time;
S22: image features, extracted from MRI images obtained by nuclear magnetic resonance equipment.
Preferably, in step S22, the MRI images obtained by the nuclear magnetic resonance equipment are divided into T1-weighted imaging and T2-weighted imaging according to the imaging mode.
Preferably, in the step S22, T1 weighted imaging includes the following cases:
case one: in the wavelet low-frequency sub-band (LLL) imaging mode:
(a) the large area high gray level emphasis feature of the gray level size zone matrix (GLSZM);
(b) the small area high gray level emphasis feature of the gray level size zone matrix (GLSZM);
case two: in the wavelet low-low-high frequency sub-band (LLH) imaging mode:
(a) the coarseness feature of the neighbouring gray tone difference matrix (NGTDM);
(b) the total energy feature of the first-order statistics;
case three: in the wavelet high-low-low frequency sub-band (HLL) imaging mode:
(a) the small dependence low gray level emphasis feature of the gray level dependence matrix (GLDM);
case four: in the wavelet high-low-high frequency sub-band (HLH) imaging mode:
(a) the large area high gray level emphasis feature of the gray level size zone matrix (GLSZM);
(b) the small area high gray level emphasis feature of the gray level size zone matrix (GLSZM);
case five: in the three-dimensional Laplacian of Gaussian imaging mode with sigma 0.5 mm:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix (GLDM);
(b) the maximal correlation coefficient (MCC) feature of the gray level co-occurrence matrix (GLCM);
(c) the kurtosis feature of the first-order statistics;
case six: in the three-dimensional Laplacian of Gaussian imaging mode with sigma 1.5 mm:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix (GLDM);
(b) the kurtosis feature of the first-order statistics;
case seven: in the original imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix (GLCM);
(b) the large dependence high gray level emphasis feature of the gray level dependence matrix (GLDM);
(c) the large area high gray level emphasis feature of the gray level size zone matrix (GLSZM).
Preferably, in the step S22, T2 weighted imaging includes the following cases:
case one: in the original imaging mode:
(a) the elongation feature of the shape;
(b) the inverse variance feature of the gray level co-occurrence matrix (GLCM);
(c) the large dependence high gray level emphasis feature of the gray level dependence matrix (GLDM);
case two: in the wavelet high-frequency sub-band (HHH) imaging mode:
(a) the contrast feature of the neighbouring gray tone difference matrix (NGTDM);
(b) the gray level non-uniformity normalized feature of the gray level size zone matrix (GLSZM);
(c) the long run high gray level emphasis feature of the gray level run length matrix (GLRLM);
(d) the mean feature of the first-order statistics;
case three: in the three-dimensional Laplacian of Gaussian imaging mode with sigma 1.5 mm:
(a) the 90th percentile feature of the first-order statistics;
(b) the kurtosis feature of the first-order statistics;
case four: in the three-dimensional Laplacian of Gaussian imaging mode with sigma 0.5 mm:
(a) the dependence non-uniformity normalized feature of the gray level dependence matrix (GLDM);
(b) the maximal correlation coefficient (MCC) feature of the gray level co-occurrence matrix (GLCM);
case five: in the wavelet high-low-high frequency sub-band (HLH) imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix (GLCM);
(b) the cluster shade feature of the gray level co-occurrence matrix (GLCM);
case six: in the wavelet low-frequency sub-band (LLL) imaging mode:
(a) the inverse variance feature of the gray level co-occurrence matrix (GLCM);
(b) the small area high gray level emphasis feature of the gray level size zone matrix (GLSZM).
Preferably, in step S31, the conventional feature processing includes the following steps:
a) sex: male 1 and female 0;
b) age: 0.1 for 0-10 years old, 0.2 for 10-20 years old, 0.3 for 20-30 years old, 0.4 for 30-40 years old, 0.5 for 40-50 years old, 0.6 for 50-60 years old, 0.7 for 60-70 years old, 0.8 for 70-80 years old, 0.9 for 80-90 years old, and 1 for over 90 years old;
c) postoperative time: the actual number of months m divided by 60.
Preferably, in the data set partitioning of step S3, samples whose sequence numbers form an arithmetic progression, i.e., the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, 24th, 27th, 30th, … samples, are selected as the test set, and the remaining data are used as the training set.
Preferably, in step S411, the BP neural network includes the following contents:
a) a 5-layer network structure is selected: an input layer, hidden layer 1, hidden layer 2, hidden layer 3 and an output layer, denoted $L_{in}, L_{y1}, L_{y2}, L_{y3}, L_{out}$;
b) the numbers of neurons in the 5 layers are, respectively, $s_{in}, s_{y1}, s_{y2}, s_{y3}, s_{out}$, where $s_{y1}$ takes a value in [16, 30], $s_{y2}$ takes a value in [8, 12], and $s_{y3}$ takes a value in [3, 5];
c) initial network weights: random values;
d) activation function: the sigmoid function, calculated as $f(x) = \frac{1}{1 + e^{-x}}$;
e) error function: the sum of squared errors (SSE);
f) learning rate: a value in the range [0.1, 0.5].
Preferably, in step S412, the key parameters involved in the random forest are set as follows:
the number of variables sampled at each iteration is set to 10;
the number of decision trees in the random forest is set to 3000.
The invention has the beneficial effects that:
(1) based on soft tissue sarcoma patient samples collected by a hospital, three-year and five-year recurrence probability values are calculated using the idea of subsampling and are converted using the recurrence time data, yielding an accurate and reliable recurrence probability for each individual soft tissue sarcoma patient;
(2) using a data set of soft tissue sarcoma patients, 33 typical features are extracted, covering age, sex and features derived from magnetic resonance imaging (MRI) images; a BP neural network and a random forest model are established to map these features to recurrence probability values, and the final soft tissue sarcoma recurrence probability prediction model is determined according to the difference between predicted and true values.
Drawings
FIG. 1 is a flow diagram of the present invention.
FIG. 2 is a flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
As shown in FIG. 1 and FIG. 2, the machine-learning-based method for predicting the recurrence probability of soft tissue sarcoma proceeds as follows. First, the recurrence probability is calculated from the basic soft tissue sarcoma sample data; second, the conventional features and image features in the sample data set are screened; third, conventional feature processing, image feature processing and data set division are performed on the sample data set; finally, a BP neural network model and a random forest are combined to construct the recurrence probability prediction model.
The invention specifically comprises the following steps:
Step S1: calculating the recurrence probability based on sample data:
For a single soft tissue sarcoma patient it is difficult to know the exact recurrence probability, so information on a sufficient number of soft tissue sarcoma patients is collected and converted to obtain the recurrence probability of a single patient.
First, samples $\{D_1, D_2, D_3, \ldots, D_n\}$ of soft tissue sarcoma patients are collected, with a suggested sample number of at least 100 (n ≥ 100). The sample information contains: personal information, pathological characteristics, image characteristics, whether the patient relapsed within 3 years after surgery, whether the patient relapsed within 5 years after surgery, and the like.
Second, the recurrence probability is calculated for each sample as follows:
(1) For sample i, all subsamples containing sample i are formed
Figure BDA0003019802250000071
Each subsample contains
Figure BDA0003019802250000072
samples.
(2) For the subsample
Figure BDA0003019802250000073
the 3-year recurrence probability of sample i within this subsample,
Figure BDA0003019802250000074
and the 5-year recurrence probability,
Figure BDA0003019802250000075
are calculated as
Figure BDA0003019802250000076
and
Figure BDA0003019802250000077
where $n_{3-r}$ and $n_{5-r}$ are, respectively, the number of patients in the subsample
Figure BDA0003019802250000078
who relapsed within 3 years and the number who relapsed within 5 years;
(3) The recurrence probability of sample i is calculated as follows:
Figure BDA0003019802250000079
Figure BDA00030198022500000710
(4) The three-year recurrence probability of all samples $\{D_1, D_2, D_3, \ldots, D_n\}$,
Figure BDA00030198022500000711
and the five-year recurrence probability,
Figure BDA00030198022500000712
are then known.
(5) Using the recurrence time t ($t_1, t_2, \ldots, t_n \in [1, 60]$; for example, t = 10 denotes recurrence 10 months after surgery), the three-year and five-year recurrence probabilities are each converted according to the following formulas:
Figure BDA00030198022500000713
Figure BDA00030198022500000714
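The formula images for step S1 are not reproduced in this text, so the following Python sketch only illustrates the subsampling idea of items (1) to (4) under stated assumptions: the recurrence probability within a subsample is taken to be the fraction of its members who relapsed, the probability of sample i is taken to be the mean of that fraction over subsamples containing i, and random subsamples stand in for the full enumeration. The function name, the subsample size and the number of subsamples are hypothetical, and the time-based conversion of item (5) is not covered.

```python
import random

def subsample_recurrence_probability(relapsed, i, subsample_size, n_subsamples, seed=0):
    """Average relapse fraction over random subsamples that all contain sample i.

    relapsed: list of 0/1 flags (1 = relapsed within the period of interest).
    Assumption: subsample probability = relapsed count / subsample size.
    """
    rng = random.Random(seed)
    others = [j for j in range(len(relapsed)) if j != i]
    fractions = []
    for _ in range(n_subsamples):
        # Every subsample contains sample i plus randomly drawn other samples.
        members = [i] + rng.sample(others, subsample_size - 1)
        n_r = sum(relapsed[j] for j in members)
        fractions.append(n_r / subsample_size)
    return sum(fractions) / len(fractions)

# Hypothetical cohort of 10 patients; 1 marks relapse within 3 years.
relapsed_3y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p3 = [
    subsample_recurrence_probability(relapsed_3y, i, subsample_size=5, n_subsamples=200)
    for i in range(len(relapsed_3y))
]
```

Under these assumptions a patient who relapsed receives a higher value than one who did not, because the patient's own outcome is counted in every subsample.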
Step S2: feature screening for soft tissue sarcoma recurrence:
The features relevant to soft tissue sarcoma recurrence fall into two categories: conventional features and medical imaging features. The features screened by the invention as the basis for calculating the recurrence probability are as follows:
(I) Conventional features
(1) Gender, (2) age, (3) postoperative time (months)
(II) Medical imaging features
The invention extracts 30 image features from MRI images obtained by nuclear magnetic resonance equipment, specifically:
In T1-weighted imaging, wavelet low-frequency sub-band (Wavelet-LLL) imaging mode:
(1) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
(2) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T1-weighted imaging, wavelet low-low-high frequency sub-band (Wavelet-LLH) imaging mode:
(3) Coarseness feature of the neighbouring gray tone difference matrix (NGTDM);
(4) Total Energy feature of the first-order statistics (First Order);
In T1-weighted imaging, wavelet high-low-low frequency sub-band (Wavelet-HLL) imaging mode:
(5) Small Dependence Low Gray Level Emphasis feature of the gray level dependence matrix (GLDM);
In T1-weighted imaging, wavelet high-low-high frequency sub-band (Wavelet-HLH) imaging mode:
(6) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
(7) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T1-weighted imaging, three-dimensional Laplacian of Gaussian imaging mode with sigma 0.5 mm (log-sigma-0-5-mm-3D):
(8) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(9) Maximal Correlation Coefficient (MCC) feature of the gray level co-occurrence matrix (glcm);
(10) Kurtosis feature of the first-order statistics (firstorder);
In T1-weighted imaging, three-dimensional Laplacian of Gaussian imaging mode with sigma 1.5 mm (log-sigma-1-5-mm-3D):
(11) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(12) Kurtosis feature of the first-order statistics (firstorder);
In T1-weighted imaging, original imaging mode:
(13) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(14) Large Dependence High Gray Level Emphasis feature of the gray level dependence matrix (gldm);
(15) Large Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
In T2-weighted imaging, original imaging mode:
(16) Elongation feature of the shape (shape);
(17) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(18) Large Dependence High Gray Level Emphasis feature of the gray level dependence matrix (gldm);
In T2-weighted imaging, wavelet high-frequency sub-band (Wavelet-HHH) imaging mode:
(19) Contrast feature of the neighbouring gray tone difference matrix (NGTDM);
(20) Gray Level Non-Uniformity Normalized feature of the gray level size zone matrix (GLSZM);
(21) Long Run High Gray Level Emphasis feature of the gray level run length matrix (glrlm);
(22) Mean feature of the first-order statistics (firstorder);
In T2-weighted imaging, three-dimensional Laplacian of Gaussian imaging mode with sigma 1.5 mm (log-sigma-1-5-mm-3D):
(23) 90th percentile (90Percentile) feature of the first-order statistics (firstorder);
(24) Kurtosis feature of the first-order statistics (firstorder);
In T2-weighted imaging, three-dimensional Laplacian of Gaussian imaging mode with sigma 0.5 mm (log-sigma-0-5-mm-3D):
(25) Dependence Non-Uniformity Normalized feature of the gray level dependence matrix (gldm);
(26) Maximal Correlation Coefficient (MCC) feature of the gray level co-occurrence matrix (glcm);
In T2-weighted imaging, wavelet high-low-high frequency sub-band (Wavelet-HLH) imaging mode:
(27) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(28) Cluster Shade feature of the gray level co-occurrence matrix (glcm);
In T2-weighted imaging, wavelet low-frequency sub-band (Wavelet-LLL) imaging mode:
(29) Inverse Variance feature of the gray level co-occurrence matrix (glcm);
(30) Small Area High Gray Level Emphasis feature of the gray level size zone matrix (GLSZM);
Step S3: feature-based sample data processing:
From steps S1 and S2, the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability of all samples in the acquired set $\{D_1, D_2, D_3, \ldots, D_n\}$ are obtained. The conventional features and the image features are processed as follows:
(1) Conventional feature processing
a) Sex: male is 1 and female is 0
b) Age: 0.1 for 0-10 years old, 0.2 for 10-20 years old, 0.3 for 20-30 years old, 0.4 for 30-40 years old, 0.5 for 40-50 years old, 0.6 for 50-60 years old, 0.7 for 60-70 years old, 0.8 for 70-80 years old, 0.9 for 80-90 years old, 1 for over 90 years old
c) Postoperative time: the actual number of months m divided by 60 (m/60)
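The three encodings above can be sketched in Python as follows. The function name and the string values for sex are hypothetical, and because the text does not say which bin a boundary age such as exactly 10 belongs to, this sketch assumes the lower bin.

```python
import math

def encode_conventional(sex, age_years, months_after_surgery):
    """Encode sex, age and postoperative time as described in items a) to c)."""
    # a) Sex: male is 1, female is 0.
    sex_code = 1.0 if sex == "male" else 0.0
    # b) Age: one step of 0.1 per decade, capped at 1 for ages over 90.
    #    Assumption: a boundary age (10, 20, ...) falls into the lower bin.
    decade = min(10, max(1, math.ceil(age_years / 10)))
    age_code = decade / 10
    # c) Postoperative time: months divided by 60.
    time_code = months_after_surgery / 60
    return [sex_code, age_code, time_code]

example = encode_conventional("male", 45, 30)
```

For instance, a 45-year-old male patient 30 months after surgery is encoded as `[1.0, 0.5, 0.5]`.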
(2) Image feature processing
All image features of the samples $\{D_1, D_2, D_3, \ldots, D_n\}$,
Figure BDA0003019802250000101
are standardized: for each image feature
Figure BDA0003019802250000102
its feature values
Figure BDA0003019802250000103
are normalized according to the following formula:
Figure BDA0003019802250000104
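The normalization formula itself is an image that is not reproduced in this text. As an illustration only, the Python sketch below assumes min-max scaling of each image feature to [0, 1] across all samples, which matches the [0, 1] range of the conventional feature encodings; the patent's actual formula may differ. The function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale one feature's values to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def normalize_features(rows):
    """Normalize every feature column of a samples-by-features table."""
    columns = list(zip(*rows))
    scaled = [min_max_normalize(col) for col in columns]
    return [list(r) for r in zip(*scaled)]

# Hypothetical table: 3 samples, 2 image features.
normalized = normalize_features([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
```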
(3) Data set partitioning
The data set is divided into a test set and a training set: the training set is used to train the machine learning algorithm, and the test set is used to check its quality.
The data set is sorted from large to small by the 3-year or 5-year recurrence probability; the samples numbered 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, … (an arithmetic progression) are selected as the test set, and the remaining data are used as the training set.
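The partitioning rule can be sketched in Python as below: sort by recurrence probability in descending order, then send every third sample (ranks 3, 6, 9, …) to the test set. The function name and the `step` parameter are hypothetical, and how ties in probability are ordered is not specified in the text.

```python
def split_by_rank(samples, probabilities, step=3):
    """Sort by probability (high to low); every `step`-th rank goes to the test set."""
    order = sorted(range(len(samples)), key=lambda k: probabilities[k], reverse=True)
    train, test = [], []
    for rank, k in enumerate(order, start=1):
        (test if rank % step == 0 else train).append(samples[k])
    return train, test

# Hypothetical sample identifiers and recurrence probabilities.
ids = ["a", "b", "c", "d", "e", "f", "g"]
probs = [0.1, 0.7, 0.3, 0.9, 0.5, 0.2, 0.8]
train_ids, test_ids = split_by_rank(ids, probs)
```

Because the test samples are taken at regular intervals along the sorted list, the test set covers the whole range of recurrence probabilities rather than one end of it.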
Step S4: recurrence probability prediction based on machine learning model:
According to steps S1, S2 and S3, a complete data set of all samples is obtained. The method uses a BP neural network (Back Propagation Neural Network) and a random forest (Random Forest) to map the sample features (both conventional and image features) to the 3-year recurrence probability (or the 5-year recurrence probability).
(1) Model training
1) BP neural network
a) A 5-layer network structure is selected, i.e. an input layer, hidden layer 1, hidden layer 2, hidden layer 3 and an output layer: L_in, L_y1, L_y2, L_y3, L_out;
b) The numbers of neurons in the layers are s_in, s_y1, s_y2, s_y3 and s_out respectively, where s_in = 33 and s_out = 1 correspond to the 33 feature values and the 1 output (the 3-year or 5-year recurrence probability), s_y1 takes values in [16, 30], s_y2 in [8, 12] and s_y3 in [3, 5];
c) Initial network weights: random values;
d) Activation function: the sigmoid function, calculated as
Figure BDA0003019802250000111
e) Error function: the sum of squared errors (SSE);
f) Learning rate: the value range is [0.1, 0.5].
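The network settings listed above can be made concrete with a small pure-Python sketch of the forward pass and the SSE error. Several details are assumptions not stated in the text: the hidden-layer sizes 20, 10 and 4 are merely picked from the stated ranges, the weights are drawn uniformly from [-1, 1], bias terms are included, and the standard logistic form 1/(1+e^(-x)) is used for the sigmoid. The back-propagation weight update itself is omitted.

```python
import math
import random

# Assumed sizes: 33 inputs, hidden sizes chosen inside the stated ranges, 1 output.
LAYER_SIZES = [33, 20, 10, 4, 1]


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(layer_sizes, seed=0):
    """Random initial weights and biases (assumed uniform in [-1, 1])."""
    rng = random.Random(seed)
    return [
        (
            [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [rng.uniform(-1, 1) for _ in range(n_out)],
        )
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(network, features):
    """Propagate one feature vector through all layers; returns the single output."""
    activations = features
    for weights, biases in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def sse(predictions, targets):
    """Sum of squared errors between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))
```

Since the output neuron also uses the sigmoid, every prediction lies strictly between 0 and 1, which matches the interpretation of the output as a recurrence probability.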
2) Random forest
The key parameters of the algorithm are set as follows:
the number of variables sampled at each iteration is set to 10;
the number of decision trees in the random forest is set to 3000.
(2) Model evaluation and determination
All test samples (with the corresponding three-year recurrence probability (Figure BDA0003019802250000112) and five-year recurrence probability (Figure BDA0003019802250000113)) are input into the trained neural network and random forest respectively, to obtain the predicted three-year recurrence probability (Figure BDA0003019802250000114) and the predicted five-year recurrence probability (Figure BDA0003019802250000115). The differences v3 and v5 between the predicted and true values for the three-year and five-year periods are calculated according to the following formulas:
Figure BDA0003019802250000116
Figure BDA0003019802250000117
The larger the values of the parameters v3 and v5, the larger the difference between the predicted and true values, i.e. the larger the error of the corresponding model (neural network or random forest) and the worse its effect.
For the parameters v_ANN and v_RF of all models, the minimum min{v_ANN, v_RF} is selected, and the corresponding model is the soft tissue sarcoma recurrence probability prediction model of the invention. The method can be extended to other fields, regions and samples.
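The selection rule min{v_ANN, v_RF} amounts to picking the model with the smallest discrepancy value. The sketch below assumes the discrepancy of each model has already been computed with the formulas above (which are available only as images here); the function name and the numeric values are hypothetical.

```python
def select_model(discrepancies: dict) -> str:
    """Return the name of the model with the smallest discrepancy value."""
    return min(discrepancies, key=discrepancies.get)


# Hypothetical discrepancy values for the two trained models.
chosen = select_model({"ANN": 0.42, "RF": 0.37})
```

With these made-up values the random forest would be chosen, since its discrepancy is the smaller of the two.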
The invention has the following effects: (1) Based on soft tissue sarcoma patient samples collected by a hospital, the three-year and five-year recurrence probability values of soft tissue sarcoma are calculated for each individual sample using the idea of subsampling, and these values are converted using the recurrence time data, so that an accurate and reliable recurrence probability is obtained for each individual soft tissue sarcoma patient. (2) Using the data set of soft tissue sarcoma patients, 33 typical features such as age, sex and magnetic resonance imaging (MRI) image features are extracted, a BP neural network and a random forest model are established to map the features to the recurrence probability values, and the final soft tissue sarcoma recurrence probability prediction model is determined according to the difference between the predicted and true values.
The invention can be widely applied in medical image processing.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A soft tissue sarcoma recurrence probability prediction method based on machine learning, characterized by comprising the following steps:
S1: calculating the recurrence probability based on sample data, namely obtaining the recurrence probability of each individual patient by collecting and converting information on soft tissue sarcoma patients, comprising the following steps:
S11: collecting samples {D1, D2, D3, ..., Dn} of soft tissue sarcoma patients, the suggested number of samples n being greater than or equal to 100;
S12: calculating the recurrence probability of each sample, comprising the following specific steps:
S121: for sample i, dividing out all subsamples (Figure FDA0003019802240000011) containing sample i, each subsample containing (Figure FDA0003019802240000012) samples;
S122: for each subsample (Figure FDA0003019802240000013), calculating the 3-year recurrence probability (Figure FDA0003019802240000014) and the 5-year recurrence probability (Figure FDA0003019802240000015) of sample i in the subsample, namely:
Figure FDA0003019802240000016
Figure FDA0003019802240000017
where n3-r and n5-r are, respectively, the 3-year recurrence rate and the 5-year recurrence rate in subsample (Figure FDA0003019802240000018);
S123: calculating the recurrence probability of sample i, namely:
Figure FDA0003019802240000019
Figure FDA00030198022400000110
S124: the three-year recurrence probability (Figure FDA00030198022400000111) and the five-year recurrence probability (Figure FDA00030198022400000112) of all samples {D1, D2, D3, ..., Dn} are thereby obtained;
S125: converting the three-year recurrence probability and the five-year recurrence probability respectively using the recurrence time t, namely:
Figure FDA00030198022400000113
Figure FDA00030198022400000114
where the recurrence time t is the number of months after the operation at which recurrence occurs, with t in the range [1, 60];
S2: feature screening for soft tissue sarcoma recurrence: screening the conventional features and image features in the sample data set;
S3: feature-based sample data processing: according to step S1 and step S2, the conventional features, image features, 3-year recurrence probability and 5-year recurrence probability of all samples in the acquired sample set {D1, D2, D3, ..., Dn} are obtained, and the conventional features and image features are processed, comprising the following steps:
S31: conventional feature processing;
S32: image feature processing: all image features (Figure FDA0003019802240000021) of the samples {D1, D2, D3, ..., Dn} are standardized, each image feature (Figure FDA0003019802240000022) having its feature value (Figure FDA0003019802240000023) normalized, namely:
Figure FDA0003019802240000024
S32: data set partitioning: dividing the data set into a test set and a training set, wherein the training set is used to train the machine learning algorithm and the test set is used to test the quality of the machine learning algorithm; the data set is sorted in descending order of the 3-year recurrence probability or the 5-year recurrence probability, samples are selected as the test set according to a fixed rule on their sequence numbers, and the remaining data are used as the training set;
S4: recurrence probability prediction based on a machine learning model: according to steps S1, S2 and S3, a complete data set of all samples is obtained, and the mapping from the sample features to the recurrence probability is realized by a BP neural network and a random forest, comprising the following steps:
S41: model training, covering a BP neural network and a random forest, wherein:
S411: BP neural network;
S412: random forest;
S42: model evaluation and determination: the test samples, with the corresponding three-year recurrence probability (Figure FDA0003019802240000025) and five-year recurrence probability (Figure FDA0003019802240000026), are input into the trained neural network and random forest respectively to obtain the predicted three-year recurrence probability (Figure FDA0003019802240000027) and the predicted five-year recurrence probability (Figure FDA0003019802240000028); the differences v3 and v5 between the predicted and true values for the three-year and five-year periods are calculated, namely:
Figure FDA0003019802240000029
Figure FDA00030198022400000210
the larger the values of the parameters v3 and v5, the larger the difference between the predicted and true values, i.e. the larger the error of the corresponding model and the worse its effect;
for the parameters v_ANN and v_RF of all models, the minimum min{v_ANN, v_RF} is selected, and the corresponding model is the soft tissue sarcoma recurrence probability prediction model.
2. The method according to claim 1, wherein the step S11 of collecting sample information of soft tissue sarcoma patient includes: personal information, pathological characteristics, image characteristics, whether the patient relapses in 3 years after the operation and whether the patient relapses in 5 years after the operation.
3. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein the characteristics of recurrence of soft tissue sarcoma in step S2 include:
S21: conventional features, including sex, age and time after the operation;
S22: image features, extracted from MRI images obtained by magnetic resonance equipment.
4. The method as claimed in claim 3, wherein the MRI images obtained by MRI apparatus in step S22 are divided into T1 weighted imaging and T2 weighted imaging according to different imaging modes.
5. The method of claim 3, wherein in step S22, the T1-weighted imaging includes the following cases:
case one: in wavelet-low frequency subband imaging mode:
(a) large-area high-gray-level factor characteristics of the gray-level area matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix;
case two: in wavelet-low high frequency sub-band imaging mode:
(a) roughness characteristics of adjacent gray level difference matrices;
(b) total energy characteristics of the first order statistics;
case three: in wavelet-high-low-frequency subband imaging mode:
(a) a small dependence low gray level factor characteristic of the gray level correlation matrix;
case four: in wavelet-high-low-high-frequency sub-band imaging mode:
(a) large-area high-gray-level factor characteristics of the gray-level area matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix;
case five: under the three-dimensional imaging mode of the 5mm Laplacian:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) the Mazis correlation coefficient characteristics of the gray level co-occurrence matrix;
(c) a kurtosis characteristic of the first order statistic;
case six: under a 15mm Laplacian three-dimensional imaging mode:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) a kurtosis characteristic of the first order statistic;
case seven: in the original imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) the large dependence high gray level factor characteristic of the gray difference matrix;
(c) large area high gray level factor characteristics of the gray area matrix.
6. The method of claim 3, wherein in step S22, the T2-weighted imaging includes the following cases:
case one: in the original imaging mode:
(a) elongation characteristics of the shape;
(b) the inverse variance characteristic of the gray level co-occurrence matrix;
(c) the large dependence high gray level factor characteristic of the gray difference matrix;
case two: in wavelet-high frequency sub-band imaging mode:
(a) contrast characteristics of adjacent gray level difference matrices;
(b) non-uniform normalization of gray levels of the gray level area matrix;
(c) long-run high-gray-scale factor characteristics of the gray-scale run matrix;
(d) mean feature of the first order statistics;
case three: under a 15mm Laplacian three-dimensional imaging mode:
(a) a 90 quantile value feature of the first order statistic;
(b) a kurtosis characteristic of the first order statistic;
case four: under the three-dimensional imaging mode of the 5mm Laplacian:
(a) the dependency unevenness normalization characteristic of the gray difference matrix;
(b) the Mazis correlation coefficient characteristics of the gray level co-occurrence matrix;
case five: in wavelet-high-low-high-frequency sub-band imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) clustering shadow features of the gray level co-occurrence matrix;
case six: in wavelet-low frequency subband imaging mode:
(a) the inverse variance characteristic of the gray level co-occurrence matrix;
(b) a small area high gray level factor characteristic of the gray level area matrix.
7. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein the conventional feature processing in step S31 comprises the following:
a) sex: male is coded as 1 and female as 0;
b) age: 0.1 for 0-10 years old, 0.2 for 10-20 years old, 0.3 for 20-30 years old, 0.4 for 30-40 years old, 0.5 for 40-50 years old, 0.6 for 50-60 years old, 0.7 for 60-70 years old, 0.8 for 70-80 years old, 0.9 for 80-90 years old, and 1 for over 90 years old;
c) time after the operation: the actual number of months m divided by 60.
8. The method of claim 7, wherein in step S32, the data set division selects samples by sequence number in an arithmetic progression, i.e. the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, 24th, 27th, 30th, ... samples, as the test set, and the remaining data as the training set.
9. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein in step S411, the BP neural network comprises the following contents:
a) selecting a 5-layer network structure, namely an input layer, hidden layer 1, hidden layer 2, hidden layer 3 and an output layer: L_in, L_y1, L_y2, L_y3, L_out;
b) number of neurons in the 5 layers: s_in, s_y1, s_y2, s_y3 and s_out respectively, wherein s_y1 takes values in [16, 30], s_y2 in [8, 12] and s_y3 in [3, 5];
c) initial network weights: random values;
d) activation function: the sigmoid function, calculated as
Figure FDA0003019802240000051
e) error function: the sum of squared errors (SSE);
f) learning rate: the value range is [0.1, 0.5].
10. The method of predicting probability of recurrence of soft tissue sarcoma based on machine learning of claim 1, wherein in step S412, the key parameters involved in random forest are set as follows:
the number of variables sampled at each iteration is set to 10;
the number of decision trees contained in the random forest was set to 3000.
CN202110399327.0A 2021-04-14 2021-04-14 Soft tissue sarcoma recurrence probability prediction method based on machine learning Pending CN113160969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399327.0A CN113160969A (en) 2021-04-14 2021-04-14 Soft tissue sarcoma recurrence probability prediction method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110399327.0A CN113160969A (en) 2021-04-14 2021-04-14 Soft tissue sarcoma recurrence probability prediction method based on machine learning

Publications (1)

Publication Number Publication Date
CN113160969A true CN113160969A (en) 2021-07-23

Family

ID=76890329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399327.0A Pending CN113160969A (en) 2021-04-14 2021-04-14 Soft tissue sarcoma recurrence probability prediction method based on machine learning

Country Status (1)

Country Link
CN (1) CN113160969A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105368925A (en) * 2014-08-23 2016-03-02 陈铁华 Biomarker used for prognosis of lung cancer
CN109599181A (en) * 2019-01-09 2019-04-09 中国医学科学院肿瘤医院 A kind of Prediction of survival system and prediction technique being directed to T3-LARC patient before the treatment
CN110660481A (en) * 2019-09-27 2020-01-07 颐保医疗科技(上海)有限公司 Artificial intelligence technology-based primary liver cancer recurrence prediction method
CN112489035A (en) * 2020-12-14 2021-03-12 青岛大学附属医院 Soft tissue sarcoma grade judgment method based on machine learning
CN112561869A (en) * 2020-12-09 2021-03-26 深圳大学 Pancreatic neuroendocrine tumor postoperative recurrence risk prediction method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210723