CN102760197A - Prediction method for spectroscopic test data of cancer patients by partial least square method based on Matlab - Google Patents
- Publication number
- CN102760197A (application CN2011101047340A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for predicting spectroscopic test data of the serum of cancer patients using partial least squares regression in Matlab. The method comprises the steps of: developing a tool program for batch input of the spectroscopic test data; developing a tool program that automatically converts the data to an ASCII (American Standard Code for Information Interchange) file; establishing a data optimization model; establishing a partial least squares regression model; and performing actual testing and prediction. Establishing the data optimization model comprises calculating a standard deviation and performing a t-distribution test. Establishing the partial least squares regression model comprises selecting a special value and then predicting and testing that value.
Description
Technical field
The present invention is a statistical prediction method. Specifically, it uses Matlab to make predictions from the ultraviolet detection data of cancer patients. The method belongs to the field of biostatistics.
Background knowledge
At present, the most mature methods for the early diagnosis of cancer are imaging methods, including X-ray films, CT, MR, angiography and interventional radiology. Imaging requires the tumor tissue to reach a certain size before it can be detected effectively; it is a form of detection at the tissue level. Tumor marker detection, used as an auxiliary examination, works at the molecular level, but because tumor markers do not correspond uniquely to a particular tumor, the specificity of this method is low. A more effective method that achieves early diagnosis of cancer at the molecular level is therefore needed, and the serum spectroscopy examination provided by the present invention is such a method. However, because the molecular composition of human serum is complex and information must be extracted from many samples, a good algorithmic model is needed for data analysis and processing in order to make predictions for cancer patients. In a system with multicollinearity of this kind, the selected indices are related in unforeseeable ways, which directly introduces considerable noise and uncertainty into the system. By comparing ordinary multiple regression algorithms with partial least squares (PLS), we found that PLS can select, from all the indices, those most closely related to the dependent variable and, through dimensionality reduction, reduce noise and overcome the adverse effects of multicollinearity. These properties make PLS superior to other common regression algorithms when dealing with multicollinearity and with too few sample points. We therefore use PLS to analyze and process the serum spectra of cancer patients, in order to achieve early diagnosis of cancer at the molecular level.
Summary of the invention
The present invention uses a Matlab-based partial least squares regression method to make predictions from the spectroscopic detection data of cancer patients. The workflow is shown in Fig. 1.
The present invention first collects venous blood samples from normal persons and cancer patients and centrifuges them. The serum obtained after centrifugation is then diluted with buffer solutions of different pH, and a spectroscopic instrument is used to measure the serum samples at each pH to obtain their spectral detection data.
The present invention inputs the spectroscopic data into Matlab in batches and converts them to ASCII files. The data obtained are preprocessed and further optimized, and finally a prediction model is built by partial least squares regression and used to predict the data.
The spectroscopic data obtained are preprocessed as follows:
1. A loop-based tool is built for selecting the data once and inputting them in batches;
2. A tool is built that automatically identifies the ultraviolet detection data in a file, removes superfluous text, and creates an ASCII document;
3. A second-derivative spectrum model is built and the particular values on the spectrum are selected automatically.
The preprocessed results are further optimized as follows:
1. The standard variance and a t-test are computed for the data, and a test report is output for judging the stability of the data;
The standard variance is given by formula (1):
$$\text{standard variance} = \frac{\sum (X_n - \bar{X})^2}{n} \qquad (1)$$
where $X_n$ is the particular value of each sample and $\bar{X}$ is the mean of all samples.
Let the particular values of the samples be $[X_1, X_2, X_3, \ldots, X_n]$ with mean $\bar{X}$; substituting them into formula (1) gives the standard variance. The standard variance is used to judge how much the data within a sample set differ: the larger the value, the greater the fluctuation and the less stable the data;
2. The t-distribution hypothesis test is based on μ (the population mean) and σ (the population standard deviation), which determine the position and shape of the normal distribution. When samples of a fixed size n are drawn from a normally distributed population (in typical spectral detection experiments n ≤ 200), the sample mean still follows a normal distribution, namely N(μ, σ²/n). Because σ is usually unknown in practice, the sample standard deviation is commonly used as its estimate in the test. In other words, the t-distribution test checks whether the difference between data is significant and how likely a small-probability event is to occur, so the t value gives an intuitive indication of the stability of the data.
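The two stability checks described above can be sketched in a few lines. The following is a minimal illustration in Python rather than Matlab, assuming that the "standard variance" is exactly formula (1) (division by n) and that the t-test is a one-sample test against a reference mean; the function names and the sample values are hypothetical and do not come from the patent.

```python
import math

def standard_variance(values):
    """Formula (1): mean squared deviation from the sample mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def one_sample_t(values, mu0):
    """t statistic for H0: population mean == mu0.

    Sigma is unknown, so it is estimated by the sample standard
    deviation (n - 1 in the denominator), as the text describes.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical particular values for one group of samples.
ratios = [1.0, 2.0, 3.0, 4.0, 5.0]
print(standard_variance(ratios))   # 2.0
print(one_sample_t(ratios, 3.0))   # 0.0
```

A large standard variance or a large absolute t value would both suggest that the group is not stable enough to serve as a training set; the decision thresholds are not specified in the text and would have to be chosen separately.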
The present invention builds the partial least squares regression prediction model and predicts the data as follows:
The model built by partial least squares regression is a bilinear model. It comprises outer blocks (an independent X block and an independent Y block) and an inner block between the two (the relation linking X and Y). The model used in the present invention adjusts the latent variables of X so that their covariance with Y is maximized; that is, data whose eigenvalues are close to zero are deleted. The program is implemented as follows:
[10] The regression model is $Y = XB$, where $B = W(P^{T}W)^{-1}Q^{T}$ ($W$ is the weight matrix, $P$ is the loading matrix of $X$, and $Q$ is the loading matrix of $Y$);
[20] $Y = UQ^{T} + F = \sum_{a} u_a q_a^{T} + F$ ($U$ is the score matrix of $Y$, $u_a$ is a score vector of $Y$, $q_a$ is a loading vector, and $F$ is the residual);
[30] $X = TP^{T} + E = \sum_{a} t_a p_a^{T} + E$ ($T$ is the score matrix of $X$, $t_a$ is a score vector of $X$, $p_a$ is a loading vector, and $E$ is the residual);
[40] Extract the latent vectors of $Y$ and $X$. At dimension $l = 0$, $X = X_{\text{original}} - \bar{x}$ and $Y = Y_{\text{original}} - \bar{y}$ ($\bar{x}$ and $\bar{y}$ are the means);
[50] Following the PCA approach, estimate iteratively for dimensions $l = 1$ to $l = d$:
[60] Take the first column of $Y$ as the initial score vector $u$, i.e. $u = y_1$;
[70] Compute the weight of $X$: $w^{T} = u^{T}X / (u^{T}u)$;
[80] Normalize the weight: $w^{T} = w^{T} / (w^{T}w)^{1/2}$;
[90] Estimate the score vector of the $X$ matrix: $t = Xw$;
[100] Compute the loading vector of $Y$: $q^{T} = t^{T}Y / (t^{T}t)$;
[110] Compute the score vector of the $Y$ matrix: $u = Yq / (q^{T}q)$. If $\lVert u_{\text{new}} - u_{\text{old}} \rVert < \lVert u_{\text{new}} \rVert \cdot \text{threshold}$, the iteration has converged and the loop stops; the threshold is determined by the precision of the computer;
[120] Compute the inner relation coefficient: $b = u^{T}t / (t^{T}t)$;
[130] Compute the loading vector of $X$: $p^{T} = t^{T}X / (t^{T}t)$;
[140] Compute the residuals of the $X$ and $Y$ matrices: $E = X - tp^{T}$, $F = Y - uq^{T}$. Compute the standard variance $R_{ev}$; if it is greater than the expected accuracy, the optimal dimension has been reached and the final $B$ is obtained.
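Steps [60] to [140] can be sketched for a single latent dimension as follows. This is an illustrative Python sketch, not the patent's Matlab program. It assumes the usual NIPALS convention in which q is the loading vector of Y and p is the loading vector of X (consistent with step [10]), and it reads the convergence test as a relative tolerance on the change in u. The names `nipals_component`, `center`, `tol` and `max_iter` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(M, j):
    return [row[j] for row in M]

def center(M):
    """Step [40]: subtract the column means."""
    means = [sum(column(M, j)) / len(M) for j in range(len(M[0]))]
    return [[v - m for v, m in zip(row, means)] for row in M]

def nipals_component(X, Y, tol=1e-10, max_iter=500):
    """Extract one latent dimension from centered X (n x m) and Y (n x k)."""
    u = column(Y, 0)                                             # [60]
    for _ in range(max_iter):
        uu = dot(u, u)
        w = [dot(u, column(X, j)) / uu for j in range(len(X[0]))]  # [70]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]                                # [80]
        t = [dot(row, w) for row in X]                           # [90]
        tt = dot(t, t)
        q = [dot(t, column(Y, j)) / tt for j in range(len(Y[0]))]  # [100]
        qq = dot(q, q)
        u_new = [dot(row, q) / qq for row in Y]                  # [110]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(u_new, u)))
        converged = change < tol * math.sqrt(dot(u_new, u_new))
        u = u_new
        if converged:
            break
    b = dot(u, t) / tt                                           # [120]
    p = [dot(t, column(X, j)) / tt for j in range(len(X[0]))]    # [130]
    # [140] residuals, written exactly as in the text above
    E = [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(X, t)]
    F = [[y - ui * qj for y, qj in zip(row, q)] for row, ui in zip(Y, u)]
    return {"w": w, "t": t, "q": q, "u": u, "b": b, "p": p, "E": E, "F": F}

# Hypothetical rank-one data: one latent dimension explains everything.
X = center([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y = center([[2.0], [4.0], [6.0]])
result = nipals_component(X, Y)
```

For further dimensions the same routine would be applied to the residuals E and F, and the collected w, p and q vectors would form the columns of W, P and Q used in step [10]. In Matlab itself, the `plsregress` function of the Statistics and Machine Learning Toolbox provides a ready-made PLS implementation.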
Description of drawings
Fig. 1: Flowchart of the Matlab-based partial least squares regression method of the present invention for predicting spectroscopic detection data of cancer patients.
Embodiment
The invention described above is realized by the following techniques.
The process of handling the ultraviolet spectral detection data in Matlab, optimizing the data, and using the data for partial least squares prediction is as follows:
Because different ultraviolet detection instruments do not share a unified output format, the text in the files must be removed and the data converted to an ASCII file. The main problems encountered during conversion are the non-data marker 'NaN' and the commas in numbers of one thousand and above, such as '1,000' and '2,000'. 'NaN' causes the least squares computation to report an error, and a comma causes the thousands digit and the digits after it to be read as separate numbers; for example, '2,300' would be read as the two numbers '2' and '300'. Both are corrected as the data are input: 'NaN' is replaced with '0' and the commas are removed. The output is an ASCII file with the custom suffix .output.
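The cleanup rules in this paragraph can be illustrated with a short sketch. It is written in Python rather than Matlab and assumes that the instrument export is a whitespace-delimited text file (the patent does not state the delimiter), so that any comma between digits is a thousands separator. The function names and the sample text are hypothetical.

```python
import re
from pathlib import Path

# A comma that sits between a digit and a group of exactly three digits.
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")

def clean_rows(raw_text):
    """Return numeric rows only: drop text lines, fix 'NaN' and '2,300'."""
    rows = []
    for line in raw_text.splitlines():
        line = THOUSANDS_COMMA.sub("", line)      # '2,300' -> '2300'
        line = re.sub(r"\bNaN\b", "0", line)      # 'NaN'   -> '0'
        try:
            row = [float(token) for token in line.split()]
        except ValueError:
            continue                              # header or comment line
        if row:
            rows.append(row)
    return rows

def to_output_text(rows):
    """Serialize the rows as plain ASCII, one row per line."""
    return "\n".join(" ".join(repr(v) for v in row) for row in rows) + "\n"

def write_output(source_path, rows):
    """Write next to the source file, using the custom .output suffix."""
    target = Path(source_path).with_suffix(".output")
    target.write_text(to_output_text(rows), encoding="ascii")
    return target

sample = "Wavelength(nm) Abs\n450 NaN\n2,300 1.5\n"
print(clean_rows(sample))   # [[450.0, 0.0], [2300.0, 1.5]]
```

If the real export were comma-delimited, the thousands-separator rule would merge adjacent fields and a different parsing strategy would be needed.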
The spectroscopic detection data are optimized by transforming them into second-derivative spectra. Each sample is divided into different groups according to the pH value at which it was measured. The characteristic wavelengths of the spectroscopic detection data, such as 450 nm, 280 nm, 260 nm, 217 nm and 197 nm, are characteristic wavelengths of substances in biological macromolecules such as heme, protein, nucleic acid and protein β-sheet. All data are therefore uniformly reduced to the ratios 410/450, 280/260 and 217/197, which serve as the data for the next optimization step.
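A minimal sketch of this optimization step is given below, in Python rather than Matlab. It assumes a uniformly spaced wavelength grid and uses a central finite difference for the second derivative; the patent does not say which differentiation scheme is used, nor whether the ratios are taken on the raw or on the second-derivative spectrum, so both choices are left to the caller. The function names and the numbers in the example are hypothetical.

```python
def second_derivative(values, step=1.0):
    """Central-difference second derivative on a uniformly spaced grid.

    The result has two fewer points than the input (no end points).
    """
    return [
        (values[i + 1] - 2.0 * values[i] + values[i - 1]) / (step * step)
        for i in range(1, len(values) - 1)
    ]

# Wavelength pairs (nm) named in the text.
RATIO_PAIRS = [(410, 450), (280, 260), (217, 197)]

def wavelength_ratios(spectrum, pairs=RATIO_PAIRS):
    """spectrum maps wavelength in nm to a value; returns {'410/450': ...}."""
    return {f"{a}/{b}": spectrum[a] / spectrum[b] for a, b in pairs}

# Hypothetical values, chosen only to make the arithmetic easy to follow.
spectrum = {410: 2.0, 450: 4.0, 280: 3.0, 260: 1.5, 217: 1.0, 197: 0.5}
print(wavelength_ratios(spectrum))
# {'410/450': 0.5, '280/260': 2.0, '217/197': 2.0}
print(second_derivative([0.0, 1.0, 4.0, 9.0, 16.0]))   # [2.0, 2.0, 2.0]
```

One such set of three ratios would be computed per sample and per pH group, giving the particular values that are passed on to the stability checks.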
The data from the previous step are again divided into groups: the serum ultraviolet detection data of normal persons form one group, and each type of cancer forms a separate group. Because the serum ultraviolet detection data of normal persons serve as the training set, the stability of each of its samples must be ensured. The present invention checks this with two methods. First, the batch data are substituted into the standard variance formula
$$s^2 = \frac{1}{n}\left[(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2\right]$$
where $s^2$ is the standard variance, $x_n$ is the sample data, and $m$ is the mean. Second, the t-distribution is used to test the significance of the differences between samples.
According to the quality of the differences, the best particular value is selected and substituted into the PLS algorithm for training. The X matrix is the numbering of the ultraviolet detection data and the Y matrix is the selected particular value; because the pH values differ, each number has three sets of values. Running the software yields three values: the fitting standard deviation, the cross-validation standard deviation and the prediction standard deviation. From these three values the fit quality of the training set can be analyzed, and the difference between the test set and the training set can be used to predict the difference between the serum characteristics of various cancers and those of normal human serum.
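The three reported values can be illustrated as follows. The patent does not define them precisely, so this Python sketch assumes each is a root-mean-square error: on the training set (fitting), under leave-one-out cross-validation, and on a separate test set (prediction). To keep the example short, a one-variable least-squares line stands in for the PLS model; it is a placeholder, not the method of the patent, and all names and numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def fit_line(xs, ys):
    """Stand-in model (NOT PLS): least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

def predict(model, xs):
    a, b = model
    return [a + b * x for x in xs]

def three_deviations(train_x, train_y, test_x, test_y):
    """Return (fitting, cross-validation, prediction) standard deviations."""
    model = fit_line(train_x, train_y)
    fit_sd = rmse(train_y, predict(model, train_x))
    held_out = []
    for i in range(len(train_x)):                  # leave-one-out loop
        sub_model = fit_line(train_x[:i] + train_x[i + 1:],
                             train_y[:i] + train_y[i + 1:])
        held_out.append(predict(sub_model, [train_x[i]])[0])
    cv_sd = rmse(train_y, held_out)
    pred_sd = rmse(test_y, predict(model, test_x))
    return fit_sd, cv_sd, pred_sd

# Hypothetical data: the test set is offset by 1 from the training trend.
fit_sd, cv_sd, pred_sd = three_deviations(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [5.0, 6.0], [11.0, 13.0]
)
```

In this reading, small fitting and cross-validation values indicate a stable training set, while a prediction value clearly larger than both indicates that the test group differs from the training group.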
Claims (10)
1. A method for predicting spectroscopic detection data of cancer patients by Matlab-based partial least squares (PLS) regression, characterized in that venous blood samples are collected from normal persons and cancer patients, the blood samples are processed and spectrally measured, and the detection data obtained are optimized and predicted by partial least squares regression on the Matlab platform.
2. The method according to claim 1, wherein the collected blood samples are centrifuged.
3. The method according to claim 2, wherein the serum obtained after centrifugation is diluted with buffer solutions of pH 4.00, pH 6.86 and pH 9.18.
4. The method according to claim 3, wherein the spectroscopic data of the serum are collected with an ultraviolet-visible spectrometer.
5. The method according to claim 4, wherein a data input tool is developed for the spectra obtained, the development of the data input tool comprising identifying and removing non-numerical information in the TXT file and correcting erroneous information such as "NaN" and comma-separated numbers.
6. The method according to claim 5, wherein the input data are optimized, the optimization comprising outputting the second-derivative spectrum and performing the standard variance calculation and the t-distribution calculation.
7. The method according to claim 1, wherein the PLS regression prediction of the data comprises establishing a program for selecting the particular value, characterized in that the prediction method outputs an ASCII file of the serum ultraviolet detection data with the suffix .output.
8. The method according to claim 7, characterized in that the prediction method outputs data-optimization result files, comprising an image file of the second-derivative spectrum and a text file containing the t-distribution test's judgment of whether the difference is significant, the confidence interval, the probability of the small-probability event, and the standard variance.
9. The method according to claim 8, characterized in that the particular values selected are the ratios between the values at wavelengths of 240 nm, 260 nm, 280 nm, 410 nm and 450 nm.
10. The prediction method according to claim 9, characterized in that, for the serum dilutions at pH 4.00, pH 6.86 and pH 9.18, the ratios between the values at wavelengths of 240 nm, 260 nm, 280 nm, 410 nm and 450 nm are output in three text files, each containing the numbering of the serum data and the ratio information; the output files are named pH4.0.txt, pH6.86.txt and pH9.18.txt.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101047340A CN102760197A (en) | 2011-04-26 | 2011-04-26 | Prediction method for spectroscopic test data of cancer patients by partial least square method based on Matlab |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102760197A true CN102760197A (en) | 2012-10-31 |
Family
ID=47054654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011101047340A Pending CN102760197A (en) | 2011-04-26 | 2011-04-26 | Prediction method for spectroscopic test data of cancer patients by partial least square method based on Matlab |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102760197A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121031 |