CN102760197A - Prediction method for spectroscopic test data of cancer patients by partial least square method based on Matlab

Publication number
CN102760197A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101047340A
Other languages
Chinese (zh)
Inventor
曾红娟
陈启宏
王鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-04-26
Filing date
2011-04-26
Publication date
2012-10-31

Abstract

The invention discloses a method for predicting spectroscopic test data of the serum of cancer patients using partial least squares (PLS) regression in Matlab. The method comprises: developing a batch input tool for the spectroscopic test data; developing a tool that automatically converts the data to ASCII (American Standard Code for Information Interchange) files; establishing a data optimization model; establishing a partial least squares regression model; and performing actual testing and prediction. Establishing the data optimization model comprises calculating a standard deviation and performing a t-distribution test. Establishing the partial least squares regression model comprises selecting characteristic values and predicting and testing those characteristic values.

Description

Prediction of spectroscopic detection data of cancer patients by partial least squares (PLS) based on Matlab
Technical field
The present invention is a statistical prediction method, specifically a method for predicting ultraviolet detection data of cancer patients using Matlab. The method belongs to the field of biometrics.
Background
At present, the most mature methods for early diagnosis of cancer are imaging methods, including X-ray film, CT, MR, angiography and interventional radiology. Imaging can detect a tumor effectively only once the tumor tissue has reached a certain size, so it is detection at the tissue level. Tumor marker detection is used as a complementary examination. Although it operates at the molecular level, a given tumor marker does not correspond uniquely to one tumor, so the specificity of this method is low. A more effective method that achieves early diagnosis of cancer at the molecular level is therefore needed, and the serum spectroscopic examination provided by the present invention is such a method. However, because the molecular composition of human serum is complex and information must be extracted from many samples, a good algorithmic model is needed to analyze and process the data in order to make predictions for cancer patients. In a system with multicollinearity of this kind, the selected indicators are related in unforeseen ways, which directly introduces substantial noise and uncertainty. By comparing ordinary multiple regression with PLS, we found that PLS can pick out, from all indicators, those most closely related to the dependent variable and, through dimensionality reduction, reduce noise and overcome the adverse effects of multicollinearity. These properties make PLS superior to other common regression algorithms when dealing with multicollinearity and with too few sample points. For this reason, we use PLS to analyze and process the serum spectra of cancer patients, in order to achieve early diagnosis of cancer at the molecular level.
Summary of the invention
The present invention uses a Matlab-based partial least squares regression method to predict spectroscopic detection data of cancer patients. The workflow is shown in Fig. 1.
First, venous blood samples are collected from healthy subjects and cancer patients and centrifuged, and the serum obtained after centrifugation is diluted with buffer solutions of different pH. A spectroscopic instrument is then used to measure the serum samples at each pH and obtain their spectral detection data.
The spectroscopic data are imported into Matlab in batches and converted to ASCII files. The data obtained are preprocessed and further optimized, and finally a prediction model is established by partial least squares regression and used to predict the data.
The spectroscopic data are preprocessed as follows:
1. Build a loop-based tool that selects the data once and imports them in batches;
2. Build a tool that automatically identifies the ultraviolet detection data in each file, removes extraneous text, and creates an ASCII file;
3. Build the second-derivative spectrum model and automatically pick the characteristic values on the spectrum.
The preprocessed results are further optimized as follows:
1. The variance is computed and a t-test is performed on the data, and a report is output for judging the stability of the data.
The variance is given by formula (1):
$s^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ (1)
where $X_i$ is the characteristic value of the $i$-th sample, $\bar{X}$ is the mean over all samples, and $s$ is the standard deviation.
Let the characteristic values of the samples be $[X_1, X_2, X_3, \ldots, X_n]$ with mean $\bar{X}$. Substituting them into formula (1) gives the variance, which measures how much the values in a sample set differ: the larger the variance, the larger the fluctuation and the less stable the data.
2. The t-distribution hypothesis test is built on μ (the population mean) and σ (the population standard deviation), which determine the position and shape of the normal distribution. If samples of fixed size n are drawn from a normally distributed population (in spectral detection experiments, generally n ≤ 200), the sample means still follow a normal distribution, namely $N(\mu, \sigma^2/n)$. Because σ is usually unknown in practice, the sample standard deviation is used as its estimate in the test. The t-test therefore checks whether the difference between data is significant, that is, whether a small-probability event has occurred, so the t value gives a direct indication of the stability of the data.
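The two stability checks above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's Matlab program: the function names and the sample values are hypothetical, and the sketch assumes a one-sample t-test against a reference mean `mu0`, since the text does not say which form of the t-test is used.

```python
import math

def population_variance(values):
    """Formula (1): mean squared deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def one_sample_t(values, mu0):
    """t statistic for H0: population mean == mu0.

    The unknown sigma is estimated by the sample standard deviation
    (n - 1 in the denominator), as described in item 2.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical characteristic values for five serum samples.
ratios = [1.0, 2.0, 3.0, 4.0, 5.0]
variance = population_variance(ratios)
t_stat = one_sample_t(ratios, mu0=2.0)
```

A larger `variance` indicates a less stable group. Turning `t_stat` into a significance decision additionally requires the critical value of the t distribution with n - 1 degrees of freedom, which is left out of this sketch.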
The prediction model is established by partial least squares regression, and the data are predicted, as follows:
The model built by partial least squares regression is a bilinear model. It comprises the outer blocks (an X block and a Y block, each modeled independently) and the inner block between them (the relation linking X and Y). The model used in the present invention adjusts the latent variables of X so that their covariance with Y is maximized, that is, data associated with near-zero eigenvalues are discarded. The procedure is implemented as follows:
[10] Let the regression model be $Y = XB$, where $B = W(P^{T}W)^{-1}Q^{T}$ ($W$ is the weight matrix, $P$ is the loading matrix of X, and $Q$ is the loading matrix of Y);
[20] $Y = UQ^{T} + F = \sum_a u_a q_a^{T} + F$ ($U$ is the score matrix of Y, $u_a$ is a score vector of Y, $q_a$ is a loading vector, and $F$ is the residual);
[30] $X = TP^{T} + E = \sum_a t_a p_a^{T} + E$ ($T$ is the score matrix of X, $t_a$ is a score vector of X, $p_a$ is a loading vector, and $E$ is the residual);
[40] Extract the latent vectors of Y and X. At dimension $l = 0$, $X = X_{original} - \bar{x}$ ($\bar{x}$ is the mean) and $Y = Y_{original} - \bar{y}$ ($\bar{y}$ is the mean);
[50] Using PCA, iterate the estimation over dimensions $l = 1$ to $l = d$:
[60] Take the first column of Y as the initial score vector $u$, i.e. $u = y_1$;
[70] Compute the weights of X: $w^{T} = u^{T}X / (u^{T}u)$;
[80] Normalize the weights: $w^{T} = w^{T} / (w^{T}w)^{1/2}$;
[90] Estimate the score vector of the X matrix: $t = Xw$;
[100] Compute the loading vector of Y: $q^{T} = t^{T}Y / (t^{T}t)$;
[110] Compute the score vector of the Y matrix: $u = Yq / (q^{T}q)$. If $\|u_{new} - u_{old}\| < \varepsilon\,\|u_{new}\|$, the iteration has converged and the loop stops; the threshold $\varepsilon$ is determined by the precision of the computer;
[120] Compute the inner relation coefficient: $b = u^{T}t / (t^{T}t)$;
[130] Compute the loading vector of X: $p^{T} = t^{T}X / (t^{T}t)$;
[140] Compute the residuals of the X and Y matrices: $E = X - tp^{T}$, $F = Y - uq^{T}$. Compute the variance measure $R_{ev}$; if it exceeds the expected accuracy, the optimal dimension has been reached and the final $B$ is obtained.
Description of drawings
Fig. 1: Flow chart of the Matlab-based partial least squares regression method of the present invention for predicting spectroscopic detection data of cancer patients.
Embodiment
To realize the above invention, the following techniques are used:
The ultraviolet spectral detection data are processed in Matlab, optimized, and used for partial least squares prediction as follows:
Because different ultraviolet detection instruments do not share a unified output format, the text in each file must be removed and the file converted to ASCII. The main problems encountered during conversion are the non-data marker 'NaN' and the comma used as a thousands separator, as in '1,000' and '2,000'. 'NaN' causes the partial least squares computation to report an error, and the comma causes the digits before and after it to be read as separate numbers; for example, '2,300' is read as the two numbers '2' and '300'. Both are corrected as the data are imported: 'NaN' is replaced with '0' and the commas are removed. The output is an ASCII file with the custom suffix .output.
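The clean-up rules in this paragraph can be illustrated with a short Python sketch. It is an assumption-laden example rather than the patent's tool: the function names are hypothetical, the sample lines are invented, and the columns are assumed to be separated by whitespace or tabs so that a comma can only be a thousands separator.

```python
def parse_value(token):
    """Convert one text token to a float, or return None if it is not a number."""
    cleaned = token.strip().replace(",", "")   # "2,300" becomes "2300"
    if cleaned.lower() == "nan":
        return 0.0                              # the text replaces 'NaN' with 0
    try:
        return float(cleaned)
    except ValueError:
        return None                             # label or other stray text

def clean_lines(lines):
    """Keep only rows in which every whitespace-separated token is numeric."""
    rows = []
    for line in lines:
        values = [parse_value(tok) for tok in line.split()]
        if values and all(v is not None for v in values):
            rows.append(values)
    return rows

def to_ascii_text(rows):
    """Render the cleaned rows as tab-separated ASCII text, one row per line."""
    return "\n".join("\t".join(repr(v) for v in row) for row in rows)

# Hypothetical instrument export with header text, a thousands comma and a NaN.
raw = [
    "Sample: serum-01 (hypothetical header)",
    "Wavelength\tAbs",
    "280\t2,300",
    "260\tNaN",
    "450\t0.125",
]
rows = clean_lines(raw)
ascii_text = to_ascii_text(rows)
```

The resulting `ascii_text` could then be written to a file whose name ends in `.output`, matching the suffix named in the text.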
Optimization of the spectroscopic detection data converts the data to second-derivative spectra. The samples are divided into groups according to the pH at which they were measured. The characteristic wavelengths in the spectroscopic detection data, such as 450 nm, 280 nm, 260 nm, 217 nm and 197 nm, are characteristic of biomacromolecular components such as heme, protein, nucleic acid and protein β-sheet. Therefore the ratios 410/450, 280/260 and 217/197 are taken uniformly for all data and used as the data for the next optimization step.
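A minimal Python sketch of the two operations named here, a second-derivative spectrum and the three wavelength ratios, is given below. The function names and the numbers are hypothetical. The sketch assumes an evenly spaced wavelength grid and a central-difference derivative, and it leaves open whether the ratios are formed from raw or second-derivative values, which the text does not state.

```python
def second_derivative(signal, step):
    """Central-difference second derivative on an evenly spaced wavelength grid.

    Returns len(signal) - 2 values; the two end points have no estimate.
    """
    return [
        (signal[i - 1] - 2 * signal[i] + signal[i + 1]) / step ** 2
        for i in range(1, len(signal) - 1)
    ]

# The three wavelength pairs (in nm) listed in the text.
RATIO_PAIRS = [(410, 450), (280, 260), (217, 197)]

def characteristic_ratios(spectrum):
    """spectrum maps a wavelength in nm to a signal value at that wavelength."""
    return {f"{a}/{b}": spectrum[a] / spectrum[b] for a, b in RATIO_PAIRS}

# Hypothetical values for one serum sample at one pH.
spectrum = {197: 2.0, 217: 1.0, 260: 0.5, 280: 1.0, 410: 0.25, 450: 0.5}
ratios = characteristic_ratios(spectrum)
curvature = second_derivative([0.0, 1.0, 4.0, 9.0, 16.0], step=1.0)
```

Each sample measured at three pH values would yield three such ratio dictionaries, one per pH group.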
The data from the previous step are again divided into groups: the serum ultraviolet detection data of healthy subjects form one group, and each type of cancer forms a separate group. Because the healthy-subject data serve as the training set, the stability of each of its samples must be ensured. The present invention checks this in two ways. First, the variance is checked by substituting the batch data into the formula $s^2 = \frac{1}{n}\left[(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2\right]$, where $s$ is the standard deviation, $x_n$ are the sample data and $m$ is the mean. Second, a t-test is used to check the significance of the differences between samples.
According to the quality of the differences, the best characteristic values are selected and substituted into the PLS algorithm for training. The X matrix holds the sample numbers of the ultraviolet detection data, and the Y matrix holds the selected characteristic values; because these are taken at different pH values, each sample number has three sets of values. Running the software yields three values: the fitting standard deviation, the cross-validation standard deviation and the prediction standard deviation. These three values can be used to assess the quality of the fit on the training set, and the difference between the test set and the training set can be used to predict how the serum characteristics of various cancers differ from those of normal human serum.
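The three error figures can be illustrated with the following Python sketch. It assumes that each "standard deviation" is a root-mean-square error (on the training set, under leave-one-out cross-validation, and on a separate test set), which is a common convention but is not stated in the text. The function names and numbers are hypothetical, and a trivial mean predictor stands in for the PLS model to keep the example short.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def loo_rmse(y, fit, predict):
    """Leave-one-out cross-validation error for a model given as two callables."""
    held_out = []
    for i in range(len(y)):
        model = fit(y[:i] + y[i + 1:])    # train without sample i
        held_out.append(predict(model))   # predict the sample that was left out
    return rmse(y, held_out)

# Stand-in model (NOT PLS): it always predicts the mean of its training data.
def fit_mean(train):
    return sum(train) / len(train)

def predict_mean(model):
    return model

train_y = [1.0, 2.0, 3.0]   # hypothetical training-set characteristic values
test_y = [2.0, 4.0]         # hypothetical test-set characteristic values

fitted = predict_mean(fit_mean(train_y))
fit_error = rmse(train_y, [fitted] * len(train_y))
cv_error = loo_rmse(train_y, fit_mean, predict_mean)
prediction_error = rmse(test_y, [fitted] * len(test_y))
```

A cross-validation error much larger than the fit error suggests overfitting, and a prediction error on a cancer group that is much larger than on healthy test samples is the kind of difference between test set and training set that the paragraph refers to.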

Claims (10)

1. A method for predicting spectroscopic detection data of cancer patients by Matlab-based partial least squares (PLS), characterized in that venous blood samples are collected from healthy subjects and cancer patients, the blood samples are processed and spectrally measured, and the detection data obtained are optimized and predicted by partial least squares regression on the Matlab platform.
2. The method according to claim 1, wherein the collected blood samples are centrifuged.
3. The method according to claim 2, wherein the serum obtained after centrifugation is diluted with buffer solutions of pH 4.00, pH 6.86 and pH 9.18.
4. The method according to claim 3, wherein the spectroscopic data of the serum are acquired with an ultraviolet-visible spectrometer.
5. The method according to claim 4, wherein a data input tool is developed for the spectra obtained, the development of the data input tool including identifying and removing non-numeric information in TXT files and correcting erroneous entries such as "NaN" and comma-separated numbers.
6. The method according to claim 5, wherein the imported data are optimized, the optimization including outputting the second-derivative spectrum and performing the variance calculation and the t-distribution calculation.
7. The method according to claim 1, wherein the PLS regression prediction of the data includes establishing a program for selecting characteristic values, characterized in that the prediction method outputs ASCII files of the serum ultraviolet detection data with the suffix .output.
8. The method according to claim 7, characterized in that the prediction method outputs data optimization result files, comprising an image file of the second-derivative spectrum and a text file that reports whether the t-test finds a significant difference, the confidence interval, the probability of the small-probability event, and the variance.
9. The method according to claim 8, characterized in that the selected characteristic values are the ratios between the values at the wavelengths 240 nm, 260 nm, 280 nm, 410 nm and 450 nm.
10. The prediction method according to claim 9, characterized in that, for the serum dilutions at pH 4.00, pH 6.86 and pH 9.18, the ratios between the values at the wavelengths 240 nm, 260 nm, 280 nm, 410 nm and 450 nm are output in three text files, each containing the serum sample numbers and the ratio information; the output files are named pH4.0.txt, pH6.86.txt and pH9.18.txt.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101047340A CN102760197A (en) 2011-04-26 2011-04-26 Prediction method for spectroscopic test data of cancer patients by partial least square method based on Matlab

Publications (1)

Publication Number Publication Date
CN102760197A true CN102760197A (en) 2012-10-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121031