CN102760197A

CN102760197A - Prediction method for spectroscopic test data of cancer patients by the partial least squares method based on Matlab

Info

Publication number: CN102760197A
Application number: CN2011101047340A
Authority: CN
Inventors: 曾红娟; 陈启宏; 王鑫
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2011-04-26
Filing date: 2011-04-26
Publication date: 2012-10-31

Abstract

The invention discloses a method for predicting spectroscopic test data of the serum of cancer patients using partial least squares (PLS) regression in Matlab. The method comprises the steps of: developing a tool program for batch input of the spectroscopic test data; developing a tool program that automatically converts the data to an ASCII (American Standard Code for Information Interchange) file; establishing a data optimization model; establishing a partial least squares regression model; and performing actual testing and prediction. Establishing the data optimization model comprises calculating a standard deviation and performing a t-distribution test. Establishing the partial least squares regression model comprises selecting characteristic values and predicting and testing those characteristic values.

Description

Matlab-based partial least squares prediction of spectroscopic detection data of cancer patients

Technical field

The present invention is a statistical prediction method, specifically a method for predicting the ultraviolet detection data of cancer patients using Matlab. The method belongs to the field of biometrics.

Background knowledge

At present, the most mature methods for the early diagnosis of cancer are imaging methods, including X-ray film, CT, MR, angiography and interventional radiology. Imaging methods can only detect a tumor effectively once the tumor tissue has reached a certain size; they are a form of detection at the tissue level. Tumor marker detection is used as a complementary examination. Although it works at the molecular level, tumor markers do not correspond uniquely to specific tumors, so the specificity of this method is low. A more effective method that can achieve early diagnosis of cancer at the molecular level is therefore needed, and the serum spectroscopy examination provided by the present invention is such a method.

However, because the molecular composition of human serum is complex and information must be extracted from many samples, a good algorithmic model is needed for data analysis and processing in order to make predictions for cancer patients. In a system with multicollinearity of this kind, the selected indices are related in unforeseen ways, which directly introduces considerable noise and uncertainty. Comparing common multiple regression algorithms with PLS, we find that PLS can choose, from all the indices, those most closely related to the dependent variable and, through dimensionality reduction, reduce noise and overcome the adverse effects of multicollinearity. These characteristics make PLS superior to other common regression algorithms when dealing with multicollinearity and with too few sample points. For this reason, we adopt PLS to analyze and process the serum spectra of cancer patients, so as to achieve early diagnosis of cancer at the molecular level.

Summary of the invention

The present invention uses a Matlab-based partial least squares regression method to predict the spectroscopic detection data of cancer patients. The workflow is shown in Fig. 1.

The present invention first requires collecting venous blood samples from healthy persons and cancer patients and centrifuging the blood samples. The serum obtained after centrifugation is then diluted with buffer solutions of different pH, and a spectroscopic instrument is used to perform spectral detection on the serum samples at each pH to obtain their spectral detection data.

The present invention inputs the spectroscopic detection data into Matlab in batches and converts them into an ASCII file, preprocesses and further optimizes the resulting spectroscopic data by the partial least squares regression method, and finally sets up a prediction model and predicts the data.

The present invention preprocesses the acquired spectroscopic data as follows:

1. Set up a loop-based batch input tool with which all data files are selected in a single operation;

2. Set up a tool that automatically identifies the ultraviolet detection data in each file, removes superfluous text, and creates an ASCII document;

3. Set up a second derivative spectrum model and automatically choose the characteristic values on the spectrum.

The present invention further optimizes the preprocessed result as follows:

1. Calculate the standard deviation of the data and perform a t-test, and output a test report that is used to judge the stability of the data;

The standard deviation is given by formula (1):

$s^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ (1)

where $s$ is the standard deviation, $X_i$ is the characteristic value of the $i$-th sample, $\bar{X}$ is the mean of all samples, and $n$ is the number of samples.

Let the characteristic values of the samples be $[X_1, X_2, X_3, \ldots, X_n]$ with mean $\bar{X}$; substituting them into formula (1) gives the standard deviation. The standard deviation is used to judge how much the data within a sample set differ: the larger the value, the larger the fluctuation and the less stable the data;

2. The t-distribution hypothesis test is built on μ (the population mean) and σ (the population standard deviation), which determine the position and shape of the normal distribution. If samples of a fixed size n are drawn from a normally distributed population (in general, n ≤ 200 for spectral detection experiments), the sample mean still follows a normal distribution, i.e. $N(\mu, \sigma^2/n)$. Because σ is often unknown in practice, the sample standard deviation is commonly used as an estimate of σ for the test. In other words, the t-distribution test checks whether the differences between the data are significant and how probable a small-probability event is, so the t value gives a direct indication of data stability.
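The two stability checks above can be sketched in a few lines. The patent works in Matlab and does not give its code, so the following is only an illustrative Python sketch using the standard library. The function names `population_std` and `one_sample_t` are hypothetical, and the choice of a one-sample t statistic is an assumption, since the text does not say which variant of the t-test is used. Looking up the probability for the resulting t value is omitted.

```python
import math

def population_std(values):
    """Standard deviation per formula (1): square root of the mean squared deviation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def one_sample_t(values, mu0):
    """t statistic for H0: population mean == mu0 (assumed test variant)."""
    n = len(values)
    mean = sum(values) / n
    # Sigma is unknown, so the sample standard deviation (n - 1 denominator)
    # is used as its estimate, as described in the text.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1  # t value and degrees of freedom

# Example: characteristic values of one group of samples (made-up numbers).
ratios = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(ratios))
print(one_sample_t(ratios, 5))
```

A larger standard deviation indicates a less stable group; the t value would then be compared against the t-distribution with the returned degrees of freedom.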

The present invention sets up the partial least squares regression prediction model and predicts the data as follows:

The model set up by partial least squares regression analysis is a bilinear model comprising outer blocks (a separate X block and a separate Y block) and an inner block between them (the block relating X to Y). The model used in the present invention modifies the latent variables of X so that their covariance with Y is maximized; that is, data whose eigenvalues are close to zero are deleted. The procedure is implemented as follows:

[10] Let the regression model be $Y = XB$, where $B = W(P^{T}W)^{-1}Q^{T}$ ($W$ is the weight matrix, $P$ is the loading matrix of X, and $Q$ is the loading matrix of Y);

[20] $Y = UQ^{T} + F = \sum_{a} u_a q_a^{T} + F$ ($U$ is the score matrix of Y, $u_a$ is a score vector of Y, $q_a$ is a loading vector, and $F$ is the residual);

[30] $X = TP^{T} + E = \sum_{a} t_a p_a^{T} + E$ ($T$ is the score matrix of X, $t_a$ is a score vector of X, $p_a$ is a loading vector, and $E$ is the residual);

[40] Extract the latent vectors of Y and X. At dimension $l = 0$, center the data: $X = X_{original} - \bar{x}$ ($\bar{x}$ is the mean); $Y = Y_{original} - \bar{y}$ ($\bar{y}$ is the mean);

[50] Using PCA, estimate iteratively for dimensions $l = 1$ to $l = d$:

[60] Take the first column of Y as the initial score vector $u$, i.e. $u = y_1$;

[70] Calculate the weights of X: $w^{T} = u^{T}X / (u^{T}u)$;

[80] Normalize the weights: $w^{T} = w^{T} / (w^{T}w)^{1/2}$;

[90] Estimate the score vector of the X matrix: $t = Xw$;

[100] Calculate the loading vector of the Y matrix: $q^{T} = t^{T}Y / (t^{T}t)$;

[110] Calculate the score vector of the Y matrix: $u = Yq / (q^{T}q)$. If $\|u_{new} - u_{old}\| < \|u_{new}\| \cdot \text{threshold}$, the iteration has converged and the loop stops; the threshold is determined by the precision of the computer;

[120] Calculate the inner relation coefficient: $b = u^{T}t / (t^{T}t)$;

[130] Calculate the loading vector of the X matrix: $p^{T} = t^{T}X / (t^{T}t)$;

[140] Calculate the residuals of the X and Y matrices: $E = X - tp^{T}$, $F = Y - uq^{T}$. Calculate the standard deviation $R_{ev}$; if it is greater than the expected accuracy, the optimal dimension has been reached and the final $B$ is obtained.
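Steps [40] and [60] to [140] can be sketched for a single latent variable as follows. This is a minimal illustration in plain Python rather than the patent's Matlab program, which is not disclosed. The helper names (`center`, `dot`, `t_times`, `times`, `nipals_component`), the relative convergence test, and the default tolerance and iteration limit are all assumptions made for the sketch. Matrices are lists of rows.

```python
import math

def center(M):
    """Step [40]: subtract the column means; returns (centered matrix, means)."""
    means = [sum(col) / len(col) for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M], means

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def t_times(v, M):
    """Row vector v^T M, returned as a list."""
    return [dot(v, col) for col in zip(*M)]

def times(M, v):
    """Column vector M v, returned as a list."""
    return [dot(row, v) for row in M]

def nipals_component(X, Y, tol=1e-10, max_iter=500):
    """Steps [60]-[140] for one latent variable; X and Y must already be centered."""
    u = [row[0] for row in Y]                          # [60] u = y_1
    for _ in range(max_iter):
        w = [v / dot(u, u) for v in t_times(u, X)]     # [70] weights of X
        norm_w = math.sqrt(dot(w, w))
        w = [v / norm_w for v in w]                    # [80] normalize
        t = times(X, w)                                # [90] scores of X
        q = [v / dot(t, t) for v in t_times(t, Y)]     # [100] loadings of Y
        u_new = [v / dot(q, q) for v in times(Y, q)]   # [110] scores of Y
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u_new, u)))
        u = u_new
        if diff < tol * math.sqrt(dot(u, u)):          # assumed relative test
            break
    b = dot(u, t) / dot(t, t)                          # [120] inner relation
    p = [v / dot(t, t) for v in t_times(t, X)]         # [130] loadings of X
    # [140] residuals E = X - t p^T and F = Y - u q^T
    E = [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(X, t)]
    F = [[y - ui * qj for y, qj in zip(row, q)] for row, ui in zip(Y, u)]
    return {"w": w, "t": t, "q": q, "u": u, "p": p, "b": b, "E": E, "F": F}

# Made-up example data: 4 samples, 2 X variables, 1 Y variable.
Xc, _ = center([[1, 2], [2, 1], [3, 4], [4, 3]])
Yc, _ = center([[1], [2], [3], [4]])
result = nipals_component(Xc, Yc)
print(result["w"], result["b"])
```

To extract further dimensions, the same function would be called again on the residuals. Many descriptions of PLS deflate Y with $b\,t\,q^{T}$ instead of $u q^{T}$; the sketch follows the residual formula given in step [140].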

Description of drawings

Fig. 1: Flow chart of the prediction of cancer patient spectroscopic detection data by the Matlab-based partial least squares regression method of the present invention.

Embodiment

The above invention is realized through the following techniques:

The process of handling ultraviolet spectrum detection data in Matlab, optimizing the data, and using the data for least squares prediction is as follows:

Because different ultraviolet detection instruments do not output data in a unified standard format, the text in the files must be removed and the data converted into an ASCII file. The main problems encountered during conversion are the non-data flag 'NaN' and the commas used as thousands separators, as in '1,000' and '2,000'. 'NaN' causes the least squares computation to report an error, and a thousands comma causes the digits before and after it to be read as separate numbers; for example, '2,300' is read as the two numbers '2' and '300'. These are corrected as the data are input: 'NaN' is replaced with '0' and the commas are removed. The output is an ASCII file with the custom suffix .output.
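The cleaning rules described above ('NaN' replaced with 0, thousands commas removed, other text dropped) can be illustrated with a short sketch. It is written in Python rather than Matlab, and the function names `clean_line` and `to_ascii`, the regular expression, and the space-separated output layout are assumptions, since the patent does not specify the instrument file format or the layout of the .output file.

```python
import re

# Matches, in order of preference: the literal flag NaN, a number with
# thousands commas (e.g. 2,300 or 12,345.6), or a plain number.
_TOKEN = re.compile(
    r"NaN|-?\d{1,3}(?:,\d{3})+(?!\d)(?:\.\d+)?|-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
)

def clean_line(line):
    """Return the numeric values found in one line of instrument text output."""
    values = []
    for token in _TOKEN.findall(line):
        if token == "NaN":
            values.append(0.0)                            # 'NaN' is replaced with 0
        else:
            values.append(float(token.replace(",", "")))  # drop thousands commas
    return values

def to_ascii(text):
    """Convert a whole text export to plain rows of space-separated numbers."""
    rows = [clean_line(line) for line in text.splitlines()]
    return "\n".join(" ".join(repr(v) for v in row) for row in rows if row)

print(clean_line("450 nm: 2,300 NaN 0.125"))
print(to_ascii("Sample A\n280 1,000\n260 NaN"))
```

One limitation of this sketch: if an instrument separates columns with commas and a column happens to hold exactly three digits, the pattern cannot distinguish it from a thousands separator, so a real tool would need to know the instrument's delimiter.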

The spectroscopic detection data are optimized by transforming them into second derivative spectra. The samples are divided into groups according to the pH value at which they were measured. The characteristic wavelengths in the spectroscopic detection data, such as 450 nm, 280 nm, 260 nm, 217 nm and 197 nm, are characteristic wavelengths of substances in biological macromolecules such as heme, proteins, nucleic acids and protein β-sheets. For all data, therefore, the ratios of the values at 410/450, 280/260 and 217/197 are uniformly chosen as the data for the next optimization step.
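As a rough illustration of this step, the sketch below computes a second derivative by central finite differences and then forms the three wavelength ratios. It is written in Python, not Matlab. It assumes a uniformly sampled spectrum, and the patent does not state which numerical differentiation scheme it uses or whether the ratios are taken on the raw or the differentiated values, so both functions (`second_derivative`, `band_ratios`) are hypothetical helpers.

```python
def second_derivative(absorbance, step):
    """Central-difference second derivative of a uniformly sampled spectrum.

    absorbance: list of values at equally spaced wavelengths
    step: wavelength spacing in nm
    The result has two fewer points than the input (the end points are dropped).
    """
    return [
        (absorbance[i - 1] - 2 * absorbance[i] + absorbance[i + 1]) / step ** 2
        for i in range(1, len(absorbance) - 1)
    ]

def band_ratios(spectrum, pairs=((410, 450), (280, 260), (217, 197))):
    """spectrum: dict mapping wavelength in nm to a value; returns one ratio per pair."""
    return {f"{a}/{b}": spectrum[a] / spectrum[b] for a, b in pairs}

# Made-up values for illustration only.
print(second_derivative([0, 1, 4, 9, 16], 1))
print(band_ratios({410: 0.5, 450: 0.25, 280: 1.0, 260: 0.5, 217: 3.0, 197: 1.5}))
```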

The data from the previous step are again divided into groups: the serum ultraviolet detection data of healthy persons form one group, and each type of cancer forms its own group. Because the serum ultraviolet detection data of healthy persons serve as the training set, the stability of each of its samples must be guaranteed. The present invention checks this by two methods. First, the standard deviation is checked by substituting the batch data into the formula $s^2 = \frac{1}{n}[(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2]$ (where $s$ is the standard deviation, $x_1, \ldots, x_n$ are the sample data, and $m$ is the mean). Second, the t-distribution is used to test the significance of the differences between samples.

According to the degree of difference, the best characteristic values are selected and substituted into the PLS algorithm for training. The X matrix holds the numbering of the ultraviolet detection data, and the Y matrix holds the selected characteristic values; because the pH values differ, each number has three sets of values. Running the software yields three values: the fitting standard deviation, the cross-validation standard deviation and the prediction standard deviation. These three values can be used to analyze the quality of the fit to the training set, and the difference between the test set and the training set can be used to predict how the serum characteristics of the various cancers differ from those of normal human serum.

Claims

1. A Matlab-based partial least squares method for predicting spectroscopic detection data of cancer patients, characterized in that venous blood samples are collected from healthy persons and cancer patients, the blood samples are processed and subjected to spectral detection, and the detection data obtained are optimized, processed and predicted on the Matlab platform using the partial least squares regression method.

2. The method according to claim 1, wherein the collected blood samples are centrifuged.

3. The method according to claim 2, wherein the serum obtained after centrifugation is diluted with buffer solutions of pH 4.00, pH 6.86 and pH 9.18.

4. The method according to claim 3, wherein the spectroscopic data of the serum are collected with an ultraviolet-visible spectrometer.

5. The method according to claim 4, wherein a data input tool is developed for the acquired spectra, the development of the data input tool comprising identifying and stripping out the non-numeric information in the TXT file and correcting erroneous entries such as "NaN" and comma-separated numbers.

6. The method according to claim 5, wherein the input data are optimized, the optimization comprising outputting the second derivative spectrum and performing the standard deviation calculation and the t-distribution calculation.

7. The method according to claim 1, wherein prediction of the data by the PLS regression method comprises setting up a program for choosing the characteristic values, characterized in that the prediction method outputs an ASCII file of the serum ultraviolet detection data with the suffix .output.

8. The method according to claim 7, characterized in that the prediction method outputs data optimization result files, comprising an image file of the second derivative spectrum and a text file containing whether the t-distribution test finds a significant difference, the confidence interval, the probability of the small-probability event, and the standard deviation.

9. The method according to claim 8, characterized in that the chosen characteristic values are the ratios between the values at wavelengths of 240 nm, 260 nm, 280 nm, 410 nm and 450 nm.

10. The prediction method according to claim 9, characterized in that, for the serum dilutions at pH 4.00, pH 6.86 and pH 9.18, three text files are output, each containing the numbering of the serum data and the ratio information for the ratios between the values at wavelengths of 240 nm, 260 nm, 280 nm, 410 nm and 450 nm; the output files are named pH4.0.txt, pH6.86.txt and pH9.18.txt.