CN101215602A - Method for screening gene chip difference expression gene - Google Patents

Method for screening differentially expressed genes from gene chips

Info

Publication number
CN101215602A
Authority
CN
China
Prior art keywords
gene
screening
sigma
chip
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101735861A
Other languages
Chinese (zh)
Other versions
CN101215602B (en
Inventor
刘极龙
曾华宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cluster Biotech Co., Ltd.
Original Assignee
SHANGHAI SENSICHIP TECH&INFOR CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI SENSICHIP TECH&INFOR CO Ltd filed Critical SHANGHAI SENSICHIP TECH&INFOR CO Ltd
Priority to CN2007101735861A priority Critical patent/CN101215602B/en
Publication of CN101215602A publication Critical patent/CN101215602A/en
Application granted granted Critical
Publication of CN101215602B publication Critical patent/CN101215602B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for screening differentially expressed genes from gene chip data, and relates to an algorithm for screening differentially expressed genes in gene chip differential expression analysis. The method comprises the following steps: first, normalize the chip data; second, build a linear model of the log ratio, x_ij = μ + μ_j + ε; third, calculate the global mean μ, the column effects μ_j and the variance σ; fourth, use μ, μ_j and σ to calculate 2 × ln(odds ratio) for each gene; fifth, set the threshold χ²(cutoff, n), and define as differentially expressed any gene whose 2 × ln(odds ratio) value from the fourth step exceeds χ²(cutoff, n). By building a statistical model and designing a suitable statistic, the invention uses hypothesis testing to give each gene a significance probability, which serves as the criterion for screening genes. The method overcomes the weaknesses of the conventional fold-change method, which lacks a statistical basis and whose sensitivity and specificity are difficult to assess.

Description

A method for screening differentially expressed genes from gene chips
Technical field
This patent relates to an algorithm for screening differentially expressed genes in gene chip data analysis. The algorithm is suited to small-sample gene chip experimental designs that lack replicates.
Background
A gene chip, also called a gene microarray, is a substrate on which many oligonucleotides or cDNA fragments of known sequence are arranged in a regular pattern. The test sample is labeled and then hybridized to the nucleotide sequences on the chip according to the base-pairing principle. A fluorescence detection system scans the chip, and a computer system detects and compares the fluorescent signal at each probe, so experimental results are obtained quickly. A gene chip can measure the expression levels of tens of thousands of genes quickly, accurately and efficiently in a single experiment, and the amount of sample required is greatly reduced. Gene chip technology is one of the most advanced and most effective tools in current gene research, and it is widely applied in life science research and practice, medical research and clinical work, drug design, environmental protection, agriculture, the military, and other fields.
Screening for differentially expressed genes is the most critical step in gene chip analysis. For unreplicated chip data from two samples, the fold-change method (Gerhold D, Lu M, Xu J, Austin C, Caskey CT, Rushmore T. Monitoring expression of genes involved in drug metabolism and toxicology using DNA microarrays. Physiol Genomics 2001; 5: 161-170) or the z-score method (Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using z score transformation. J Mol Diagn 2003; 5: 73-81) can be used. For replicated chip data from two samples, the fold-change method or a t-test (Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17: 509-519) can be used. For replicated chip data from multiple groups, analysis of variance (Pavlidis P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods 2003; 31(4): 282-9) can be used. For long time series, curve fitting is generally used (Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW. Significance analysis of time course microarray experiments. Proc Natl Acad Sci U S A. 2005, 102(36): 12837-4).
In practice, however, gene chips are expensive, so investigators can often afford only designs with a small number of samples (sample size < 6), with each sample run once or with two technical replicates. Such small-sample designs lacking replicates are currently very common. There is at present no good analysis method for this kind of chip data. The fold-change method is mainly used, but it is an empirical algorithm whose sensitivity and specificity are difficult to assess, and it can lead to large experimental error. To fill this technical gap, this patent proposes a new algorithm based on a statistical model for screening differentially expressed genes from small-sample, unreplicated gene chip data.
Summary of the invention
The invention provides a method for screening genes by building a statistical model.
The invention is implemented mainly through the following steps: step 1, normalize the chip data; step 2, build the linear model of the log ratio, x_ij = μ + μ_j + ε; step 3, calculate the global mean μ, the column effects μ_j and the variance σ; step 4, use μ, μ_j and σ to calculate 2 × ln(odds ratio) for each gene; step 5, set the threshold χ²(cutoff, n), and designate as differentially expressed any gene whose 2 × ln(odds ratio) value from step 4 exceeds χ²(cutoff, n).
Note: ln() is the natural logarithm, with base e.
The advantage of the invention is that, by building a statistical model and designing a suitable statistic, it uses hypothesis testing to give each gene a significance probability, which serves as the criterion for screening genes. The method overcomes the weaknesses of the conventional fold-change method, which lacks a statistical basis and whose sensitivity and specificity are difficult to assess.
Description of drawings
Fig. 1 is a flowchart of the method of the invention for screening differentially expressed genes from gene chips.
Embodiment
The method is described in detail below:
First, use chip scan image processing software (for example, GenePix Pro 4.0) to obtain gene-level expression values. Next, normalize the data between chips. Then convert the chip signal values to ratios relative to the control experiment, and take the logarithm of each ratio (base e is preferred). This log ratio (ln ratio) is the basis of the analysis.
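The conversion from signal values to ln ratios can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `log_ratios` and the toy signal values are hypothetical, and the inputs are taken to be already normalized, positive signals for the same genes in the same order.

```python
import math

def log_ratios(sample_signals, control_signals):
    """Return ln(sample / control) for each gene.

    Both arguments are equal-length lists of positive, already normalized
    signal values (hypothetical inputs used only for illustration).
    """
    if len(sample_signals) != len(control_signals):
        raise ValueError("sample and control must list the same genes")
    return [math.log(s / c) for s, c in zip(sample_signals, control_signals)]

# Three genes: unchanged, doubled, and halved relative to the control.
ratios = log_ratios([100.0, 200.0, 50.0], [100.0, 100.0, 100.0])
```

On the log scale an unchanged gene maps to 0, and a doubling and a halving map to values of equal size and opposite sign, which is why the model that follows can expect most values to sit near 0.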
Suppose there are n gene chips (corresponding to n samples; typically 1 < n ≤ 5), each with m genes. This gives a numerical matrix:
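$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$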
where x_ij is the ln ratio value of the i-th gene (1 ≤ i ≤ m) on the j-th chip (1 ≤ j ≤ n).
We then build a linear model:
x_ij = μ + μ_j + ε ②
where μ is the global mean, μ_j is the column effect, and ε is the residual. We assume ε ~ N(0, σ²); that is, across chips the residual ε follows a normal distribution with mean 0 and variance σ². The variance reflects the average of the within-chip variances over all chips. The column effect μ_j is the parameter that captures differences between chips. μ is the global mean; since in general the expression values of most genes on a chip are unchanged, μ is close to 0. The model thus decomposes the expression value of a gene on a chip into a global effect, a column (chip) effect and a residual.
Estimating the model parameters:
Using maximum likelihood estimation, the estimate of μ is the global mean:
$$\hat{\mu} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} x_{ij}}{mn}$$
μ_j is the column effect, that is, the mean of each column (each chip) minus the global mean (in the formula below, the global-mean term is close to 0):
$$\hat{\mu}_j = \frac{\sum_{i=1}^{m} x_{ij}}{m} - \frac{\sum_{i=1}^{m}\sum_{k=1}^{n} x_{ik}}{mn}$$
The estimate of σ is taken from the within-group (within-chip) variance:
$$\hat{\sigma}^2 = \frac{\sum_{j=1}^{n}\sum_{i=1}^{m}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m} x_{kj}\right)^2}{mn}$$
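The three estimates can be sketched in a few lines of Python. This is an illustration, not the patent's implementation: the function name `estimate_parameters` and the 2 × 2 toy matrix are hypothetical, and σ is taken to be the square root of the within-chip variance, on the assumption that ε ~ N(0, σ²) makes σ a standard deviation.

```python
import math

def estimate_parameters(x):
    """Estimate mu, the column effects mu_j, and sigma from an m x n matrix.

    x is a list of m rows (genes), each a list of n ln-ratio values (chips).
    """
    m, n = len(x), len(x[0])
    col_means = [sum(row[j] for row in x) / m for j in range(n)]
    mu = sum(col_means) / n                  # global mean over all m*n values
    mu_j = [cm - mu for cm in col_means]     # column (chip) effects
    # Within-chip variance: squared deviations from each chip's own mean.
    ss = sum((row[j] - col_means[j]) ** 2 for row in x for j in range(n))
    sigma = math.sqrt(ss / (m * n))          # assumed: sigma is a standard deviation
    return mu, mu_j, sigma

# Toy data: two genes on two chips.
x = [[0.0, 1.0], [2.0, 3.0]]
mu, mu_j, sigma = estimate_parameters(x)
```

Because every chip contributes the same number of genes, averaging the column means gives the same global mean as averaging all m × n values directly.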
Setting up the hypothesis test:
For each gene i,
H0: x_i1, x_i2, ..., x_in is an instance of the above linear model.
H1: x_i1, x_i2, ..., x_in is completely independent of the above linear model.
We use p(x_i1, x_i2, ..., x_in | μ, μ_j, σ, H0) to denote the probability that gene i is an instance of this linear model (the population distribution), and p(x_i1, x_i2, ..., x_in | μ, μ_j, σ, H1) to denote the probability that gene i does not come from this linear model but from some other model (distribution). By convention, the departure of the data from the model is expressed with the odds ratio:
$$\text{odds ratio} = \frac{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_1)}{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_0)}$$
The larger the odds ratio, the more clearly gene i departs from the population distribution, and the more likely it is one of the differentially expressed genes being sought.
For gene i, 2 × ln(odds ratio) can then be written as:
$$OR_i = 2\ln\left[\frac{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_1)}{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_0)}\right]$$
$$= 2\ln\left[\frac{\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{x_{ij}-x_{ij}}{\sigma}\right)^2}}{\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{x_{ij}-\mu-\mu_j}{\sigma}\right)^2}}\right] = \sum_{j=1}^{n}\left(\frac{x_{ij}-\mu-\mu_j}{\sigma}\right)^2$$
In the formula above, the odds ratio is computed from joint probabilities. The final result shows that the statistic OR_i follows a χ² distribution with n degrees of freedom. It is therefore reasonable to use 2 × ln(odds ratio) as the measure of a gene's differential expression, and its significance can be given by a χ² test, as follows:
Set a threshold (cutoff); the preferred cutoff value is 0.01. If OR_i > χ²(cutoff, n), that is, p < 0.01, gene i is considered a differentially expressed gene.
By computing OR_i for each gene and comparing it with the χ² threshold χ²(cutoff, n), all differentially expressed genes can be screened out.
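The step from the ratio of joint densities to the sum of squares can be written out explicitly. The sketch below assumes, as the numerator of the formula for OR_i suggests, that under H1 each observation is its own mean, so every exponent in the numerator is zero. The shorthand z_ij is introduced here for readability and does not appear in the original.

```latex
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \mu - \mu_j}{\sigma}, \\
OR_i &= 2\ln\prod_{j=1}^{n}
        \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{0}}
             {\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2} z_{ij}^{2}}}
      = 2\sum_{j=1}^{n}\ln e^{\frac{1}{2} z_{ij}^{2}}
      = 2\sum_{j=1}^{n}\tfrac{1}{2}\,z_{ij}^{2}
      = \sum_{j=1}^{n} z_{ij}^{2}.
\end{aligned}
```

Under H0, and treating the estimated parameters as known, each z_ij is an independent standard normal variable, so the sum of n squared terms follows a χ² distribution with n degrees of freedom.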
Example one: expression profile gene chip data from Affymetrix, four samples without replicates:
Obtain the gene-level expression data. Convert the chip signal values to ratios relative to the control experiment. Take the logarithm of each ratio.
Build the linear model x_ij = μ + μ_j + ε, j = 1, ..., 4, where μ is the global mean, μ_j is the column effect, ε is the residual, and ε ~ N(0, σ²).
Calculate the estimates of the linear model parameters μ, μ_j and σ. These estimates are used to compute the statistic 2 × ln(odds ratio).
For each gene i, compute its 2 × ln(odds ratio) value with the formula
$$\sum_{j=1}^{n}\left(\frac{x_{ij}-\mu-\mu_j}{\sigma}\right)^2$$
This value reflects how far the expression data of gene i depart from the population distribution, and it follows a χ² distribution with n = 4 degrees of freedom.
Set cutoff = 0.01; the χ² distribution table gives χ²(0.01, 4) = 13.28. That is, when the statistic is greater than 13.28, p < 0.01.
Genes whose 2 × ln(odds ratio) value is greater than 13.28 (equivalent to p < 0.01) are screened as differentially expressed genes.
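The screening step of example one can be sketched in code as follows. This is a minimal illustration, not the patent's implementation: the function names `chi2_sf` and `screen_genes`, the toy data, and the parameter values (μ = 0, all μ_j = 0, σ = 1) are assumptions made for the example. Instead of a table lookup, it computes the χ² upper-tail probability from the regularized incomplete gamma function using only the Python standard library.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X > stat) for a chi-square variable with df degrees of freedom."""
    if stat <= 0.0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    log_prefix = -x + a * math.log(x)
    if x < a + 1.0:
        # Series expansion of the regularized lower incomplete gamma P(a, x).
        term = math.exp(log_prefix - math.lgamma(a + 1.0))
        total, k = term, 1
        while term > 1e-16 * total:
            term *= x / (a + k)
            total += term
            k += 1
        return max(0.0, 1.0 - total)
    # Continued fraction (modified Lentz) for the upper function Q(a, x).
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return math.exp(log_prefix - math.lgamma(a)) * h

def screen_genes(x, mu, mu_j, sigma, alpha=0.01):
    """Return (gene index, statistic, p-value) for every gene with p < alpha."""
    n = len(mu_j)
    hits = []
    for i, row in enumerate(x):
        stat = sum(((row[j] - mu - mu_j[j]) / sigma) ** 2 for j in range(n))
        p = chi2_sf(stat, n)
        if p < alpha:
            hits.append((i, stat, p))
    return hits

# Toy ln-ratio data: gene 0 is unchanged, gene 1 is shifted on all four chips.
x = [[0.1, -0.2, 0.0, 0.1], [2.5, 2.0, 2.2, 2.4]]
hits = screen_genes(x, mu=0.0, mu_j=[0.0, 0.0, 0.0, 0.0], sigma=1.0)
```

Comparing the p-value with the cutoff is equivalent to comparing the statistic with the tabulated threshold; here only gene 1, whose statistic is 20.85, passes the 13.28 threshold for four degrees of freedom.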
Example two: gene chip data from five samples with two replicates each:
Obtain the gene-level expression data. Convert the chip signal values to ratios relative to the control experiment. Take the logarithm of each ratio.
Build the linear model x_ij = μ + μ_j + ε, j = 1, ..., 5.
Calculate the estimates of the linear model parameters μ, μ_j and σ.
For each gene i, compute its 2 × ln(odds ratio) value with the formula
$$\sum_{j=1}^{n}\left(\frac{x_{ij}-\mu-\mu_j}{\sigma}\right)^2$$
Set cutoff = 0.01; the χ² distribution table gives χ²(0.01, 10) = 23.21.
Genes whose 2 × ln(odds ratio) value is greater than 23.21 are screened as differentially expressed genes.
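The two tabulated thresholds can be checked without a distribution table. For an even number of degrees of freedom the χ² upper-tail probability has a closed form, exp(-s/2) multiplied by the sum of (s/2)^i / i! for i from 0 to df/2 - 1, which the sketch below evaluates. The function name `chi2_sf_even` is hypothetical, and the check covers only the even cases used in the two examples.

```python
import math

def chi2_sf_even(stat, df):
    """Upper-tail probability of a chi-square variable when df is a positive even integer."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (stat/2)**i / i! incrementally
        total += term
    return math.exp(-half) * total

p_example_one = chi2_sf_even(13.28, 4)    # threshold used with 4 degrees of freedom
p_example_two = chi2_sf_even(23.21, 10)   # threshold used with 10 degrees of freedom
```

Both probabilities come out very close to 0.01, which agrees with the thresholds quoted from the χ² distribution table.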
The above describes the invention and does not limit it; other embodiments based on the inventive concept all fall within the scope of protection of the invention.

Claims (3)

1. A method for screening differentially expressed genes from gene chips, characterized in that the method comprises the following steps:
Step 1, normalize the chip data;
Step 2, build the linear model of the log ratio, x_ij = μ + μ_j + ε;
Step 3, calculate the global mean μ, the column effects μ_j and the variance σ;
Step 4, use μ, μ_j and σ to calculate 2 × ln(odds ratio) for each gene;
Step 5, set the threshold χ²(cutoff, n), and designate as differentially expressed any gene whose 2 × ln(odds ratio) value from step 4 exceeds χ²(cutoff, n).
2. The method for screening differentially expressed genes from gene chips according to claim 1, characterized in that: in step 1, the number of samples when normalizing the gene chip data is between 1 and 5.
3. The method for screening differentially expressed genes from gene chips according to claim 1, characterized in that: in step 3 or step 4,
$$\text{odds ratio} = \frac{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_1)}{p(x_{i1}, x_{i2}, \ldots, x_{in} \mid \mu, \mu_j, \sigma, H_0)}$$
CN2007101735861A 2007-12-28 2007-12-28 Method for screening gene chip difference expression gene Expired - Fee Related CN101215602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101735861A CN101215602B (en) 2007-12-28 2007-12-28 Method for screening gene chip difference expression gene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101735861A CN101215602B (en) 2007-12-28 2007-12-28 Method for screening gene chip difference expression gene

Publications (2)

Publication Number Publication Date
CN101215602A true CN101215602A (en) 2008-07-09
CN101215602B CN101215602B (en) 2013-01-23

Family

ID=39622109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101735861A Expired - Fee Related CN101215602B (en) 2007-12-28 2007-12-28 Method for screening gene chip difference expression gene

Country Status (1)

Country Link
CN (1) CN101215602B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102286464A (en) * 2011-06-30 2011-12-21 眭维国 Uremia long-chain non-coding ribonucleic acid difference expression spectrum model and construction method thereof
CN106777870A (en) * 2016-11-18 2017-05-31 邹欣 A kind of noise reducing algorithm for unicellular transcript profile data
CN110880355A (en) * 2019-11-26 2020-03-13 苏州大学 Sensitive gene discovery method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KIM SY等: "Comparison of various……microarray data.", 《STAT METHODS MED RES》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102286464A (en) * 2011-06-30 2011-12-21 眭维国 Uremia long-chain non-coding ribonucleic acid difference expression spectrum model and construction method thereof
CN102286464B (en) * 2011-06-30 2013-07-17 眭维国 Uremia long-chain non-coding ribonucleic acid difference expression spectrum model and construction method thereof
CN106777870A (en) * 2016-11-18 2017-05-31 邹欣 A kind of noise reducing algorithm for unicellular transcript profile data
CN110880355A (en) * 2019-11-26 2020-03-13 苏州大学 Sensitive gene discovery method, device and storage medium
CN110880355B (en) * 2019-11-26 2023-08-01 苏州大学 Sensitivity gene discovery method, device and storage medium

Also Published As

Publication number Publication date
CN101215602B (en) 2013-01-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: CLUSTER BIOTECHNOLOGY CO., LTD. SHANGHAI

Free format text: FORMER OWNER: SHANGHAI SENSICHIP TECH+INFOR CO., LTD.

Effective date: 20130312

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 200333 PUTUO, SHANGHAI TO: 200433 YANGPU, SHANGHAI

TR01 Transfer of patent right

Effective date of registration: 20130312

Address after: 200433, room 4, building 200, No. 303 East State Road, Shanghai, Yangpu District

Patentee after: Shanghai Cluster Biotech Co., Ltd.

Address before: 801 room 80, 999 lane, South Qilian Mountains Road, Shanghai, 200333

Patentee before: Shanghai SensiChip Tech&infor Co., Ltd.

C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130123

Termination date: 20131228