CN108052797A - Detection method applied to Binding site for transcription factor on chromosome in tissue samples - Google Patents

Detection method applied to Binding site for transcription factor on chromosome in tissue samples

Info

Publication number
CN108052797A
Authority
CN
China
Prior art keywords
dna
base
subsequence
transcription factor
binding site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711464358.XA
Other languages
Chinese (zh)
Inventor
李旦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiayin Biological Technology Co Ltd
Original Assignee
Shanghai Jiayin Biological Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiayin Biological Technology Co Ltd
Priority to CN201711464358.XA
Publication of CN108052797A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Abstract

The present invention relates to a method for detecting transcription factor binding sites on chromosomes in tissue samples, comprising data preprocessing, splitting of short DNA sequences, mean detection and probability detection. Compared with existing detection algorithms, the method improves the performance of transcription factor binding site identification on ChIP-seq data, takes less time, and can accurately identify both known and new transcription factor binding sites, providing a new technical means and an important tool for transcription factor research.

Description

Detection method applied to Binding site for transcription factor on chromosome in tissue samples
Technical field
The invention belongs to the technical field of immunoassay, and in particular relates to a method for detecting transcription factor binding sites on chromosomes in tissue samples.
Background
In recent years, "big data" has become one of the most commonly used terms. Since the 1990s, bioinformatics has grown from its beginnings in DNA and protein sequence analysis and expanded into every field of biology. Biological data has grown at an astonishing rate, and biology has also entered the era of "big data".
Transcription is the first stage of gene expression and the main stage of gene regulation: transcription factors bind to specific sequences and thereby repress or enhance the expression of genes. Identifying these binding regions in a sequence, i.e. transcription factor binding site identification, is important for understanding the transcriptional activity of genes and for understanding gene expression, and it is one of the most widely studied problems in bioinformatics today.
The difficulty of transcription factor binding site identification is as follows. Compared with the large amount of background noise sequence, which is hundreds or thousands of bases long, the motif signal is relatively short, at a dozen to a few tens of bases, and instances of the motif of the same transcription factor may also partially vary. Meanwhile, as sequence length and number increase, the size of the solution space grows rapidly, and the computational cost is often impractical. In addition, identifying multiple transcription factor binding sites within a binding region, finding specific co-regulating transcription factors, and finding binding sites and combinations of binding sites genome-wide are also major challenges that this problem faces.
Summary of the invention
In view of this, the present invention provides a method for detecting transcription factor binding sites on chromosomes in tissue samples that solves, or partially solves, the above problems.
To achieve the effect of the above technical solution, the technical solution of the present invention is a method for detecting transcription factor binding sites on chromosomes in tissue samples, comprising the following steps:
Step 1: Data preprocessing:
First, the ChIP-seq data of the sample are read and aligned to the reference genome, and the characteristic peaks of transcription factor binding site enrichment and the position of each peak summit are found. Then, each peak is extended by 500 bp to the left and to the right of its summit; in the extended data, the center of each DNA sequence is the peak summit, and the DNA sequence length is 1002 bp. Finally, the DNA sequences are extracted and the repeated sequences among them are removed, giving the short DNA sequences;
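The window extraction and duplicate removal in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: read alignment and peak calling are assumed to have been done already by external tools, and the function name `extract_windows`, the in-memory `genome` dictionary and the 0-based summit coordinates are all hypothetical. Note that a symmetric window of 500 bases on each side of a single summit base yields 1001 bases, whereas the text states 1002 bp; the exact coordinate convention is not given in the source, so the sketch simply exposes `flank` as a parameter.

```python
def extract_windows(genome, summits, flank=500):
    """Cut a window of `flank` bases on each side of every peak summit.

    genome  : dict mapping chromosome name -> sequence string (assumed input)
    summits : iterable of (chromosome, 0-based summit index) pairs (assumed input)
    Windows that would run past either end of a chromosome are skipped.
    Repeated window sequences are dropped, keeping the first occurrence.
    """
    seen = set()
    windows = []
    for chrom, pos in summits:
        seq = genome[chrom]
        start, end = pos - flank, pos + flank + 1  # summit base sits in the middle
        if start < 0 or end > len(seq):
            continue
        window = seq[start:end].upper()
        if window not in seen:
            seen.add(window)
            windows.append(window)
    return windows
```

Keeping the first occurrence of each repeated window preserves the original peak order, which makes later subsequence numbering reproducible.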
Step 2: Splitting the short DNA sequences:
Each of the first N-4 bases of a short DNA sequence is taken in turn as a head base. The head base and the four consecutive bases after it form one subsequence, and the position of the head base in the short DNA sequence is the number of that subsequence, which is a positive integer. N is the number of bases in the short DNA sequence and is a positive integer. Each subsequence contains five bases, the head base being its first base, so a short DNA sequence can be divided into N-4 subsequences;
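The splitting in step 2 is a sliding window of width five with step one. A minimal sketch, assuming the short DNA sequence is held as a Python string (the function name `split_subsequences` is hypothetical):

```python
def split_subsequences(seq, k=5):
    """Return {number: subsequence}, numbered from 1 by head-base position.

    A sequence of N bases yields N - (k - 1) subsequences, i.e. N - 4 for k = 5.
    """
    return {i + 1: seq[i:i + k] for i in range(len(seq) - k + 1)}
```

For example, a 7-base sequence gives 7 - 4 = 3 subsequences numbered 1 to 3.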
Step 3: Mean detection:
The current base mean is calculated separately for each of the four bases, A (adenine), T (thymine), C (cytosine) and G (guanine):
(1) The base being calculated is the current base. Following the numbering of the subsequences, the number of times the current base occurs in each subsequence is counted in turn, giving the mean vector (y_1, y_2, …, y_{N-4}), where y is the number of times the current base occurs in a subsequence: y_1 is the number of occurrences in subsequence 1, y_2 is the number of occurrences in subsequence 2, and y_{N-4} is the number of occurrences in subsequence N-4;
(2) The number of elements of the mean vector whose value is greater than 3 is counted; this count is the current base mean;
Mean detection is applied to the current base means calculated for the four bases: if all four current base means lie in the range 0.8N to 1.2N, step 4 is carried out; otherwise detection ends, and the short DNA sequence is not a transcription factor binding site;
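A literal reading of step 3 can be sketched as follows. The function names are hypothetical, and the cut-off of 3 and the 0.8N to 1.2N range are exposed as parameters because the source does not say how they were chosen; the sketch only transcribes the rule as written.

```python
def base_count_vector(seq, base, k=5):
    """Mean vector (y_1, ..., y_{N-4}): occurrences of `base` in each subsequence."""
    return [seq[i:i + k].count(base) for i in range(len(seq) - k + 1)]


def current_base_mean(seq, base, k=5, min_count=3):
    """Number of mean-vector elements whose value is greater than `min_count`."""
    return sum(1 for y in base_count_vector(seq, base, k) if y > min_count)


def passes_mean_detection(seq, low=0.8, high=1.2):
    """True only if the current base mean of every base lies in [low*N, high*N]."""
    n = len(seq)
    return all(low * n <= current_base_mean(seq, b) <= high * n for b in "ATCG")
```

For the toy sequence `AAAAATTTTT`, the mean vector for A is `[5, 4, 3, 2, 1, 0]`, so the current base mean of A is 2.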
Step 4: Probability detection:
The current base probability is calculated separately for each of the four bases, using formula one:
Formula one:
where G is the current base probability, a dimensionless real number between 0 and 1; σ and μ are the unit-weight variance and the mean factor, real numbers between 0 and 5 that are set empirically by the person performing the detection; i is the number of the subsequence, and y_i is the number of times the current base occurs in subsequence i;
Probability detection is applied to the current base probabilities calculated for the four bases: if all four current base probabilities are less than 0.7, the short DNA sequence is not a transcription factor binding site; otherwise, the short DNA sequence is a transcription factor binding site.
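The final decision rule of step 4 can be sketched on its own. Formula one is not reproduced in this text, so the sketch below does not compute G; it assumes the four current base probabilities have already been obtained and are passed in as a dictionary. The function name `probability_detection` is hypothetical.

```python
def probability_detection(probabilities, threshold=0.7):
    """Apply the step 4 decision rule.

    probabilities : dict mapping each of "A", "T", "C", "G" to its current
                    base probability G in [0, 1], computed elsewhere.
    Returns False when all four values are below `threshold` (not a binding
    site) and True otherwise (binding site).
    """
    missing = set("ATCG") - set(probabilities)
    if missing:
        raise ValueError(f"missing probabilities for bases: {sorted(missing)}")
    return not all(probabilities[b] < threshold for b in "ATCG")
```

Under this rule a single base reaching the threshold is enough for the sequence to be reported as a binding site.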
The beneficial effects of the present invention are as follows. The invention provides a method for detecting transcription factor binding sites on chromosomes in tissue samples, comprising data preprocessing, splitting of short DNA sequences, mean detection and probability detection. Compared with existing detection algorithms, it improves the performance of transcription factor binding site identification on ChIP-seq data, takes less time, and can accurately identify both known and new transcription factor binding sites, providing a new technical means and an important tool for transcription factor research.
Detailed description of the embodiments
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, the invention is described in detail below with reference to embodiments. It should be noted that the specific embodiments described here serve only to explain the invention and do not limit it; products that achieve the same function are equivalent substitutions and improvements and all fall within the protection scope of the invention. The specific method is as follows:
Embodiment 1: This embodiment describes how transcription factor binding sites are represented, as follows:
The same transcription factor can bind to similar transcription factor binding sites, and there are three common ways of representing a transcription factor binding site:
(1) Consensus sequence
The sequences of identical or similar transcription factor binding sites are aligned by position, the base most likely to occur at each position is selected, and these bases, in position order, form the consensus sequence of the transcription factor binding site.
A DNA sequence is composed of the four bases A, G, C and T. In practice, at some positions of a transcription factor binding site two or more bases occur with similar or even exactly equal frequency. Selecting only the most probable base cannot fully reflect the conservation at such positions.
To represent the consensus sequence of a transcription factor binding site, IUPAC degenerate codes are therefore generally used: letters other than A, G, C and T represent combinations of two or more bases.
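As an illustration of the degenerate consensus representation, the sketch below builds an IUPAC consensus from a set of aligned binding-site sequences. The code table is the standard IUPAC nucleotide ambiguity code; the function name `degenerate_consensus` and the `min_fraction` cut-off for deciding which bases count as present at a position are assumptions made for this example, and the input is assumed to contain only the uppercase letters A, C, G and T.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degenerate_consensus(sites, min_fraction=0.25):
    """Build an IUPAC consensus from equal-length aligned site sequences."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        # Keep every base frequent enough, and always the most common one.
        kept = {b for b, c in counts.items() if c / len(column) >= min_fraction}
        kept.add(counts.most_common(1)[0][0])
        consensus.append(IUPAC[frozenset(kept)])
    return "".join(consensus)
```

With the four sites `ACGT`, `ACGA`, `AGGT` and `ACGA`, the second position mixes C and G and the fourth mixes A and T, so the consensus contains the degenerate letters S and W at those positions.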
(2) Probability matrix
The rows of the matrix represent the four bases, and the number of columns equals the sequence length of the transcription factor binding site. Each column gives the probability of each base at that position. Compared with the consensus sequence, the position probability matrix describes accurately how often the different bases occur at each position. The model assumes that the probabilities of the bases at different positions are mutually independent, and it depends on the sample size.
Studies have shown that the frequencies of the bases at different positions are not necessarily independent, and methods have been proposed that represent transcription factor binding sites by a position probability matrix combined with background information and the interdependence of the bases at each position. In addition, because DNA sequences may themselves be biased in base composition, the position probability matrix is usually converted into a position weight matrix. An element of the position probability matrix is the number of times a base occurs at a position divided by the total number of occurrences of all bases at that position. In the position weight matrix, the background sequence, i.e. the data from regions that are not transcription factor binding sites, is used to remove the influence of the base composition bias of the DNA sequence itself. In practice, to avoid the situation where some base has a count of 0 in some column, a pseudocount is usually added to the position count matrix.
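The conversion from aligned sites to a position probability matrix and then to a position weight matrix can be sketched as follows. The pseudocount handling follows the description above; the use of a base-2 log-odds ratio against the background is a common convention that is assumed here rather than stated in the source, and the function names and the uniform default background are likewise illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_probability_matrix(sites, pseudocount=1.0):
    """Return {base: [probability per position]} from aligned sites.

    A pseudocount is added to every cell of the count matrix so that no
    probability is exactly zero.
    """
    ppm = {b: [] for b in BASES}
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            ppm[b].append((counts[b] + pseudocount) / total)
    return ppm


def position_weight_matrix(ppm, background=None):
    """Convert probabilities to log2 odds against background base frequencies."""
    background = background or {b: 0.25 for b in BASES}
    return {b: [math.log2(p / background[b]) for p in ppm[b]] for b in BASES}
```

A position where every base is equally likely gets a weight of 0 for each base under a uniform background, while a strongly conserved base gets a positive weight.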
(3) Sequence logo
The probabilities of the different bases at the different positions are shown graphically. The total height of all bases at a position reflects how conserved that position is: the taller the stack, the more conserved the position. Each base letter is drawn in a different color, and the size of each letter relative to the other letters at a position is proportional to the frequency with which that base occurs at that position.
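One widely used way to turn this description into numbers is the information-content convention for sequence logos: the stack height at a position is 2 bits minus the Shannon entropy of the base distribution there, and each letter takes a share of the stack proportional to its frequency. The source does not name a specific formula, so the sketch below is an assumption, it omits any small-sample correction, and the function name `logo_heights` is hypothetical.

```python
import math


def logo_heights(ppm_columns):
    """Compute per-letter logo heights in bits.

    ppm_columns : list of dicts, one per position, mapping base -> probability.
    Stack height = 2 - entropy (bits); letter height = probability * stack height.
    """
    heights = []
    for col in ppm_columns:
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        info = 2.0 - entropy
        heights.append({b: p * info for b, p in col.items()})
    return heights
```

A fully conserved position yields a single letter 2 bits tall, and a position where all four bases are equally likely yields a stack of height 0.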
Embodiment 2: This embodiment describes the steps of the method for detecting transcription factor binding sites on chromosomes in tissue samples, as follows:
Step 1: Data preprocessing:
First, the ChIP-seq data of the sample are read and aligned to the reference genome, and the characteristic peaks of transcription factor binding site enrichment and the position of each peak summit are found. Then, each peak is extended by 500 bp to the left and to the right of its summit; in the extended data, the center of each DNA sequence is the peak summit, and the DNA sequence length is 1002 bp. Finally, the DNA sequences are extracted and the repeated sequences among them are removed, giving the short DNA sequences;
Step 2: Splitting the short DNA sequences:
Each of the first N-4 bases of a short DNA sequence is taken in turn as a head base. The head base and the four consecutive bases after it form one subsequence, and the position of the head base in the short DNA sequence is the number of that subsequence, which is a positive integer. N is the number of bases in the short DNA sequence and is a positive integer. Each subsequence contains five bases, the head base being its first base, so a short DNA sequence can be divided into N-4 subsequences;
Step 3: Mean detection:
The current base mean is calculated separately for each of the four bases, A (adenine), T (thymine), C (cytosine) and G (guanine):
(1) The base being calculated is the current base. Following the numbering of the subsequences, the number of times the current base occurs in each subsequence is counted in turn, giving the mean vector (y_1, y_2, …, y_{N-4}), where y is the number of times the current base occurs in a subsequence: y_1 is the number of occurrences in subsequence 1, y_2 is the number of occurrences in subsequence 2, and y_{N-4} is the number of occurrences in subsequence N-4;
(2) The number of elements of the mean vector whose value is greater than 3 is counted; this count is the current base mean;
Mean detection is applied to the current base means calculated for the four bases: if all four current base means lie in the range 0.8N to 1.2N, step 4 is carried out; otherwise detection ends, and the short DNA sequence is not a transcription factor binding site;
Step 4: Probability detection:
The current base probability is calculated separately for each of the four bases, using formula one:
Formula one:
where G is the current base probability, a dimensionless real number between 0 and 1; σ and μ are the unit-weight variance and the mean factor, real numbers between 0 and 5 that are set empirically by the person performing the detection; i is the number of the subsequence, and y_i is the number of times the current base occurs in subsequence i;
Probability detection is applied to the current base probabilities calculated for the four bases: if all four current base probabilities are less than 0.7, the short DNA sequence is not a transcription factor binding site; otherwise, the short DNA sequence is a transcription factor binding site.
The beneficial effects of the present invention are as follows. The invention provides a method for detecting transcription factor binding sites on chromosomes in tissue samples, comprising data preprocessing, splitting of short DNA sequences, mean detection and probability detection. Compared with existing detection algorithms, it improves the performance of transcription factor binding site identification on ChIP-seq data, takes less time, and can accurately identify both known and new transcription factor binding sites, providing a new technical means and an important tool for transcription factor research.
The foregoing describes only preferred embodiments of the invention and does not limit its claims. The above description can be understood and implemented by those skilled in the relevant art; other equivalent changes made on the basis of the disclosure of the invention should therefore fall within the scope of the claims.

Claims (1)

1. A method for detecting transcription factor binding sites on chromosomes in tissue samples, characterized in that it comprises the following steps:
Step 1: Data preprocessing:
First, the ChIP-seq data of the sample are read and aligned to the reference genome, and the characteristic peaks of transcription factor binding site enrichment and the position of each peak summit are found. Then, each peak is extended by 500 bp to the left and to the right of said peak summit; in the extended data, the center of each DNA sequence is said peak summit. Finally, said DNA sequences are extracted and the repeated sequences among them are removed, giving the short DNA sequences;
Step 2: Splitting the short DNA sequences:
Each of the first N-4 bases of said short DNA sequence is taken in turn as a head base. Said head base and the four consecutive bases after it form one subsequence, and the position of said head base in said short DNA sequence is the number of said subsequence, which is a positive integer. Said N is the number of bases in said short DNA sequence and is a positive integer. Said subsequence contains five bases, said head base being the first base of said subsequence, so said short DNA sequence can be divided into N-4 of said subsequences;
Step 3: Mean detection:
The current base mean is calculated separately for each of four bases, said four bases being A (adenine), T (thymine), C (cytosine) and G (guanine):
(1) The base being calculated is the current base. Following the numbering of said subsequences, the number of times the current base occurs in each of said subsequences is counted in turn, giving the mean vector (y_1, y_2, …, y_{N-4}), where y is the number of times said current base occurs in said subsequence: y_1 is the number of occurrences of said current base in subsequence 1, y_2 is the number of occurrences in subsequence 2, and y_{N-4} is the number of occurrences in subsequence N-4;
(2) The number of elements of said mean vector whose value is greater than 3 is counted; this count is the current base mean;
Mean detection is applied to said current base means calculated for said four bases: if all four of said current base means lie in the range 0.8N to 1.2N, step 4 is carried out; otherwise detection ends, and said short DNA sequence is not a transcription factor binding site;
Step 4: Probability detection:
The current base probability is calculated separately for each of said four bases, using formula one:
Formula one:
where G is said current base probability, a dimensionless real number between 0 and 1; σ and μ are the unit-weight variance and the mean factor, real numbers between 0 and 5 that are set empirically by the person performing the detection; i is the number of said subsequence, and y_i is the number of times said current base occurs in subsequence i;
Probability detection is applied to said current base probabilities calculated for said four bases: if all four of said current base probabilities are less than 0.7, said short DNA sequence is not said transcription factor binding site; otherwise, said short DNA sequence is said transcription factor binding site.
Application CN201711464358.XA, priority date 2017-12-28, filing date 2017-12-28: Detection method applied to Binding site for transcription factor on chromosome in tissue samples. Status: Pending. Publication: CN108052797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711464358.XA CN108052797A (en) 2017-12-28 2017-12-28 Detection method applied to Binding site for transcription factor on chromosome in tissue samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711464358.XA CN108052797A (en) 2017-12-28 2017-12-28 Detection method applied to Binding site for transcription factor on chromosome in tissue samples

Publications (1)

Publication Number Publication Date
CN108052797A (en) 2018-05-18

Family

ID=62128825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711464358.XA Pending CN108052797A (en) 2017-12-28 2017-12-28 Detection method applied to Binding site for transcription factor on chromosome in tissue samples

Country Status (1)

Country Link
CN (1) CN108052797A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103946396A (en) * 2011-10-31 2014-07-23 三星Sds株式会社 Method for sequence recombination and apparatus for ngs
CN103946396B (en) * 2011-10-31 2016-08-24 三星Sds株式会社 Sequence recombination method and device for next generation's order-checking
CN108154008A (en) * 2017-12-25 2018-06-12 上海嘉因生物科技有限公司 Detection method applied to Binding site for transcription factor on chromosome in tissue samples

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872782A (en) * 2019-02-20 2019-06-11 程俊美 A kind of system and method for handling for histological tissue specimen
CN110070908A (en) * 2019-03-11 2019-07-30 西安电子科技大学 A kind of die body searching method, device, equipment and the storage medium of binomial tree model
CN110070908B (en) * 2019-03-11 2021-08-13 西安电子科技大学 Motif searching method, device, equipment and storage medium of binomial tree model
CN111415704A (en) * 2020-05-18 2020-07-14 北京博安智联科技有限公司 STR gene data analysis method
CN111415704B (en) * 2020-05-18 2021-05-18 北京博安智联科技有限公司 STR gene data analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20180518