CN116343917B - Method for identifying transcription factor co-localization based on ATAC-seq footprint - Google Patents

Publication number
CN116343917B
Authority
CN
China
Prior art keywords
transcription factor
atac
localization
transcription
seq
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310326955.5A
Other languages
Chinese (zh)
Other versions
CN116343917A (en)
Inventor
刘利
韩璐
邹权
丁漪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202310326955.5A
Publication of CN116343917A
Application granted
Publication of CN116343917B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The scheme discloses a method for identifying transcription factor co-localization based on ATAC-seq footprints. ATAC-seq data are first preprocessed with the HINT-ATAC method; a transcription factor co-localization identification model based on the Poisson distribution is then constructed, and transcription factor pairs with cooperative binding and competitive binding are identified based on the distance ds. Because only ATAC-seq data are needed as input, the method provides a basis for further application to multiple cell lines.

Description

Method for identifying transcription factor co-localization based on ATAC-seq footprint
Technical Field
The scheme belongs to the technical field of computational biology and provides a method for identifying transcription factor co-localization based on ATAC-seq footprints.
Background
Transcription factors (TFs) are a class of sequence-specific DNA-binding proteins that bind to transcription factor binding site sequences upstream of a target gene and take part in regulating gene transcription, thereby ensuring that the target gene is expressed at a specific level at a specific time and place. In higher organisms, transcription factors generally regulate gene expression in combination, and most transcription factors must work together to complete transcription. It is therefore necessary to identify significantly co-localized transcription factors in genetic research and in genetic disease research. Existing methods for identifying transcription factor co-localization include statistical tests based on transcription factor ChIP-seq data and methods based on motif matching.
The ChIP-seq based method first collects ChIP-seq experimental data for all transcription factors of a cell line, determines the genome-wide binding sites of each transcription factor, and statistically tests the relationship between the binding sites of the transcription factors to obtain significantly co-localized transcription factor pairs.
Transcription factor motif matching methods typically use ATAC-seq sequencing data to obtain chromatin open regions, and perform transcription factor motif scanning on these regions to identify potential transcription factor binding sites, and then identify co-located transcription factor pairs by statistical analysis.
The ChIP-seq based approach requires a large amount of input data: analyzing multiple transcription factors in a cell line requires ChIP-seq experimental data for each of them, so hundreds or even thousands of transcription factor ChIP-seq experiments are needed for the cell line or tissue under study. Currently, this much experimental data is available only for a limited number of cell lines, so data acquisition is a limitation. By contrast, the motif matching-based method can analyze transcription factor co-localization using only one experimental data set (ATAC-seq), but it cannot distinguish overlapping localization from neighboring localization. It therefore cannot reflect whether transcription factors bind competitively or cooperatively, and this distinction is critical to understanding the molecular mechanisms of gene transcription regulation.
Disclosure of Invention
The objective of the present scheme is to provide a method for identifying transcription factor co-localization based on ATAC-seq footprints. The method preprocesses ATAC-seq data using the HINT-ATAC method, constructs a transcription factor co-localization identification model based on the Poisson distribution, and identifies cooperatively bound and competitively bound transcription factor pairs based on the distance ds. Because only ATAC-seq data are used as input, the method provides a basis for further application to multiple cell lines.
A method for identifying transcription factor co-localization based on ATAC-seq footprint, comprising:
S1, collecting and downloading ATAC-seq data of the target to be identified to acquire the original chromatin accessibility sequencing data; each cell line or tissue has one corresponding ATAC-seq data set, that is, only one ATAC-seq data set is required per cell line.
S2, based on the data file obtained in step S1, analyzing coordinate data of transcription factor footprints in the genome of the target to be identified by using a footprint analysis tool; ATAC-seq data are open-region data, and footprint analysis identifies which sites in the genome are transcription factor binding sites but does not determine which transcription factor binds each site.
S3, matching transcription factor motifs from a transcription factor database against the footprint coordinate data to obtain the specific transcription factor type of each binding site; the binding sites of a given transcription factor are thus obtained by footprint+motif.
The transcription factor database records transcription factors and binding sites and binding modes of different organisms;
S4, calculating the distance ds between every two transcription factors through a distance calculation tool;
determining the co-localization number k1 of the two transcription factors based on the calculated ds and the first distance threshold;
determining the co-localization number k2 of the two transcription factors based on the calculated ds and the second distance threshold;
S5, constructing a transcription factor co-localization identification model based on the Poisson distribution;
calculating the probability value P for each of the two cases k1 and k2 by using the transcription factor co-localization identification model;
screening, in each of the two cases, the significant transcription factor pairs according to the probability value P and the threshold P';
S6, under the second distance threshold condition, judging a significant transcription factor pair with k2 larger than the expected value to be a cooperatively bound transcription factor pair;
under the first distance threshold condition, judging a significant transcription factor pair with k1 larger than the expected value to be a competitively bound transcription factor pair;
and checking, against the competitively bound transcription factor pairs, whether each cooperatively bound transcription factor pair also belongs to competitive binding, and if so, judging the corresponding transcription factor pair to be both competitive and cooperative.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, in step S5, the significant transcription factor pairs include transcription factor pairs with significant co-localization and transcription factor pairs with significant rejection of co-localization.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, the transcription factor co-localization identification model based on the Poisson distribution is constructed as follows:
wherein k in formula (1) is the number of co-localizations of the two transcription factors within the threshold range, n and m are the respective localization numbers of the two transcription factors, N represents the total number of binding sites of the target to be identified, and λ is the expected value;
calculating by formula (1), under each of the two distance thresholds, the probability that each pair of transcription factors co-localizes or rejects co-localization;
and screening out the significant transcription factor pairs under each of the two distance thresholds based on the probability values.
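Formula (1) is not reproduced in this text. A minimal sketch of what it may look like is given here, assuming the standard Poisson probability mass function and assuming that the expected value is λ = mn/N; the text states only that λ is obtained from n, m and N, so the exact form of λ is an assumption:

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad \lambda = \frac{m\,n}{N} \tag{1}
```

Under this assumption, λ is the number of co-localizations expected if the m sites of one transcription factor and the n sites of the other were placed independently among the N binding regions.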
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, the distance between the nearest binding sites of two transcription factors is used as the distance between the two transcription factors. That is, a transcription factor may have multiple binding sites in a cell line, so multiple distances may exist between two transcription factors; the distance to the nearest site is taken as the distance between them.
In the above-described method for identifying the co-location of transcription factors based on the ATAC-seq footprint, the number of locations of each transcription factor is determined based on the coordinates of the transcription factor.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, two distance thresholds are set, the first distance threshold is ds=0 and the second distance threshold is 0< ds <150.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, in step S1, the target to be identified is a target cell line or a target tissue.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, in step S2, the footprint analysis tool is HINT-ATAC tool, and in step S1, an ATAC-seq narrow peak format file compatible with HINT-ATAC tool is obtained.
In the above method for identifying transcription factor co-localization based on ATAC-seq footprint, in step S3, the matching is implemented using the motif-analysis module of HINT-ATAC software.
In the above method for identifying the co-localization of transcription factors based on the ATAC-seq footprint, in step S3, the transcription factor database is a JASPAR database.
The advantages of this scheme are:
(1) The invention provides a method for identifying transcription factor co-localization based on ATAC-seq footprints, which identifies transcription factor co-localization using only ATAC-seq data as input, can be applied to multiple cell lines, and provides methodological support for further exploring how transcription factors interact with DNA in combination;
(2) The invention calculates a co-localization P-value matrix based on a Poisson distribution background model and identifies statistically significant transcription factor co-localization, eliminating the influence of random background from a statistical perspective and identifying transcription factor co-localization effectively;
(3) The invention uses high-resolution footprint data, which effectively improves the accuracy of transcription factor binding site recognition;
(4) By setting two thresholds, the invention distinguishes overlapping localization from neighboring localization among the significant transcription factor pairs screened by the Poisson distribution, so that co-localized transcription factor pairs can be identified and further classified as competitively or cooperatively bound, which is of great significance for studying the molecular mechanisms of transcription factor regulation.
Drawings
FIG. 1 is a flow chart of a method for identifying transcription factor co-localization based on an ATAC-seq footprint provided by an embodiment of the present invention;
FIG. 2 is a heat map of a P-value matrix generated by clustering transcription factors when a distance threshold is ds=0 bp according to an embodiment of the present invention;
FIG. 3 is a P-value matrix heat map generated by clustering transcription factors when the distance threshold is 0< ds <150bp, provided by the embodiment of the invention;
FIG. 4A is a heat map of a P-value matrix arranged alphabetically by transcription factor name when the distance threshold is ds=0bp according to the embodiment of the present invention;
FIG. 4B is a heat map of a P-value matrix arranged alphabetically by transcription factor name when the distance threshold is 0bp < ds <150 bp according to the embodiment of the present invention;
FIG. 4C is a heat map of the co-localization P-value matrix of the FOS_JUN family transcription factors in FIG. 4A;
FIG. 4D is a heat map of the co-localization P-value matrix of the FOS_JUN family transcription factors in FIG. 4B;
FIG. 4E is a diagram of the information content logo of the motif locus of the FOS_JUN family;
FIG. 5A is a heat map of a KLF family transcription factor co-localization P-value matrix when the distance threshold provided by the embodiment of the invention is ds=0 bp;
FIG. 5B is a heat map of a KLF family transcription factor co-localization P-value matrix when 0bp < ds <150bp provided by the example of the present invention;
FIGS. 6A-6D are graphs comparing the peak length distributions of four data types, ChIP-seq, ChIP-exo, footprint and ATAC-seq, respectively;
FIG. 7A is a Venn diagram comparing ChIP-seq+motif with ChIP-exo;
FIG. 7B is a Venn diagram comparing footprint+motif with ChIP-exo.
Detailed Description
The present invention is described in further detail below with reference to the drawings and detailed description.
This example shows a method for identifying transcription factor co-localization based on the ATAC-seq footprint, as shown in FIG. 1, comprising the following steps:
S1, collecting and downloading ATAC-seq data of the target to be identified, such as the K562 cell line, to obtain the original chromatin accessibility sequencing data. ENCODE provides ATAC-seq sequencing data for 370 tissues or cell lines, and this embodiment obtains the ATAC-seq data for the K562 cell line from ENCODE.
S2, according to the downloaded ATAC-seq narrow peak format file, coordinate data of a transcription factor footprint (footprint) in a genome is obtained by using a HINT-ATAC tool.
HINT-ATAC is an open-source tool in RGT (Regulatory Genomics Toolbox), an open-source library that runs under Linux. HINT-ATAC performs footprint analysis, which yields the coordinate data of footprints in the genome and thus the genome-wide binding of transcription factors. The analysis itself follows the prior art and is not repeated here.
At this point, the sites in the genome that are transcription factor binding sites are known, but not which transcription factor binds each of them.
S3, matching the above footprint positions against the transcription factor motifs of the JASPAR database. Specifically, the footprint coordinate data are matched against JASPAR transcription factor motifs to determine the specific transcription factor type of each binding site; the matching is performed with the motif-analysis module of the HINT-ATAC software, and the threshold may be set to 0.0001.
S4, arranging the matched transcription factors by name so that transcription factor families can be distinguished quickly; for the K562 cell line, a total of 633 transcription factors were matched;
S5, after motif matching, the number of matches of each transcription factor in the footprint regions is obtained, that is, the localization numbers (m, n) of every two transcription factors;
then, taking the motif coordinate data of each transcription factor matched within the footprints as input, the co-localization number of the two transcription factors is calculated using bedtools and denoted k.
Motif matching yields the coordinates and the number of matches of each transcription factor, so the numbers of matches m and n are available for every two transcription factors.
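As an illustration of how the localization numbers m and n could be tallied, the following Python sketch counts matched sites per transcription factor. It assumes, hypothetically, that the motif matches are available as BED-like rows whose fourth column holds the transcription factor name; the function name `count_sites` and the example rows are made up for this sketch and are not taken from the HINT-ATAC output format.

```python
from collections import Counter

def count_sites(bed_lines):
    """Count matched sites per transcription factor from BED-like rows.

    Assumed (hypothetical) row layout: chrom, start, end, TF name.
    """
    counts = Counter()
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, name = line.split()[:4]
        counts[name] += 1
    return counts

# Toy input: two FOS matches and one KLF9 match.
example = [
    "chr1\t100\t112\tFOS",
    "chr1\t300\t312\tFOS",
    "chr1\t305\t315\tKLF9",
]
site_counts = count_sites(example)
m, n = site_counts["FOS"], site_counts["KLF9"]
print(m, n)
```
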
The calculation of the co-localization number k of each two transcription factors specifically comprises the following steps:
S5-1, obtaining the distance ds between the two nearest sites of two transcription factors by using bedtools closest -d;
Let TFA and TFB be any two transcription factors. TFA is the site information file of transcription factor A, with three columns: chromosome, start site and stop site; TFB is the corresponding file for transcription factor B. After the two position files are processed by bedtools closest, the TFB site nearest to each TFA site is found and a new file with 7 columns is generated: the first three columns are the TFA position, the next three are the TFB position, and the 7th column is the nearest distance, namely ds.
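The nearest-site search can be sketched in pure Python as follows. This is a simplified stand-in for the bedtools step, not a reimplementation of it: the function names are hypothetical, the search is brute force, sites are assumed to be half-open (start, end) intervals, and treating directly adjacent intervals as 1 bp apart is an assumption of this sketch that should be checked against the distance convention of the bedtools version in use.

```python
def gap(a, b):
    """Distance in bp between two half-open (start, end) intervals; 0 if they overlap."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start < b_end and b_start < a_end:
        return 0
    # Assumption of this sketch: directly adjacent intervals are 1 bp apart.
    return max(a_start, b_start) - min(a_end, b_end) + 1

def closest_distances(sites_a, sites_b):
    """For each TF A site, return ds to the nearest TF B site on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in sites_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    result = []
    for chrom, start, end in sites_a:
        candidates = by_chrom.get(chrom)
        if not candidates:
            continue  # this sketch skips sites with no TF B site on the chromosome
        result.append(min(gap((start, end), b) for b in candidates))
    return result

# Toy coordinates, not real binding sites.
tfa = [("chr1", 100, 112), ("chr1", 500, 512), ("chr2", 50, 60)]
tfb = [("chr1", 105, 115), ("chr1", 560, 570)]
ds_values = closest_distances(tfa, tfb)
print(ds_values)
```

For genome-scale data a sorted sweep or an interval index would be needed; the quadratic search is kept here only for readability.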
S5-2, dividing ds into two cases of ds=0 and 0< ds <150, setting the two thresholds, and calculating k values respectively; ds=0 indicates that there is overlap between the two transcription factor sites.
ds=0 and 0< ds <150 are empirical thresholds chosen according to the length of most open regions, where ds=0 is the first distance threshold and 0< ds <150 is the second distance threshold. In practical applications, those skilled in the art may use other values for the first and second distance thresholds.
Depending on the two distance thresholds, the number of co-locations of the two transcription factors may vary, so that each two transcription factors will have two k values, one for the first distance threshold and one for the second distance threshold.
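The two counts can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `count_colocalizations` and the example ds values are hypothetical, and the 150 bp limit is the empirical threshold given in the text.

```python
def count_colocalizations(ds_values, max_gap=150):
    """Return (k1, k2): counts of ds == 0 and of 0 < ds < max_gap."""
    k1 = sum(1 for ds in ds_values if ds == 0)           # overlapping sites
    k2 = sum(1 for ds in ds_values if 0 < ds < max_gap)  # neighboring sites
    return k1, k2

# Toy distances: two overlaps, two neighbors, two sites that are too far apart.
k1, k2 = count_colocalizations([0, 0, 12, 149, 150, 2000])
print(k1, k2)
```

Note that a distance of exactly 150 falls outside both cases because the second threshold is a strict inequality.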
S5-3, forming two 633 × 633 k-value matrices from the pairwise calculation over the 633 transcription factors; k is the number of co-localizations that satisfy the distance threshold, and whether a pair co-localizes can be judged by comparing k with the expected value λ;
S6, obtaining from the m, n and k values a matrix of P values for the pairwise co-localization of the transcription factors based on the Poisson distribution, and judging from the P values whether the co-localization is significant. That is, a P-value matrix is obtained in each of the two cases, one with the distance threshold ds=0 and one with the distance threshold 0< ds <150.
The Poisson distribution is calculated as follows:
wherein k in formula (1) is the number of co-localizations of two transcription factors within a distance threshold, n and m are the respective localization numbers of the two transcription factors, and N is the total number of binding sites, that is, the number of regions available for transcription factor (TF) binding in the whole genome. N is determined from the ATAC-seq and footprint data; for example, if the 633 transcription factors of the K562 cell line are matched in 269997 footprint regions, N is 269997. λ is the expected value, obtained from n, m and N.
In a two-tailed Poisson test, a low P value indicates that the localization of two transcription factors in the genome is non-random. It can reflect two kinds of significance: significant co-localization or significant rejection of co-localization. In FIG. 1, for example, the left shaded portion corresponds to significant rejection of co-localization and the right shaded portion to significant co-localization. The threshold P' that separates significant from non-significant is set empirically by those skilled in the art, for example as a fixed value such as 0.01, or as a proportion of all pairs, such as 1% of the total. In this way the significant transcription factor pairs are screened out; a screened pair may show either significant co-localization or significant rejection of co-localization.
The screened transcription factor pairs are then further judged by comparing k with the expected value λ: if k is higher than the expected value, the two transcription factors (TFs) are considered to tend to co-localize on the genome, and if k is lower than or equal to the expected value, the two TFs are considered to tend to reject co-localization on the genome. In this way, two results are obtained, one for each of the distance thresholds ds=0 and 0< ds <150, and each result contains transcription factor pairs judged to co-localize and transcription factor pairs judged to reject co-localization.
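The screening and judging steps can be sketched in Python as follows. Several choices here are assumptions rather than the patent's stated method: the expected value is taken as λ = mn/N, the two-tailed P value is taken as twice the smaller tail probability (capped at 1), and the function names and the example counts m = 2000 and n = 3000 are hypothetical. Only N = 269997 and the example threshold P' = 0.01 come from the text.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def upper_tail(k, lam, max_terms=10_000):
    """P(X >= k), summed directly so that very small tails keep precision."""
    total = 0.0
    for i in range(k, k + max_terms):
        term = poisson_pmf(i, lam)
        total += term
        if i > lam and term < total * 1e-17:
            break  # remaining terms no longer change the sum
    return total

def colocalization_test(k, m, n, total_sites, p_threshold=0.01):
    """Return (lam, p_value, call) for one transcription factor pair."""
    lam = m * n / total_sites  # assumed form of the expected value
    if k > lam:
        tail = upper_tail(k, lam)
        call = "co-localized"
    else:
        tail = sum(poisson_pmf(i, lam) for i in range(k + 1))  # P(X <= k)
        call = "rejected"
    p_value = min(1.0, 2.0 * tail)  # assumed two-tailed rule
    if p_value >= p_threshold:
        call = "not significant"
    return lam, p_value, call

# Hypothetical site counts with the N value quoted for K562.
results = {k: colocalization_test(k, 2000, 3000, 269997) for k in (5, 22, 60)}
for k, (lam, p, call) in results.items():
    print(k, round(lam, 2), p, call)
```

With these toy numbers the expected value is about 22.2, so k = 60 is called co-localized, k = 5 is called rejected, and k = 22 is not significant.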
S7, screening the transcription factor pairs judged to be co-localized at the threshold 0< ds <150 to obtain the cooperatively bound transcription factor pairs;
screening the transcription factor pairs judged to be co-localized at the threshold ds=0 to obtain the competitively bound transcription factor pairs;
and checking, against the competitively bound transcription factor pairs, whether each cooperatively bound transcription factor pair also belongs to competitive binding; if so, the corresponding pair is judged to be both competitive and cooperative and is removed from the competitively bound transcription factor pairs and from the cooperatively bound transcription factor pairs.
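The three-way split in step S7 amounts to set operations, as the following Python sketch shows. The function name `classify_pairs` and the input lists are hypothetical; pairs are stored as unordered sets so that (A, B) and (B, A) count as the same pair.

```python
def classify_pairs(coloc_overlap, coloc_neighbor):
    """Split significantly co-localized pairs into three groups.

    coloc_overlap:  pairs co-localized at ds = 0
    coloc_neighbor: pairs co-localized at 0 < ds < 150
    Returns (cooperative_only, competitive_only, both).
    """
    competitive = {frozenset(pair) for pair in coloc_overlap}
    cooperative = {frozenset(pair) for pair in coloc_neighbor}
    both = competitive & cooperative
    return cooperative - both, competitive - both, both

# Toy input lists, not results from the embodiment.
overlap_pairs = [("KLF10", "KLF9"), ("KLF11", "KLF9"), ("KLF4", "KLF5")]
neighbor_pairs = [("KLF9", "KLF10"), ("KLF9", "KLF11"), ("FOS", "FOSL1")]
cooperative_only, competitive_only, both = classify_pairs(overlap_pairs, neighbor_pairs)
print(sorted(sorted(p) for p in both))
```
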
Thus, cooperatively bound transcription factor pairs, competitively bound transcription factor pairs, and transcription factor pairs that are both competitively and cooperatively bound are obtained. That is, the scheme uses the higher-resolution footprint+motif approach and combines the statistical Poisson distribution with two distance thresholds, so that co-localized transcription factor pairs can be screened more accurately and can also be classified as competitively or cooperatively bound. The method helps in understanding the molecular mechanisms of gene transcription regulation and assists genetic research on cell lines and tissues.
Fig. 2 shows the P-value matrix heat map generated by clustering the transcription factors when the distance threshold is ds=0 bp, where +30 in the upper right corner indicates significant co-localization and -30 indicates significant rejection of co-localization. The original image is in color: significant co-localization is blue, significant rejection of co-localization is red, and the less significant a pair is, the lighter its color. Blue is concentrated mainly along the diagonal, with some blue elsewhere, indicating that co-localized TF pairs lie mainly on the diagonal but also occur elsewhere. After grayscale conversion, the clearly black clusters in Fig. 2 represent significant co-localization, dark gray represents significant rejection of co-localization, and white and light gray are less significant TF pairs. The diagonal itself corresponds to a transcription factor competing with itself (the same transcription factor competes for the same site), and the clusters along the diagonal mostly belong to the same gene family, with the same or similar motifs.
FIG. 3 shows the P-value matrix heat map generated by clustering the transcription factors when the distance threshold is 0< ds <150 bp, where +30 in the upper right corner indicates significant co-localization and -30 indicates significant rejection of co-localization. The original image is in color, with significant co-localization in blue and significant rejection in red, and shows the cooperatively bound transcription factor pairs. In this figure most TF pairs are bluish and a small number are reddish, indicating that most TF pairs tend to co-localize. After grayscale conversion the differences are hard to see because of the large number of transcription factors involved (633 on the abscissa and 633 on the ordinate), so the FOS family and the KLF family are described in detail below for better understanding.
Fig. 4A and 4B are heat maps of the P-value matrices arranged alphabetically by transcription factor name when the distance threshold is ds=0 bp and 0bp < ds <150bp, respectively. The original images are again in color, with significant co-localization in blue and significant rejection in red. Fig. 4A and 4B are similar to Fig. 2 and 3; the only difference is whether the transcription factors are clustered or arranged alphabetically. They also remain unclear after grayscale conversion, so the details are described below taking the FOS family and the KLF family as examples.
FIGS. 4C and 4D are diagrams showing an example of the FOS_JUN family, FIG. 4C is a heat map of a P-value matrix in FIG. 4A in which FOS_JUN family transcription factors are co-located, and FIG. 4D is a heat map of a P-value matrix in FIG. 4B in which FOS_JUN family transcription factors are co-located. For transcription factors of the same family, the transcription factors have the same or similar motif. FIG. 4E is a plot of FOS_JUN family motif site information content logo, showing that the family motifs of significantly co-located transcription factor pairs are similar.
For clearer representation, in Fig. 4C circles mark bluish boxes and dots mark reddish boxes; the lighter the color, the weaker the significance. Unlabeled black boxes are significantly co-localized (blue in the original), and white indicates no significance (values close to 0). In Fig. 4D, boxes marked with circles are bluish, and the remaining boxes are reddish or not significant.
FIG. 4D is the P-value matrix heat map at the threshold 0< ds <150, in which the darker circled boxes represent significant co-localization. As FIG. 4D shows, there are few such boxes, indicating that this family has only a small number of cooperatively bound transcription factor pairs, such as FOSB::JUNB and FOS, FOSB::JUNB(var.2) and FOS, FOS::JUN(var.2), FOSL1, etc.
Fig. 4C is the P-value matrix heat map at the threshold ds=0, in which darker boxes marked with dots would indicate significant rejection of co-localization; Fig. 4C has no such boxes, that is, no transcription factor pair significantly rejects co-localization. The unlabeled black boxes and the darker-colored boxes indicate significant co-localization; Fig. 4C has many such boxes, indicating that this family has many competitively bound transcription factor pairs, such as FOSB::JUN, FOSB::JUN and FOSL2::JUND, etc.
Based on the competitively and cooperatively bound transcription factor pairs, pairs that are both competitively and cooperatively bound can then be identified; as FIGS. 4C and 4D show, this family has no such transcription factor pairs.
Fig. 5A and 5B are heat maps of the co-localization P-value matrices of the KLF family transcription factors when the distance threshold is ds=0 bp and 0bp < ds <150bp, respectively; Fig. 5A corresponds to the KLF family transcription factors in Fig. 4A and Fig. 5B to those in Fig. 4B. All boxes in Fig. 5A are dark blue, indicating significant co-localization at the distance threshold ds=0, so every transcription factor pair is judged to be competitively bound. In Fig. 5B, reddish boxes, i.e. rejection of co-localization, are marked with dots, and the remaining unlabeled boxes are bluish, i.e. co-localization. In Fig. 5B, the darker unlabeled boxes (significantly co-localized transcription factor pairs) are screened as cooperatively bound transcription factor pairs. In addition, some transcription factor pairs judged here to be cooperatively bound also show significant co-localization at the threshold ds=0 bp, that is, they are also judged to be competitively bound; such pairs, for example KLF10 and KLF9 or KLF11 and KLF9, are judged to be both competitively and cooperatively bound, which shows that the method can distinguish competitive binding from cooperative binding.
The scheme identifies transcription factor binding sites in the footprint+motif manner, whereas the prior art identifies them in the ChIP-seq+motif manner. ChIP-seq and ChIP-exo both locate the binding sites of a specific transcription factor in the genome, but ChIP-exo has higher resolution and locates sites more accurately. To improve the resolution of ChIP-seq, the prior art usually combines ChIP-seq with motifs; this identifies transcription factor binding sites but requires a large amount of data.
This example uses higher resolution ChIP-exo data as a gold standard to compare and verify whether footprint+motif has the same efficacy as conventional ChIP-seq+motif in identifying transcription factor binding sites.
Figs. 6A-6D compare the peak length distributions of four types of data, namely ChIP-seq, ChIP-exo, footprint, and ATAC-seq: Fig. 6A shows the length distribution of the ChIP-seq data, Fig. 6B that of the ChIP-exo data, Fig. 6C that of the footprint data, and Fig. 6D that of the ATAC-seq data. The footprint data has the highest resolution, about 15 bp, compared with about 50 bp and about 250 bp for the other data, so the method adopted in this scheme is advantageous in terms of resolution.
The ChIP-exo data, which has a higher resolution than ChIP-seq, is further used here as a gold standard to compare the footprint+motif data with the ChIP-seq+motif data. For the ChIP-exo data, the raw reads were aligned to the reference genome using bwa with default parameters. Because samtools rmdup performs poorly on paired-end sequencing data, PCR duplicates were removed with the MarkDuplicates function of the Picard tool using the 'REMOVE_DUPLICATES=true' option. The peaks and peak summits formed in the genome after alignment were determined with MACS2 at a P-value threshold of 0.001. The ChIP-exo data file produced by peak calling is in narrowPeak format.
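The processing steps above can be sketched as a list of command lines. This is a hedged illustration: the text only states that bwa was run with default parameters, so the `bwa mem` subcommand, the coordinate sort with samtools, the `picard` wrapper name, the paired-end `-f BAMPE` setting, and all file names are assumptions introduced here. The helper only builds the commands and does not run them.

```python
from typing import List


def chip_exo_commands(ref: str, r1: str, r2: str, prefix: str,
                      p_cutoff: float = 0.001) -> List[List[str]]:
    """Build (but do not run) the command lines for the ChIP-exo workflow.

    All file names derive from the hypothetical `prefix` argument.
    """
    sam = f"{prefix}.sam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    return [
        # Alignment with default parameters. bwa mem writes SAM to stdout,
        # so redirect its output to `sam` when actually running it.
        ["bwa", "mem", ref, r1, r2],
        # Coordinate-sort the alignments before duplicate removal.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Remove PCR duplicates with Picard MarkDuplicates.
        ["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
         f"M={prefix}.dup_metrics.txt", "REMOVE_DUPLICATES=true"],
        # Call peaks and summits at the stated P-value threshold.
        ["macs2", "callpeak", "-t", dedup_bam, "-f", "BAMPE",
         "-n", prefix, "-p", str(p_cutoff)],
    ]
```

Each inner list could be passed to `subprocess.run` once the tools are installed and the reference has been indexed with `bwa index`. Keeping the commands as data makes the workflow easy to inspect before anything is executed.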
Figs. 7A and 7B are Venn diagrams comparing ChIP-seq+motif and footprint+motif, respectively, with ChIP-exo. The proportion of the overlapping portion for each method is marked as a percentage. With ChIP-exo as the gold standard for transcription factor binding sites, footprint+motif performs comparably to ChIP-seq+motif. This indicates that the footprint+motif data of this scheme, which requires only a single ATAC-seq dataset, can be used in place of ChIP-seq+motif, which requires a large amount of data, to identify potential transcription factor binding sites.
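The overlap percentages in such a Venn comparison can be computed along the following lines. This is a sketch under stated assumptions: the text does not define how overlap was measured, so sharing at least one base pair with a gold-standard peak is assumed here, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open as in BED


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base pair."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_supported(sites: List[Interval], gold: List[Interval]) -> float:
    """Fraction of predicted sites that overlap any gold-standard peak."""
    if not sites:
        return 0.0
    hits = sum(any(overlaps(s, g) for g in gold) for s in sites)
    return hits / len(sites)
```

Calling `fraction_supported` once with the footprint+motif sites and once with the ChIP-seq+motif sites, each against the ChIP-exo peaks, gives the two proportions being compared. The quadratic scan is adequate for a small illustration; genome-scale data would call for sorted intervals or an interval tree.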
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the present solution. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions in a similar manner without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A method for identifying transcription factor co-localization based on ATAC-seq footprint, comprising:
S1, collecting and downloading ATAC-seq data of a target to be identified, so as to acquire raw chromatin accessibility sequencing data;
S2, based on the data file obtained in step S1, using a footprint analysis tool to obtain coordinate data of the transcription factor footprints in the genome of the target to be identified;
S3, matching the transcription factor motifs of a transcription factor database against the footprint coordinate data to obtain the specific transcription factor type at each binding site;
the transcription factor database records the transcription factors of different organisms together with their binding sites and binding modes;
S4, calculating the distance ds between every two transcription factors using a distance calculation tool;
determining the co-localization number k1 of the two transcription factors based on the calculated ds and a first distance threshold;
determining the co-localization number k2 of the two transcription factors based on the calculated ds and a second distance threshold;
S5, constructing a model for identifying transcription factor co-localization based on the Poisson distribution;
using the model for identifying transcription factor co-localization to calculate the probability value P in each of the two cases k1 and k2;
screening out the significant transcription factor pairs in each of the two cases according to the probability value P and a threshold P';
S6, for the second distance threshold, judging a significant transcription factor pair whose k2 is greater than the expected value to be a synergistically bound transcription factor pair;
for the first distance threshold, judging a significant transcription factor pair whose k1 is greater than the expected value to be a competitively bound transcription factor pair;
and determining, with reference to the competitively bound transcription factor pairs, whether each synergistically bound transcription factor pair is also competitively bound, and if so, judging the corresponding transcription factor pair to be both competitively and synergistically bound.
2. The method for identifying transcription factor co-localization based on ATAC-seq footprint of claim 1, wherein in step S5, the significant transcription factor pairs comprise transcription factor pairs that are significantly co-localized and transcription factor pairs for which co-localization is significantly rejected.
3. The method for identifying transcription factor co-localization based on ATAC-seq footprint of claim 2, wherein the model for identifying transcription factor co-localization constructed based on the Poisson distribution is:
wherein, in formula (1), k is the co-localization number of the two transcription factors within the threshold range, n and m are the respective localization numbers of the two transcription factors, N is the total number of binding sites of the target to be identified, and λ is the expected value;
the probability that each transcription factor pair is judged to be co-localized or to reject co-localization is calculated by formula (1) under each of the two distance thresholds;
and the transcription factor pairs with significance are screened out under each of the two distance thresholds based on the probability values.
4. The method for identifying co-localization of transcription factors based on ATAC-seq footprint according to claim 3, wherein the distance between two transcription factors is used as the distance between the two transcription factors.
5. The method for identifying transcription factor co-localization based on ATAC-seq footprint according to claim 4, wherein, under each threshold, the localization number of each transcription factor is determined by the number of coordinates it matches.
6. The method of claim 5, wherein two distance thresholds are set, the first distance threshold being ds = 0 bp and the second distance threshold being 0 bp < ds < 150 bp.
7. The method for identifying transcription factor co-localization based on ATAC-seq footprint according to claim 2, wherein in step S1, the target to be identified is a target cell line or a target tissue.
8. The method for identifying transcription factor co-localization based on ATAC-seq footprint according to claim 2, wherein in step S2 the footprint analysis tool used is the HINT-ATAC tool, and the data obtained in step S1 is an ATAC-seq file in narrowPeak format compatible with the HINT-ATAC tool.
9. The method for identifying transcription factor co-localization based on the ATAC-seq footprint according to claim 8, wherein in step S3, the matching is performed using a motif-analysis module of HINT-ATAC software.
10. The method for identifying co-localization of transcription factors based on the ATAC-seq footprint according to claim 9, wherein in step S3, the transcription factor database is a JASPAR database.
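Steps S4 and S5 of the claims can be illustrated with the following sketch. It is not the patented formula (1), which is not reproduced in this text. Several choices here are assumptions: the distance ds is taken as the gap in base pairs between two sites (0 when they overlap or touch), each site of the first factor is paired with its nearest site of the second factor, the expected value is taken as λ = n·m/N, and the probability is taken as a Poisson tail. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Site = Tuple[str, int, int]  # (chromosome, start, end), half-open as in BED


def gap(a: Site, b: Site) -> Optional[int]:
    """Base pairs between two sites; 0 if they overlap or touch; None across chromosomes."""
    if a[0] != b[0]:
        return None
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def co_localization_counts(a_sites: List[Site], b_sites: List[Site],
                           max_gap: int = 150) -> Tuple[int, int]:
    """Return (k1, k2): sites of A whose nearest B site has ds == 0, or 0 < ds < max_gap."""
    k1 = k2 = 0
    for a in a_sites:
        gaps = [g for g in (gap(a, b) for b in b_sites) if g is not None]
        if not gaps:
            continue  # no site of B on this chromosome
        ds = min(gaps)
        if ds == 0:
            k1 += 1
        elif ds < max_gap:
            k2 += 1
    return k1, k2


def expected_count(n: int, m: int, total: int) -> float:
    """Assumed expected value: lambda = n * m / N."""
    return n * m / total


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam); a small value suggests co-localization."""
    term = math.exp(-lam)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term          # accumulate P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - below)


def poisson_lower_tail(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam); a small value suggests rejected co-localization."""
    term = math.exp(-lam)
    total = 0.0
    for i in range(k + 1):
        total += term
        term *= lam / (i + 1)
    return min(1.0, total)
```

The same upper tail is available as `scipy.stats.poisson.sf(k - 1, lam)`, which is preferable for large λ because `math.exp(-lam)` underflows to zero there. The pure-Python version is kept only so that the sketch has no dependencies.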
CN202310326955.5A 2023-03-22 2023-03-22 Method for identifying transcription factor co-localization based on ATAC-seq footprint Active CN116343917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310326955.5A CN116343917B (en) 2023-03-22 2023-03-22 Method for identifying transcription factor co-localization based on ATAC-seq footprint

Publications (2)

Publication Number Publication Date
CN116343917A (en) 2023-06-27
CN116343917B (en) 2023-11-10

Family

ID=86891067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310326955.5A Active CN116343917B (en) 2023-03-22 2023-03-22 Method for identifying transcription factor co-localization based on ATAC-seq footprint

Country Status (1)

Country Link
CN (1) CN116343917B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101918578A (en) * 2007-10-27 2010-12-15 Od260公司 Promoter detection and analysis
CN107368701A (en) * 2017-07-31 2017-11-21 浙江绍兴千寻生物科技有限公司 In high volume unicellular ATAC seq data quality controls and analysis method
WO2022147296A1 (en) * 2020-12-30 2022-07-07 10X Genomics, Inc. Cleavage of capture probes for spatial analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017075294A1 (en) * 2015-10-28 2017-05-04 The Board Institute Inc. Assays for massively combinatorial perturbation profiling and cellular circuit reconstruction
US20180016314A1 (en) * 2016-07-12 2018-01-18 Children's Hospital Medical Center Treatment of disease via transcription factor modulation
AU2019398351A1 (en) * 2018-12-14 2021-06-03 Pioneer Hi-Bred International, Inc. Novel CRISPR-Cas systems for genome editing

Also Published As

Publication number Publication date
CN116343917A (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant