CN109033751B

CN109033751B - Function prediction method for non-coding region mononucleotide genome variation

Info

Publication number: CN109033751B
Application number: CN201810804405.9A
Authority: CN
Inventors: 刘宏德; 孙啸; 罗坤; 马伟恒
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2018-07-20
Filing date: 2018-07-20
Publication date: 2021-07-27
Anticipated expiration: 2038-07-20
Also published as: CN109033751A

Abstract

The invention discloses a method for predicting the function of single-nucleotide genomic variations in non-coding regions, comprising the following steps: 1) chromatin open region identification; 2) transcription factor binding site identification; 3) evaluation of the effect of single-nucleotide variation: based on the site-specific frequency matrix of a transcription factor, calculating the influence of single-nucleotide variations located in the transcription factor binding site region on the binding of the transcription factor, and identifying the single-nucleotide variations that significantly change the binding capacity of the transcription factor; the effect of the single-nucleotide variations is further assessed by examining the biological pathways of the transcription factor's target genes. Using chromatin open region information and gene expression information, the method identifies multiple transcription factors and their binding sites in a single analysis and provides functional annotation of genomic variations in non-coding regions.

Description

Function prediction method for non-coding region mononucleotide genome variation

Technical Field

The invention belongs to the technical field of genetics, and particularly relates to a method for predicting the function of single-nucleotide genomic variations in non-coding regions. It comprises a method for identifying transcription factors (TF) and their binding sites in a eukaryotic genome based on high-throughput sequencing information of chromatin open regions, and a method for evaluating the influence of single-nucleotide variations based on the motif of the DNA bound by each transcription factor.

Background

All biological functions and characteristics of a cell are related to the transcriptional regulation of its genes. Transcriptional regulation is cell-type specific and closely related to differentiation and carcinogenesis, and this cell-type specificity is a key to understanding cellular behavior and solving the cancer problem. A primary task in analyzing transcriptional regulation is to identify the binding sites (TFBS) of the various transcription factors (TF) on genomic DNA, i.e. to determine which transcription factors are bound, at which positions in the genome, and which genes they regulate. At present, genome-wide high-throughput determination of transcription factors and their binding sites is mainly achieved by chromatin immunoprecipitation sequencing (ChIP-Seq). ChIP-Seq uses an antibody against the transcription factor to capture the genomic DNA bound by that factor on chromatin; the DNA is then separated and purified, the base sequences of the DNA fragments (reads) are determined by next-generation sequencing (NGS), and finally the positions of the fragments on the genome are identified by mapping the reads back to the reference genome, thereby determining the binding sites of the transcription factor. The disadvantages of this method are that one experiment yields binding information for only one transcription factor, and that it is costly and time-consuming.

Transcriptional regulation is closely coupled with nucleosome exclusion, chromatin opening, and other processes. Eukaryotic DNA exists in the form of chromatin, the basic building block of which is the nucleosome. Generally, a prerequisite for the binding of a transcription factor to its binding site is that the nucleosomes around the binding site are displaced to form an open chromatin region, leaving the DNA duplex exposed. Thus, chromatin open regions on the genome are likely regions for transcription factor binding. High-throughput methods for detecting chromatin open regions include DNase-Seq, ATAC-Seq, FAIRE-Seq, etc. In addition, transcription factor binding is specific to the DNA sequence, i.e., the DNA to which a transcription factor binds has a specific pattern (motif) in composition and order. Therefore, an important part of the present invention is that multiple transcription factors (potentially all transcription factors with a known motif) and their binding sites can be identified at once by using chromatin open region information and the sequence characteristics (motifs) of transcription factor binding sites.

Non-coding regions of the genome, although not directly involved in coding proteins, are important regions for regulating transcription of coding sequences, and include numerous elements or combinations of elements that regulate gene expression, such as enhancers and transcription factor binding sites.

As research on mutations and disease has progressed, there is increasing evidence that mutations in functional elements of non-coding regions are closely linked to inherited diseases. In particular, the risk-associated single nucleotide variations/polymorphisms (SNVs/SNPs) of various diseases discovered by genome-wide association studies (GWAS) are mostly located in non-coding regions of the genome. If such a non-coding mutation lies in a transcription factor binding site, it can alter the affinity of the transcription factor for the DNA, and thus alter the transcription level of downstream target genes and the phenotypic characteristics of the cell. Evaluating the effect of SNVs located in transcription factor binding sites on transcription factor binding is therefore important for analyzing cell differentiation and carcinogenesis, and for annotating the function of individual genomes. For example, the oncogene c-MYC has a regulatory region (8q24) at a distance of 335 kb, where the G allele of SNP rs69883267 has a strong affinity for the transcription factor TCF4, resulting in uncontrolled high expression of c-MYC and a cancerous phenotype. Therefore, there is a need for a model that evaluates the effects of variations in non-coding regions and predicts their effects on cellular phenotype.

Unfortunately, there is no method for systematically identifying the binding sites of multiple transcription factors at once and evaluating the effect of nucleotide variations in the region of these transcription factor binding sites on transcription factor binding and transcription of downstream genes.

Disclosure of Invention

Purpose of the invention: using chromatin open region high-throughput sequencing data, the invention establishes a method for identifying transcription factors and their binding sites, and establishes a method and model for evaluating the effect of non-coding nucleotide variations located in transcription factor binding site regions on transcriptional regulation. The invention aims to provide a method for predicting the function of single-nucleotide genomic variations in non-coding regions.

The technical scheme is as follows: in order to solve the technical problems, the technical scheme adopted by the invention is as follows: a method for functionally predicting a single nucleotide genomic variation in a non-coding region, comprising the steps of:

1) chromatin open area recognition;

2) transcription factor binding site recognition: scanning the chromatin opening region for transcription factor binding sites using a site specific frequency matrix (PSSM) of transcription factors; determining a transcription factor based on the expression level of the gene encoding the transcription factor; taking a gene within 8 kilobases downstream of a transcription factor binding site as a target gene of the transcription factor;

3) evaluation of the effect of single nucleotide variation: based on the site-specific frequency matrix of the transcription factor, calculating the influence of single nucleotide variations located in the transcription factor binding site region on the binding of the transcription factor, and identifying the single nucleotide variations that significantly change the binding capacity of the transcription factor; the effect of the single nucleotide variations is further assessed by examining the biological pathways of the target genes of the transcription factor.
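The matrix scan in step 2) can be illustrated with a minimal sketch. The 4-position matrix, the uniform background frequency of 0.25, the pseudocount and the score threshold below are all hypothetical values chosen for illustration; they are not taken from the invention, and dedicated tools such as Homer use their own scoring and statistics.

```python
import math

# Hypothetical 4-position frequency matrix: one dict of base frequencies per position.
TOY_PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]


def log_odds_score(window, pssm, background=0.25, pseudocount=1e-3):
    """Sum of log2(frequency / background) over the window (assumed scoring scheme)."""
    return sum(
        math.log2((pssm[i][base] + pseudocount) / background)
        for i, base in enumerate(window)
    )


def scan(sequence, pssm, threshold):
    """Return (start, window, score) for every window whose score is >= threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = log_odds_score(window, pssm)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy open-region sequence; only windows matching the motif well pass the threshold.
hits = scan("CCATGACCNATGA", TOY_PSSM, threshold=4.0)
```

In practice the scan would be restricted to the sequences of the chromatin open regions identified in step 1), and both strands would normally be examined.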

Wherein step 1), chromatin open region identification, is carried out as follows: quality control is performed on the reads of the chromatin open region sequencing data, the reads are mapped back to the reference genome, and read-enriched regions are identified, thereby identifying the chromatin open regions.

Wherein reads with a quality value Q ≥ 30, i.e. a sequencing error rate ≤ 0.001, are used for the mapping.
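The correspondence between Q ≥ 30 and an error rate ≤ 0.001 follows from the standard Phred definition Q = -10·log10(p). The sketch below shows the conversion and a simple read filter; using the mean quality of a read as the filter criterion is an assumption made here for illustration, since the text does not state whether the threshold applies per base or per read.

```python
def phred_to_error_probability(q):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def passes_quality(qualities, min_q=30):
    """Keep a read if its mean Phred quality is >= min_q (assumed criterion)."""
    return sum(qualities) / len(qualities) >= min_q
```

For example, `phred_to_error_probability(30)` gives an error probability of 0.001, and a read with per-base qualities `[30, 32, 35]` passes the filter.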

Wherein, the mathematical model of the read enrichment area is Poisson distribution, and the enrichment significance value formula is as follows:

where k is the read count at the genomic locus, λ is the average read count over the genome, and the threshold for the enrichment significance value P is 10^-5.
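One common way to turn a Poisson model into an enrichment significance value is the upper-tail probability P(X ≥ k). The sketch below assumes that form, which may differ from the formula intended by the invention; the function name and the example values of k and λ are illustrative only.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for x in range(1, k):
        term *= lam / x    # P(X = x) obtained from P(X = x - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_enriched(k, lam, threshold=1e-5):
    """Call a locus enriched when its tail probability is at or below the threshold."""
    return poisson_upper_tail(k, lam) <= threshold
```

With a genome-wide average of λ = 2 reads, a locus with 5 reads is not called enriched, whereas a locus with 12 reads falls below the 10^-5 threshold.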

Wherein the chromatin open region sequencing data in step 1) is one or more of DNase I hypersensitive site sequencing data (DNase-Seq), formaldehyde-assisted isolation of regulatory elements sequencing data (FAIRE-Seq) and assay for transposase-accessible chromatin sequencing data (ATAC-Seq).

Wherein the gene expression level data in step 2) is Microarray data (Microarray) or ribonucleic acid sequencing data (RNA-Seq).

When the expression level (FPKM) of the gene encoding a transcription factor is greater than or equal to 8, the abundance of the transcription factor in the cell is considered to be high.

Wherein the effect of a single nucleotide variation in step 3) is evaluated by the following formula 1:

$$F = -\log_{10}\left(\frac{P(i,j)}{P(i,k)}\right) \qquad (1)$$

wherein P(i, j) and P(i, k) are the values of base j and base k, respectively, at the i-th position of the site-specific frequency matrix, and j and k are each one of adenine, guanine, cytosine and thymine.

Wherein base k (genotype k) is the mutant base or the base with low frequency in the population, and base j (genotype j) is the wild-type base or the base with high frequency in the population. A positive F indicates that the mutant or low-frequency genotype increases the affinity of the transcription factor, and the larger the value, the higher the affinity of the mutant or low-frequency genotype for the transcription factor. Conversely, a negative F indicates that the wild-type or high-frequency genotype increases the affinity of the transcription factor, and the smaller (more negative) the value, the higher the affinity of the wild-type or high-frequency genotype. F = 0 indicates that the genotype has no effect on transcription factor binding.
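The sign convention above can be made concrete with a small sketch of formula 1 in the form F = -log10(P(i, j) / P(i, k)), which is the form applied in Example 1. The one-column toy matrix and the helper names are hypothetical and serve only to illustrate the calculation.

```python
import math


def f_score(pssm, position, high_freq_base, low_freq_base):
    """Formula 1: F = -log10(P(i, j) / P(i, k)).

    position is 1-based (i); high_freq_base is j (wild type or high frequency),
    low_freq_base is k (mutant or low frequency).
    """
    column = pssm[position - 1]
    p_j = column[high_freq_base]
    p_k = column[low_freq_base]
    if p_j <= 0 or p_k <= 0:
        raise ValueError("matrix entries must be positive to take the log ratio")
    return -math.log10(p_j / p_k)


def interpret(f):
    """Translate the sign of F into the interpretation given in the text."""
    if f > 0:
        return "low-frequency (mutant) genotype increases affinity"
    if f < 0:
        return "high-frequency (wild-type) genotype increases affinity"
    return "no effect on binding"


# Hypothetical single-column matrix used only to show the three cases.
toy_pssm = [{"A": 0.1, "C": 0.1, "G": 0.4, "T": 0.4}]
f_gain = f_score(toy_pssm, 1, "A", "G")  # k = G has the larger matrix value
f_loss = f_score(toy_pssm, 1, "G", "A")  # j = G has the larger matrix value
f_none = f_score(toy_pssm, 1, "G", "T")  # equal matrix values
```

A zero entry in the matrix would make the ratio undefined, which is why the sketch rejects non-positive values instead of guessing a pseudocount.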

Advantageous effects: compared with the prior art, the invention has the following advantages:

1. The method can predict multiple transcription factors in the genome of a target sample simultaneously, and the same sample does not need to be sequenced repeatedly with different antibodies for different transcription factors, as is required by the ChIP-Seq technique.

2. The invention is based on chromatin open region sequencing data, has low requirement on sequencing depth and simultaneously aims at multiple transcription factors, thereby having low time and economic cost.

3. The invention establishes a model for evaluating the influence of single-base genomic variations in transcription factor binding site regions on transcription factor binding, i.e. a method for functionally annotating variations in non-coding regions. The rationality and accuracy of the model are demonstrated in Example 1.

4. The invention achieves the identification of transcription factors, of transcription factor binding sites and of the target genes regulated by the transcription factors, as well as the annotation of variations in the transcription factor binding sites, using only chromatin open region sequencing and messenger RNA (mRNA) sequencing.

Drawings

FIG. 1 is a flow chart of a method of the present invention for functional prediction of a single nucleotide genomic variant of a noncoding region;

FIG. 2 is a calculation process description of embodiment 1 of the present invention;

FIG. 3 shows the transcription factors bound at DNA sites of the chromatin open regions conserved in eight cells.

Detailed Description

The present invention is further illustrated by the following specific examples, it should be noted that, for those skilled in the art, variations and modifications can be made without departing from the principle of the present invention, and these should also be construed as falling within the scope of the present invention.

Example 1: accuracy experiment of the prediction method of the invention: calculating the effect of a single nucleotide polymorphism in the DNA of a transcription factor binding site on transcription factor binding

The proto-oncogene c-MYC has a regulatory region (8q24) at a distance of 335 kb, which contains a SNP (rs69883267; genome assembly version GRCh37.p13). In the East Asian population of the 1000 Genomes Project (Phase3_V1-EAS, 1008 individuals), on the positive strand, the frequency of guanine (G) at this site is 0.388 and the frequency of cytosine (C) is 0.612. In the European population (1000 Genomes Project, 1006 individuals), the frequencies are G = 0.499 and T = 0.501. The DNA sequence ("ATGAAAGGC") in which the SNP is located is the binding site of the transcription factor TCF4, and the regulated target gene is c-MYC.

In the European population, at the polymorphic site rs69883267, genotype G has a slightly lower frequency (0.499) and genotype T has a slightly higher frequency (0.501). The site is a binding site of the transcription factor TCF4, whose regulated target gene is c-MYC. What effect, then, do genotype G and genotype T have on the affinity of the transcription factor TCF4? The F value is calculated from the site-specific frequency matrix (PSSM) of the transcription factor TCF4 using formula 1 of the invention. The polymorphic site corresponds to position 9 of the PSSM, so P(9, G) = 0.48, P(9, T) = 0.07, and F = -log10(0.07/0.48) ≈ 0.84. The value of F is positive, which indicates that genotype G enhances the affinity of the transcription factor; expressed as a ratio, 0.48/0.07 ≈ 6.86, i.e. genotype G has a 6.86-fold greater affinity for the transcription factor TCF4 than genotype T. The corresponding effect of the enhanced affinity is an increase in the expression of the target gene c-MYC. That is, in individuals with genotype G at this locus, c-MYC is expressed at a higher level than in individuals with genotype T, which increases the risk of developing a cancer phenotype.

Evidence supporting the above conclusions is found in the published literature. The G allele of rs69883267 is a risk variation for colon, breast and prostate cancer. In the colon cancer cell lines HCT116 and DLD, the G allele increased the affinity of TCF4 by about 26% and 51%, respectively; in the DLD cell line, the G allele results in a 2-fold increase in c-MYC expression (Molecular and Cellular Biology, 2010, 30 (6): 1411-1420).

Similarly, the disease risk at this site in the Asian population can be calculated. About 61.2% of the Asian population carries genotype C at site rs69883267, and genotype C still carries a disease risk compared with the lowest-risk genotype T: F = -log10(P(9, T)/P(9, C)) = -log10(0.07/0.20) ≈ 0.46 > 0, although the effect is not as pronounced as for genotype G.
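The arithmetic of Example 1 can be checked with a few lines of code. The sketch uses only the three position-9 matrix values quoted above (G = 0.48, C = 0.20, T = 0.07); the function name is illustrative, and genotype T is taken as the reference genotype in both comparisons, as in the text.

```python
import math

# Position-9 values of the TCF4 matrix as quoted in Example 1.
P9 = {"G": 0.48, "C": 0.20, "T": 0.07}


def f_relative_to(reference_base, other_base, column):
    """F = -log10(P(reference) / P(other)) for one matrix column."""
    return -math.log10(column[reference_base] / column[other_base])


f_g = f_relative_to("T", "G", P9)  # genotype G compared with T
f_c = f_relative_to("T", "C", P9)  # genotype C compared with T
fold_g = P9["G"] / P9["T"]         # affinity ratio of G over T
```

Both values are positive and F for genotype G exceeds F for genotype C, which matches the ordering of risk described in the example.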

Therefore, the functional prediction method of the non-coding region single nucleotide genomic variation can accurately evaluate the influence of the variation of the transcription factor binding site region on the transcription factor binding and the regulation effect on the transcription of a target gene.

Example 2 identification of transcription factors common to 8 cell lines and prediction of the function of genomic variations of transcription factor binding sites

1. The data source is as follows:

The chromatin open region high-throughput sequencing data (DNase-Seq) of the 8 cell lines GM12878, IMR90, MCF-7, K562, BJ, H7, HepG2 and M059J are from the Gene Expression Omnibus of the National Center for Biotechnology Information (GEO ID: GSE32970) (Nature, 2012 Sep 6; 489 (7414): 75-82). The cell types of the 8 cell lines are shown in Table 1.

TABLE 1

The sources of the expression data (RNA high-throughput sequencing, RNA-Seq) for the eight cell lines are given in Table 1. Variation data were obtained from the UCSC Genome Browser of the University of California, Santa Cruz, USA, using its Tables function to download the simple variation data (dbSNP150) of the human genome hg19. Single-base SNPs (class "single") were selected according to the class field, and each record mainly comprises the following information: chromosome (chrom), position (chromStart), SNP identifier (name), strand (strand), and base information (observed).
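The selection of single-base SNPs can be sketched as a simple filter over table rows. The rows below are made-up placeholders rather than real dbSNP records, and the sketch assumes that each row has already been parsed into a dictionary with the fields named in the text plus a `class` field holding the variant category.

```python
def select_single_base_snps(rows):
    """Keep rows of class 'single' whose observed alleles are all single A/C/G/T bases."""
    kept = []
    for row in rows:
        if row["class"] != "single":
            continue
        alleles = row["observed"].split("/")
        if all(len(a) == 1 and a in "ACGT" for a in alleles):
            kept.append(
                {key: row[key] for key in ("chrom", "chromStart", "name", "strand", "observed")}
            )
    return kept


# Placeholder rows for illustration only; identifiers and coordinates are invented.
example_rows = [
    {"chrom": "chr1", "chromStart": 100, "name": "snp_placeholder_1",
     "strand": "+", "observed": "A/G", "class": "single"},
    {"chrom": "chr1", "chromStart": 250, "name": "snp_placeholder_2",
     "strand": "+", "observed": "-/AT", "class": "insertion"},
    {"chrom": "chr2", "chromStart": 900, "name": "snp_placeholder_3",
     "strand": "-", "observed": "C/T", "class": "single"},
]
single_snps = select_single_base_snps(example_rows)
```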

2. Data processing

The data was analyzed according to the flow chart shown in fig. 1.

1) Chromatin open region identification ("peak" identification)

Chromatin open region sequencing data (DNase-Seq) analysis: sequencing reads were aligned to the human reference genome (version hg19) with the tool BWA (Bioinformatics, 2010, 25: 1754-60), retaining only reads with fewer than 4 aligned positions on the reference genome. Read-enriched regions (peaks) were identified with the MACS2 tool (P ≤ 10^-5) (Genome Biol, 2008, 9 (9): R137).

2) Transcription factor and binding site recognition thereof

Transcription factor binding sites in the read-enriched regions (peaks) were identified with the Homer tool (Molecular Cell, 2010, 38 (4): 576-589) at P ≤ 0.01. In this example, the (conserved) chromatin open regions found in all 8 cells were selected for analysis. Target genes downstream of the read-enriched regions were annotated with the PAVIS tool (Bioinformatics, 2013, 29 (23): 3097-3099) using a distance of 8 kilobase pairs. In this way, for each cell type, the transcription factor binding sites in the chromatin open regions, the transcription factors that may bind to them, and the genes downstream of each binding site (regulated target genes) were identified.
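The 8-kilobase target gene rule can be sketched as follows. The sketch interprets "downstream" in a strand-aware way, i.e. the binding site lies at most 8,000 bp upstream of the gene's transcription start site (TSS) on the gene's own strand; this interpretation, the `Gene` record and the gene names are assumptions for illustration and do not describe how PAVIS works internally.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int      # transcription start site coordinate
    strand: str   # "+" or "-"


def target_genes(site_chrom, site_pos, genes, max_distance=8000):
    """Return names of genes whose TSS lies within max_distance downstream of the site."""
    targets = []
    for gene in genes:
        if gene.chrom != site_chrom:
            continue
        # Distance from the site to the TSS, measured along the gene's strand.
        offset = gene.tss - site_pos if gene.strand == "+" else site_pos - gene.tss
        if 0 <= offset <= max_distance:
            targets.append(gene.name)
    return targets


# Hypothetical genes around a binding site at chr1:10000.
genes = [
    Gene("GENE_A", "chr1", 15000, "+"),  # 5 kb downstream on the plus strand
    Gene("GENE_B", "chr1", 30000, "+"),  # 20 kb away, too far
    Gene("GENE_C", "chr1", 5000, "-"),   # 5 kb downstream on the minus strand
    Gene("GENE_D", "chr2", 12000, "+"),  # different chromosome
]
found = target_genes("chr1", 10000, genes)
```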

The number of chromatin open regions conserved in 8 cells is shown in Table 2.

TABLE 2 information on chromatin opening regions ("peaks") identified in eight cell lines

Expression data (RNA-Seq) were aligned to the reference genome hg19 with TopHat (Genome Biology, 2013, 14 (4): R36), and the expression level FPKM (fragments per kilobase of transcript per million mapped reads) of each gene was calculated with Cufflinks. The expression level (FPKM) of the gene encoding each transcription factor was calculated; if the FPKM of the gene is greater than or equal to 8, the transcription factor is considered to be abundant in the cell and able to bind at its binding sites, thereby regulating the transcription of downstream target genes. The chromatin open regions conserved in the 8 cells and the bound transcription factors identified are shown in FIG. 3.

The basic logic is that chromatin in an open state is a prerequisite for transcription factor binding at a DNA site, and that the DNA fragment bound by a transcription factor is sequence specific. Based on these two points, the binding sites of transcription factors in the chromatin open regions of the 8 cells were identified, and the transcription factors that may bind there were inferred. The transcription factors that play a dominant role in each cell were then determined from the expression levels of the genes encoding the transcription factors in that cell. The transcription factors identified in this example are those that bind to the chromatin open regions conserved in the eight cells.

FIG. 3 shows the transcription factors with significance P ≤ 10^-8 and FPKM ≥ 8. In FIG. 3, the columns represent cell types and the rows represent transcription factors. The circle size indicates the significance (P) of transcription factor binding at the DNA sites of the chromatin open regions: the larger the circle, the more significant the binding. The depth of the circle fill color indicates the expression level (FPKM) of the gene encoding the transcription factor. The specific data are shown in Table 3 (Table 3-1 gives the expression data of the genes encoding the transcription factors; Table 3-2 gives the significance values of transcription factor enrichment in the chromatin open regions).
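The two-threshold selection used for FIG. 3 (P ≤ 10^-8 and FPKM ≥ 8) amounts to a simple filter. In the sketch below, the record layout and the transcription factor names are placeholders, not values from Table 3.

```python
def dominant_transcription_factors(records, max_p=1e-8, min_fpkm=8.0):
    """Return names of factors that are both significantly enriched and expressed."""
    return sorted(
        record["tf"]
        for record in records
        if record["p_value"] <= max_p and record["fpkm"] >= min_fpkm
    )


# Placeholder records for one cell line (invented values).
records = [
    {"tf": "TF_A", "p_value": 1e-12, "fpkm": 25.0},  # passes both thresholds
    {"tf": "TF_B", "p_value": 1e-12, "fpkm": 2.0},   # enriched but weakly expressed
    {"tf": "TF_C", "p_value": 1e-3, "fpkm": 40.0},   # expressed but not enriched
    {"tf": "TF_D", "p_value": 1e-8, "fpkm": 8.0},    # exactly on both thresholds
]
selected = dominant_transcription_factors(records)
```

Requiring both conditions reflects the logic described above: a motif match in open chromatin suggests a candidate factor, and the expression level indicates whether that factor is actually present in the cell.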

TABLE 3-1 expression data (FPKM) of genes encoding transcription factors

TABLE 3-2 significance of enrichment of transcription factor in chromatin open region (-log10(P-value))

For the four cancer cell lines, the identified transcription factors and target genes are listed in Tables 4 to 7.

TABLE 4 highly affected genes and related information in the K562 cell line

TABLE 5 highly affected genes and related information in HepG2 cell line

TABLE 6 genes highly affected in the M059J cell line and related information

TABLE 7 genes highly affected in MCF-7 cell line and related information

This example identifies transcription factors that are significantly present in the 8 cell lines, some of which are common to them, such as the transcription factors CTCF, BORIS and Sp1. The gene encoding CTCF is frequently expressed in normal somatic cells. BORIS (Brother of the Regulator of Imprinted Sites), a paralogue of CTCF, shows the opposite pattern (PLoS Genet, 2008, 4 (8): e1000169). The gene encoding BORIS is thought to be involved in carcinogenesis, and the literature reports that it is frequently expressed in most tumor cells and rarely expressed in normal cells (Proc Natl Acad Sci USA, 2002, 99 (10): 6806-11; Eur J Cancer, 2012, 48 (6): 929-35). Recent studies have shown that the BORIS-encoding gene is highly expressed in rectal cancer and has an inhibitory effect on apoptosis (Eur J Cancer, 2012, 48 (6): 929-35).

3) Evaluation of the Effect of Single nucleotide variations

The effect of single nucleotide variation was assessed based on the site specific frequency matrix (PSSM) of transcription factors (equation 1).

Where P (i, j) and P (i, k) are the values of base j and base k, respectively, at the i-th position of the site-specific frequency matrix, j and k being one of adenine, guanine, cytosine and thymine. The variation data information is from dbSNP150(hg19) described above.

This example identifies Single Nucleotide Polymorphism (SNPs) variations in the transcription factor binding site region. The number of SNPs present in the binding site region of the transcription factors that bind in the conserved chromatin opening region in the eight cells is shown in Table 8.

TABLE 8 number of Single nucleotide variants (SNPs) in the region of the Transcription Factor Binding Site (TFBS) of the eight cell lines

For the 4 cancer cell lines, the transcription factors, target genes and risk genomic variations were identified, together with the extent to which these variations affect transcription factor binding, expressed as the sum of the F values of all variations (SNPs). The F value is calculated with formula 1:

$$F = -\log_{10}\left(\frac{P(i,j)}{P(i,k)}\right) \qquad (1)$$

where P(i, j) and P(i, k) are the values of base j and base k, respectively, at the i-th position of the site-specific frequency matrix, and j and k are each one of adenine, guanine, cytosine and thymine.

Here base k (genotype k) is the mutant base or the base with low frequency in the population, and base j (genotype j) is the wild-type base or the base with high frequency in the population. A positive F indicates that the mutant or low-frequency genotype increases the affinity for the transcription factor, and the larger the value, the higher that affinity; a negative F indicates that the wild-type or high-frequency genotype increases the affinity for the transcription factor, and the smaller (more negative) the value, the higher that affinity; F = 0 indicates that the genotype has no effect on the binding of the transcription factor.

A gene may have several SNPs located in transcription factor binding sites. To assess the combined effect of these SNPs on the transcriptional regulation of the gene, the total effect is expressed as the sum of the F values (ΣF) of all of these SNPs.
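The per-gene summation ΣF is a plain group-and-sum operation. The sketch below assumes that each SNP has already been assigned to a target gene and scored with formula 1; the gene names, SNP labels and F values are invented for illustration.

```python
from collections import defaultdict


def sum_f_by_gene(snp_scores):
    """Sum the F values of all SNPs assigned to each target gene."""
    totals = defaultdict(float)
    for gene, _snp_id, f_value in snp_scores:
        totals[gene] += f_value
    return dict(totals)


# Invented (gene, SNP, F) triples.
snp_scores = [
    ("GENE_A", "snp_1", -5.0),
    ("GENE_A", "snp_2", -2.5),
    ("GENE_B", "snp_3", 1.25),
]
sigma_f = sum_f_by_gene(snp_scores)
```

Because positive and negative F values can cancel in the sum, a ΣF close to zero does not by itself imply that the individual SNPs have no effect.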

From the results of this example, several cases were chosen to demonstrate the rationality and biological significance of the results. It was found that the expression of the genes DAD1, SIRPA, BAX, etc. varies between individual genomes. Specifically, because the genotypes (SNPs) in the regulatory regions (transcription factor binding sites) of these genes differ between populations, the binding strengths of the transcription factors differ, the expression levels of these genes change, and these changes are associated with the disease (cancer) phenotype.

In K562 (leukemia cell line) cells (Table 4), ΣF of the gene DAD1 is -15.4. The negative value with a large absolute value suggests that the low-frequency alleles (MAF) of the polymorphic sites rs2301200, rs227870 and rs5742730 decrease the affinity of the transcription factors NFY, Klf9, Max and NRF at the regulatory site of DAD1, resulting in decreased expression of DAD1. DAD1 is a target gene of the transcription factors NFY, Klf9, Max and NRF. The expression product of DAD1 is an enzyme that inhibits apoptosis, and inactivation of DAD1 causes apoptosis (Genomics, 1995, 26 (2): 433-5). DAD1 also interacts with the MCL1 gene, which likewise inhibits apoptosis (J Biochem, 2000, 128 (3): 399-).

In HepG2 (liver cancer cell line) cells (Table 5), ΣF of the gene SIRPA is -13.5. The negative value with a large absolute value indicates that the low-frequency alleles (MAF) of the polymorphic sites rs55698111 and rs67558779 decrease the affinity of the transcription factors NFY and Sox2 at the regulatory site of SIRPA, resulting in decreased expression of SIRPA. SIRPA is a target gene of NFY and Sox2; its expression product is a member of the signal regulatory protein family and acts as an inhibitory receptor. The protein interacts with the CD47 protein, which protects cells from phagocytosis by macrophages, and antibodies against it can inhibit cancer cell growth and metastasis (JCI Insight, 2017, 2 (1): e89140).

In the M059J cell line (Table 6), ΣF of the gene PTGR1 is 14. The positive value with a large absolute value indicates that the low-frequency alleles (MAF) of the polymorphic sites rs10980954, rs3031178, rs71501685 and rs200997621 increase the affinity of the transcription factors Mef2b and OCT4 at the regulatory site of PTGR1, resulting in increased expression of PTGR1.

In MCF-7 (breast cancer cell line) cells (Table 7), ΣF of the gene BAX is 10. The positive value with a large absolute value indicates that the low-frequency alleles (MAF) of the polymorphic sites rs115440855 and rs138364829 increase the affinity of the transcription factor c-Myc at the regulatory site of BAX, resulting in increased expression of BAX. BAX is a member of the Bcl-2 gene family and is regulated by c-Myc, and its product is closely related to apoptosis. In the normal state of the cell, the BAX protein resides in the cytosol; upon an apoptotic signal, it undergoes a conformational change and becomes associated with organelle membranes, particularly the mitochondrial membrane (EMBO J, 1998, 17 (14): 3878-85). Importantly, these highly affected genes are all closely related to cell death, and the SNP-induced changes in transcription factor binding alter these genes in a way that allows cells to avoid apoptosis.

Similarly, the biological significance of the other results can be interpreted from the magnitude of the F values, the transcription factors, the target genes and the variation sites.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Other variations or modifications may occur to those skilled in the art based on the foregoing description. It is neither necessary nor possible to list all embodiments here, and obvious variations or modifications derived from this embodiment remain within the scope of the invention.

Claims

1. A method for functionally predicting a single nucleotide genomic variation in a noncoding region, comprising the steps of:

1) chromatin open area recognition;

2) transcription factor binding site recognition: scanning the binding sites of the transcription factors in the chromatin opening regions by using the site-specific frequency matrix of the transcription factors; determining a transcription factor based on the expression level of the gene encoding the transcription factor; taking a gene within 8 kilobases downstream of a transcription factor binding site as a target gene of the transcription factor;

3) evaluation of the effect of single nucleotide variation: calculating the influence of the single nucleotide variation positioned in the transcription factor binding site region on the binding of the transcription factor based on the site specific frequency matrix of the transcription factor, and identifying the single nucleotide variation which obviously changes the binding capacity of the transcription factor; further evaluating the effect of single nucleotide variation by looking at the target gene biological pathway of the transcription factor;

the effect of a single nucleotide variation in step 3) is evaluated by the following formula 1:

$$F = -\log_{10}\left(\frac{P(i,j)}{P(i,k)}\right) \qquad (1)$$

wherein P(i, j) and P(i, k) are respectively the values of base j and base k at the i-th position of the site-specific frequency matrix, and j and k are each one of adenine, guanine, cytosine and thymine; a positive F indicates that the mutant or low-frequency genotype increases the affinity of the transcription factor, and a negative F indicates that the wild-type or high-frequency genotype increases the affinity of the transcription factor; when F is positive, the larger the value, the higher the affinity of the mutant or low-frequency genotype for the transcription factor, and conversely, when F is negative, the smaller the value, the higher the affinity of the wild-type or high-frequency genotype for the transcription factor; F = 0 indicates that the genotype has no effect on transcription factor binding.

2. The method of claim 1, wherein step 1) of identifying the chromatin open region comprises: performing quality control on the reads of the chromatin open region sequencing data, mapping the reads back to the reference genome, and identifying read-enriched regions, thereby identifying the chromatin open regions.

3. The method of claim 2, wherein reads with a quality value Q of 30 or more, i.e. a sequencing error rate of 0.001 or less, are used for the mapping.

4. The method of claim 2, wherein the mathematical model of the read enrichment region is Poisson distribution, and the formula of the enrichment significance value is as follows:

5. The method for functionally predicting a single nucleotide genomic variation in a noncoding region according to claim 1, wherein the chromatin open region sequencing data in step 1) is one or more of DNase I hypersensitive site sequencing data, formaldehyde-assisted isolation of regulatory elements sequencing data and assay for transposase-accessible chromatin sequencing data.

6. The method of claim 1, wherein the gene expression level data of step 2) is microarray data or ribonucleic acid sequencing data.

7. The method of claim 1, wherein the transcription factor is considered to be abundant in the cell when the expression level (FPKM) of the gene encoding the transcription factor is 8 or more.