CN113160877A - Prediction method of cell-specific genome G-quadruplex - Google Patents


Info

Publication number
CN113160877A
Authority
CN
China
Prior art keywords
dna
cell
specific
data
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110030502.9A
Other languages
Chinese (zh)
Other versions
CN113160877B (en)
Inventor
孙啸
张卓凡
居胜红
杨婧
刘宏德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110030502.9A
Publication of CN113160877A
Application granted
Publication of CN113160877B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a cell-specific G4-DNA prediction method and belongs to the field of genetic technology. The method comprises the following steps: (1) generating the set of all potential G4-DNA sequences of a given species; (2) collecting cell-specific G4-DNA data obtained experimentally for that species; (3) calculating the genome-wide chromatin open structure signal and methylation distribution signal of the cell; (4) establishing a cell-specific chromatin environment feature vector for each G4-DNA; (5) establishing positive and negative training sample sets; (6) training a cell-specific G4-DNA prediction classifier on the samples, where the input of the classifier is the feature vector of a potential sequence and the output is its classification as a positive or negative sample. Existing G4-DNA prediction methods can only identify G4-DNA formed in vitro or G4-DNA with in vivo activity, whereas this method can identify G4-DNA present in specific cells.

Description

Prediction method of cell-specific genome G-quadruplex
Technical Field
The invention belongs to the field of genetic technology and particularly relates to a cell-specific G4-DNA prediction method, which identifies G4-DNA in specific cells or tissues and can be applied to research on tumor gene regulation and to the classification of cell populations.
Background
Research on the relationship between genome structure and major diseases is at the frontier of international genetic technology. Existing research shows that the G4-DNA structure is linked to the occurrence and development of tumors, but direct detection of the G4-DNA structure in cells is very difficult. A new method is therefore needed to predict whether the G4-DNA structure exists in specific cells or tissues, and where exactly it lies in the genome, so that the relationship between the G4-DNA structure and disease can be studied.
Genomic DNA mostly adopts the double helix structure, but a special secondary structure, the G-quadruplex, also exists. A G-quadruplex forms from four runs of consecutive guanines (G) on a single-stranded nucleic acid: guanines from different runs form a square plane (a G-tetrad), in which adjacent guanines interact pairwise through Hoogsteen base pairing. Three or more G-tetrad planes stack to form the G-quadruplex structure. The electronegative carbonyl oxygen of each G points toward the center of the plane, where a monovalent metal ion (e.g., Na⁺ or K⁺) can be accommodated to stabilize the G-quadruplex structure. The G-quadruplex may form on DNA or RNA. This technology focuses on the G-quadruplex on genomic DNA, called G4-DNA for short.
G4-DNA was first found at telomeric sites in chromosomes, where it maintains telomere structure. Scientists subsequently found that G4-DNA plays a role in gene transcription, DNA replication, DNA recombination, and related processes. In recent years in particular, more and more research results suggest that G4-DNA is a key node of the genome and may play an important role in regulating genome function. Researchers have found that sequences capable of forming G4-DNA are widely distributed in the promoter region about 1 kb upstream of gene transcription start sites, and that more than 40 percent of gene promoters in the human genome contain at least one G4-DNA sequence. This suggests that G4-DNA may have an important function in regulating gene transcription and may become a new drug target.
Because G4-DNA influences gene transcription, it is naturally associated with human diseases. In fact, G4-DNA is preferentially located in tumor-associated and regulatory genes. The promoter regions of many proto-oncogenes, such as VEGF, HIF-1α, bcl-2, and c-kit, contain G4 structures, and these structures have important regulatory effects on proto-oncogene expression. G4-DNA was first found in the promoter of the proto-oncogene c-MYC, and a mutation in this G4-DNA was found to affect the in vivo transcriptional activity of c-MYC. Recent experimental studies show that a key G4-DNA also exists in the promoter of the tumor-associated gene BCL2 and inhibits its transcription. The KRAS gene is also closely related to tumors: the region from -148 to -116 upstream of its transcription start site can form G4-DNA, and this quadruplex recruits two important nuclear transcription factors to the promoter region to control KRAS transcription. Fanconi anemia is closely related to G4-DNA structures in genome replication: the helicase FANCJ can unwind G4-DNA structures, which influences the replication process and is associated with chromosome instability.
At present, experimental methods for detecting G4-DNA fall into two main categories: biophysical methods and biological methods. Biophysical methods use physical instrumentation to identify G4-DNA; commonly used methods include circular dichroism spectroscopy, ultraviolet spectroscopy, electrospray ionization mass spectrometry, and nuclear magnetic resonance spectroscopy. In addition, the structure of G4-DNA can be characterized using X-ray techniques or atomic force microscopy. Immunofluorescence microscopy imaging is a frontier technology for detecting G4-DNA in cells and enables visual detection and analysis of intracellular G4-DNA. Biological methods identify G4-DNA based on the biological characteristics of its structure. Frontier biological detection methods are usually combined with high-throughput DNA sequencing; representative technologies are G4-seq and G4 ChIP-seq. G4-seq is a detection technique based on the polymerase stop assay. Its basic principle is as follows: if the DNA template strand being sequenced forms a G4-DNA structure, sequencing extension is blocked, sequencing quality drops sharply, and erroneous reads are produced, so the presence of a G4-DNA structure can be judged from the sequencing result and the quality value of each base. This was the first experimental technique for genome-wide detection of G4-DNA. However, the technique is essentially an in vitro assay and does not reflect the G4-DNA actually present in cells in vivo.
G4 ChIP-seq is a technology for in vivo detection. It captures G4-DNA with a specific ligand or antibody and obtains the G4-DNA actually present in a specific cell by means of chromatin immunoprecipitation and sequencing.
Although G4-DNA can be detected experimentally, such techniques are currently complex to implement, difficult to use in practical studies (e.g., studies of tumor-associated G4-DNA), and expensive. A method capable of predicting the G4-DNA actually present is therefore desirable. Some prediction methods already exist, but most can only predict G4-DNA that may form in vitro and cannot predict G4-DNA in vivo. A method for predicting G4-DNA in vivo has recently been developed, but the G4-DNA it predicts is not cell-specific. Cell-specific G4-DNA means G4-DNA actually present in a specific cell. No cell-specific prediction method currently exists, and the aim of the invention is to establish one.
Disclosure of Invention
The purpose of the invention is as follows: no cell-specific G4-DNA prediction method currently exists, yet research on major diseases requires cell-specific G4-DNA data. To meet this need, the invention provides a cell-specific G4-DNA prediction method. The method generates a cell-specific candidate G4-DNA set by analyzing the genome sequence of a species or known G4-DNA experimental data for the species, and then predicts whether each candidate G4-DNA can form actual G4-DNA in a given cell using known biological data for that cell, including chromatin open structure data and DNA methylation data.
The technical scheme is as follows: in order to achieve the technical purpose, the invention provides a cell-specific G4-DNA prediction method, which comprises the following steps:
(1) Generate the set of all potential G4-DNA sequences for a given species. Potential G4-DNA sequences comprise two parts: regular G4-DNA and irregular G4-DNA. First, the whole genome sequence of the species (e.g., the human genome sequence) is analyzed to identify regular G4-DNA sequences. Based on the structural characteristics of G4-DNA, a G4-DNA sequence pattern is constructed, denoted $G_xN_{1-7}G_xN_{1-7}G_xN_{1-7}G_x$, where $G_x$ represents x consecutive guanines (G) with x greater than or equal to 3, and $N_{1-7}$ represents a continuous sequence fragment of 1 to 7 arbitrary bases. The genome is scanned with this pattern to find all regular G4-DNA sequences that fit it; the set of these sequences is called $GS_1$. Second, in vitro experimental G4-DNA data for the species are collected, and all irregular G4-DNA sequences that do not satisfy the above pattern are identified; the set of these sequences is called $GS_2$. Let $GS = GS_1 \cup GS_2$; GS is then the set of all potential G4-DNA sequences of the species, that is, the set of candidate sequences for cell-specific G4-DNA. The candidate set is stored as a genome annotation file in BED format, containing the genomic coordinates of each potential G4-DNA sequence. Each line in the BED file represents the position of one G4-DNA in the genome, in the form (chrom, chromStart, chromEnd), where the three fields are the chromosome on which the G4-DNA sequence lies, its start position, and its end position, respectively.
(2) Collect in vivo cell-specific G4-DNA data obtained experimentally for the species, as prior information for constructing the prediction model. In vivo cell-specific G4-DNA data are provided by G4 ChIP-seq sequencing experiments, and the raw experimental data obtained by applying this technique to different cells are collected. To ensure data quality, the method filters out sequences that are too short. The cell-specific G4-DNA set is also stored as a BED file, with entries in the form (chrom, chromStart, chromEnd).
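The pattern scan for the regular set can be sketched with a regular expression. This is a minimal illustration rather than the patented implementation: the function name `scan_g4` is hypothetical, each G run is assumed to be matched independently as three or more guanines, only the given strand is scanned (the C-rich complement on the reverse strand would need a mirrored pattern), and coordinates follow the usual 0-based, half-open BED convention.

```python
import re

# G{3,} N{1,7} G{3,} N{1,7} G{3,} N{1,7} G{3,}
# Assumption: each G run only needs length >= 3; runs are not forced to be equal in length.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def scan_g4(chrom, sequence):
    """Return non-overlapping BED-style (chrom, chromStart, chromEnd) hits on one strand."""
    seq = sequence.upper()
    return [(chrom, m.start(), m.end()) for m in G4_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Telomere-like toy sequence containing one regular motif.
    for chrom, start, end in scan_g4("chr1", "AAGGGTTAGGGTTAGGGTTAGGGAA"):
        print(f"{chrom}\t{start}\t{end}")
```

Because `finditer` reports non-overlapping matches, nested or overlapping motifs are collapsed to the first match found; a production scan would need an explicit policy for those cases.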
(3) Collect cell-specific chromatin open structure data and DNA methylation data, and calculate from them the genome-wide chromatin open structure signal and methylation distribution signal of the corresponding cell. These signals are regarded as the chromatin environment signals of G4-DNA. First, data obtained for the corresponding cell with the chromatin accessibility assay ATAC-seq are processed to generate the genome-wide chromatin open structure signal. ATAC-seq sequencing data for a specific cell (such as the human chronic myelogenous leukemia cell line K562) are obtained, usually in BedGraph format, which contains the coordinates and openness value of each open region in the form (chrom, chromStart, chromEnd, value); that is, each region is a quadruple consisting of a chromosome, a region start coordinate, a region end coordinate, and an openness value. The method expands this file: entries for all genomic regions that do not appear in the original file (i.e., regions of low openness) are added, and the openness value of each added entry is set to 0. The result is genome-wide chromatin openness information in a BedGraph file. Similarly, data obtained for the corresponding cell with the DNA methylation assay WGBS-seq are processed to form the genome-wide methylation distribution signal. WGBS-seq result data are also stored in BedGraph format, containing the coordinates and methylation level of each hypermethylated region, with each entry in quadruple form. The method expands this file in the same way: entries for all genomic regions that do not appear in the original file (i.e., hypomethylated regions) are added, and the methylation level of each added entry is set to 0. The result is genome-wide DNA methylation information in a BedGraph file.
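The zero-filling of a BedGraph signal can be illustrated as follows. This is a sketch under stated assumptions: the helper name `fill_bedgraph_gaps` and the `chrom_sizes` dictionary of chromosome lengths are hypothetical, input intervals are assumed not to overlap, and coordinates are 0-based and half-open.

```python
def fill_bedgraph_gaps(entries, chrom_sizes):
    """Insert zero-valued entries for every region missing from a BedGraph signal.

    entries: list of (chrom, chromStart, chromEnd, value), assumed non-overlapping.
    chrom_sizes: dict mapping chromosome name to its length (an assumed input).
    """
    by_chrom = {}
    for chrom, start, end, value in entries:
        by_chrom.setdefault(chrom, []).append((start, end, value))

    filled = []
    for chrom in sorted(chrom_sizes):
        cursor = 0  # first position not yet covered on this chromosome
        for start, end, value in sorted(by_chrom.get(chrom, [])):
            if start > cursor:
                filled.append((chrom, cursor, start, 0.0))  # uncovered gap
            filled.append((chrom, start, end, value))
            cursor = max(cursor, end)
        if cursor < chrom_sizes[chrom]:
            filled.append((chrom, cursor, chrom_sizes[chrom], 0.0))  # trailing gap
    return filled
```

The same helper applies to both the openness signal and the methylation signal, since both are described as quadruple entries with missing regions assigned a value of 0.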
(4) Establish the cell-specific chromatin environment feature vector of each G4-DNA sequence. The method treats cell-specific chromatin environment information as an important basis for judging whether cell-specific G4-DNA forms, and arranges the chromatin environment information of the region in which a G4-DNA sequence lies into a corresponding feature vector. Specifically, the method determines a chromatin background region for the sequence and calculates the chromatin environment feature vector of the G4-DNA from it. The midpoint of each G4-DNA entry's coordinates is taken as the center and extended upstream and downstream to form a fixed-length region (default length 6000 bp), which serves as the chromatin environment background region for that entry. Because this region is long, the data features are compressed by computing regional means with a sliding window. Specifically, a fixed-length sliding window (default length 300 bp) scans the region with a fixed step (default 300 bp), and at each step the mean chromatin openness value or methylation value within the window is calculated and used as the chromatin environment background value of the region covered by the window. With the default settings, a 20-dimensional sequence of chromatin openness values and a 20-dimensional sequence of methylation values are obtained. Each G4-DNA sequence is therefore represented by a floating-point feature vector of dimension (1, 40): $(o_1, o_2, \ldots, o_{20}, m_1, m_2, \ldots, m_{20})$, where $o_i$ and $m_i$ are the mean chromatin openness and the mean methylation level in the i-th window of the scan.
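The sliding-window compression can be sketched as below. The names `window_means` and `feature_vector` are hypothetical, each signal is assumed to be a list of (start, end, value) intervals for the chromosome that carries the G4-DNA entry, and positions outside the signal (for example, before the chromosome start) are assumed to contribute 0. The linear scan over intervals is kept for clarity; an indexed lookup would be preferable on real genome-scale data.

```python
def window_means(signal, center, region_len=6000, window=300, step=300):
    """Mean signal value in each window of the region centered on `center`.

    signal: list of (start, end, value) intervals on one chromosome,
            0-based half-open; uncovered positions count as 0 (assumption).
    """
    region_start = center - region_len // 2
    means = []
    for w_start in range(region_start, region_start + region_len - window + 1, step):
        w_end = w_start + window
        total = 0.0
        for start, end, value in signal:
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                total += value * overlap  # length-weighted contribution
        means.append(total / window)
    return means


def feature_vector(open_signal, meth_signal, g4_start, g4_end):
    """Concatenate 20 openness means and 20 methylation means (default settings)."""
    center = (g4_start + g4_end) // 2
    return window_means(open_signal, center) + window_means(meth_signal, center)
```

With the defaults, 6000 / 300 gives 20 windows per signal, which matches the 40-dimensional vector described in the text.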
(5) Establish the cell-specific G4-DNA training sample set. If a potential G4-DNA forms true G4-DNA in a particular cell, it is a positive sample for that cell; conversely, if a potential G4-DNA does not form G4-DNA in that cell, it is a negative sample.
(6) Establish the cell-specific G4-DNA prediction classifier. The method adopts a machine learning classification model (XGBoost by default) as the classifier, uses the cell-specific sample set constructed in steps (4) and (5) as training, validation, and test samples, and, through data-driven training, yields a classifier of cell-specific G4-DNA sequences that is generally applicable to data from various cells. The classifier takes the chromatin environment feature vector of a potential G4-DNA as input and judges whether it will form G4-DNA in the specific cell environment. The method adopts accuracy, precision, and recall as model evaluation criteria. Let TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative samples, respectively; the three metrics are expressed as follows:
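One way to assign the labels is sketched below. The text does not state how a candidate is matched against the experimental data, so the rule used here, that a candidate is positive when it overlaps any G4 ChIP-seq interval by at least one base, is an assumption, and `label_candidates` is a hypothetical helper name.

```python
def label_candidates(candidates, peaks):
    """Label each candidate 1 if it overlaps an experimental G4 interval, else 0.

    candidates, peaks: lists of BED-style (chrom, chromStart, chromEnd) tuples.
    Assumption: any overlap of at least one base counts as a positive.
    """
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    labels = []
    for chrom, start, end in candidates:
        hit = any(
            p_start < end and start < p_end  # half-open interval overlap
            for p_start, p_end in peaks_by_chrom.get(chrom, [])
        )
        labels.append(1 if hit else 0)
    return labels
```

A stricter rule, such as a minimum overlap fraction, could be substituted without changing the rest of the pipeline.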
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where Accuracy, Precision, and Recall denote the accuracy, precision, and recall of the classifier, respectively.
The method first performs five-fold cross-validation on the training set: the training set is randomly divided into five equal parts; in each round, four parts serve as the training set and the remaining part as the test set for that round. Five rounds are performed and the evaluation metrics are calculated for each. After cross-validation, the XGBoost model is trained on the complete training set and tested on the complete test set, the evaluation metrics are computed, and the final classifier model is obtained.
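The three metrics follow the standard confusion-matrix definitions, which can be computed directly from true and predicted labels. The function name `evaluate` is illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the text specifies.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```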
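The five-fold partitioning can be sketched without any machine learning library. The generator name `five_fold_indices` and the fixed `seed` are illustrative choices; in practice each training split would be used to fit the chosen classifier (XGBoost by default, according to the text) and each held-out split to compute the metrics.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, held_out_indices) for each cross-validation round."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    folds = [indices[i::n_folds] for i in range(n_folds)]  # near-equal parts
    for k in range(n_folds):
        held_out = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, held_out


if __name__ == "__main__":
    for round_no, (train, held_out) in enumerate(five_fold_indices(10), start=1):
        print(round_no, len(train), len(held_out))
```

Every sample is held out exactly once across the five rounds, so the averaged metrics use all of the training data for evaluation.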
(7) Cell-specific G4-DNA prediction. For a cell to be predicted, the feature vector (i.e., the chromatin openness value sequence and the methylation value sequence) of each potential G4-DNA entry of the set GS in that cell is used as input to the prediction classifier, and the classifier outputs the class of each entry (whether or not it is cell-specific G4-DNA), indicating whether the potential sequence can actually form G4-DNA in that cell. Finally, all G4-DNA predicted to exist in the cell is output as a BED file containing the chromosome coordinates of every entry classified as cell-specific G4-DNA.
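The final filtering step can be sketched as follows. The helper `predicted_bed_lines` is hypothetical, and `predict` stands in for whatever trained classifier is used: it is assumed to be any callable that maps one feature vector to 0 or 1.

```python
def predicted_bed_lines(candidates, features, predict):
    """Format the candidates classified as cell-specific G4-DNA as BED lines.

    candidates: list of (chrom, chromStart, chromEnd) entries from the set GS.
    features: list of feature vectors, aligned with `candidates`.
    predict: callable returning 1 (forms G4-DNA in this cell) or 0; a stand-in
             for the trained classifier.
    """
    lines = []
    for (chrom, start, end), vec in zip(candidates, features):
        if predict(vec) == 1:
            lines.append(f"{chrom}\t{start}\t{end}")
    return lines


if __name__ == "__main__":
    # Toy rule in place of a trained model: call positive if the first feature exceeds 0.5.
    toy_rule = lambda vec: 1 if vec[0] > 0.5 else 0
    cands = [("chr1", 100, 120), ("chr2", 5, 30)]
    print("\n".join(predicted_bed_lines(cands, [[0.9], [0.1]], toy_rule)))
```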
The invention further provides an application of the cell-specific G4-DNA prediction method in research on major diseases, particularly in the study of tumor genomes.
Specifically, the application includes the following processes:
(1) Chromatin openness data and methylation data of the target cell are collected, and in vitro G4-DNA data or pattern search data for the species to which the target cell belongs are collected as the candidate G4-DNA data set. The data are cleaned and their features extracted with the method described above, and the resulting features are used as model input, so that likely cell-specific G4-DNA can be screened from the candidate G4-DNA data set.
(2) The cell-specific G4-DNA data obtained by screening are analyzed jointly with the target disease. Taking the study of tumor genomes as an example, such studies generally focus on tumor heterogeneity, gene mutations, and copy number variation. By correlating the distribution of the cell-specific G4-DNA obtained above with the distribution of mutated gene sites or copy number variation sites, and focusing on cell-specific G4-DNA that coincides with a variation site or affects related functions, disease mechanisms can be explained in terms of the structure and function of G4-DNA.
The invention also provides an application of the cell-specific G4-DNA prediction method in single-cell classification research.
Specifically, the application includes the following processes:
(1) Chromatin openness data and methylation data of the target cells are collected, and in vitro G4-DNA data or pattern search data for the species to which the target cells belong are collected as the candidate G4-DNA data set. The data are cleaned and their features extracted with the method described above, and the features generated for each cell are used as model input, so that single-cell-specific G4-DNA data can be screened from the candidate G4-DNA data set.
(2) The single-cell-specific G4-DNA data are used to classify the cells. At present, single cells are mainly classified by clustering on expression data; the invention offers another approach, namely cell typing based on differences in the distribution of single-cell-specific G4-DNA. Specifically, for each cell, a distribution feature vector over the candidate G4-DNA set is derived from that cell's single-cell-specific G4-DNA data, and the distribution feature vectors of all cells are then clustered for typing.
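The distribution feature vector and a pairwise comparison can be sketched as follows. The text does not specify the encoding or the clustering algorithm, so the binary presence/absence encoding and the Jaccard distance used here are assumptions, and both function names are hypothetical. The resulting distances could be passed to any standard clustering procedure.

```python
def distribution_vector(candidates, cell_g4):
    """Binary vector over the candidate set: 1 where the cell carries that G4-DNA."""
    present = set(cell_g4)
    return [1 if entry in present else 0 for entry in candidates]


def jaccard_distance(u, v):
    """1 - |intersection| / |union| of the positions set to 1 in two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


if __name__ == "__main__":
    candidates = [("chr1", 10, 30), ("chr1", 50, 75), ("chr2", 5, 30), ("chr2", 90, 110)]
    cell_a = distribution_vector(candidates, candidates[:2])
    cell_b = distribution_vector(candidates, candidates[2:])
    print(jaccard_distance(cell_a, cell_b))
```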
The idea of the invention is summarized as follows: for a given species of cells, first generating all possible G4-DNA sets of the species as cell-specific candidate G4-DNA sets by analyzing the genomic sequence of the species while collecting known G4-DNA experimental data of the species; then, using biological data known to the corresponding cells to be associated with G4-DNA, including chromatin opening structure data and DNA methylation data, it is predicted whether each candidate G4-DNA will form actual G4-DNA in the corresponding cells. The core is to utilize G4-DNA related biological data of known cells, so that the prediction of G4-DNA has cell specificity.
Advantageous effects: compared with the prior art, the invention has the following advantages:
(1) the G4-DNA prediction method provided by the invention has cell specificity, namely can predict G4-DNA actually existing in a specific cell. The current international G4-DNA prediction method can only predict G4-DNA which may be present in the genome of a species, and cannot predict which G4-DNA is present in a cell. There is no international method for the prediction of cell-specific G4-DNA.
(2) In the research of tumor-related genes, the method can be used for predicting G4-DNA existing in tumor tissues, further fusing tumor-related gene expression data for comprehensive analysis, finding the regulation and control effect of the G4-DNA on the tumor-related genes, revealing the molecular mechanism of tumors, and simultaneously taking the G4-DNA influencing the expression of the tumor genes as a tumor treatment target so as to provide a new target for designing anti-tumor new drugs.
(3) In single-cell research, the method of the invention can predict the G4-DNA likely to exist in each cell of a cell population and, from the distribution of that G4-DNA on the genome, classify the cell population and reveal the connections and differences among cells with respect to G4-DNA.
Drawings
FIG. 1 shows the cell-specific G4-DNA prediction principle;
FIG. 2 is a flow chart of predictive model construction;
FIG. 3 is a schematic diagram of input feature vectors for building a model;
FIG. 4 shows the framework of the tumor research application.
Detailed Description
The invention provides a cell-specific G4-DNA prediction method, which comprises the following steps:
(1) Generate the set of all potential G4-DNA sequences for a given species. Potential G4-DNA sequences comprise two parts: regular G4-DNA and irregular G4-DNA. First, the whole genome sequence of the species (e.g., the human genome sequence) is analyzed to identify regular G4-DNA sequences. Based on the structural characteristics of G4-DNA, a G4-DNA sequence pattern is constructed, denoted $G_xN_{1-7}G_xN_{1-7}G_xN_{1-7}G_x$, where $G_x$ represents x consecutive guanines (G) with x greater than or equal to 3, and $N_{1-7}$ represents a continuous sequence fragment of 1 to 7 arbitrary bases. The genome is scanned with this pattern to find all regular G4-DNA sequences that fit it; the set of these sequences is called $GS_1$. Second, in vitro experimental G4-DNA data for the species are collected, and all irregular G4-DNA sequences that do not satisfy the above pattern are identified; the set of these sequences is called $GS_2$. Let $GS = GS_1 \cup GS_2$; GS is then the set of all potential G4-DNA sequences of the species, that is, the set of candidate sequences for cell-specific G4-DNA. The candidate set is stored as a genome annotation file in BED format, containing the genomic coordinates of each potential G4-DNA sequence. Each line in the BED file represents the position of one G4-DNA in the genome, in the form (chrom, chromStart, chromEnd), where the three fields are the chromosome on which the G4-DNA sequence lies, its start position, and its end position, respectively.
(2) Collect in vivo cell-specific G4-DNA data obtained experimentally for the species, as prior information for constructing the prediction model. In vivo cell-specific G4-DNA data are provided by G4 ChIP-seq sequencing experiments, and the raw experimental data obtained by applying this technique to different cells are collected. To ensure data quality, the method filters out sequences that are too short. The cell-specific G4-DNA set is also stored as a BED file, with entries in the form (chrom, chromStart, chromEnd).
(3) Collect cell-specific chromatin open structure data and DNA methylation data, and calculate from them the genome-wide chromatin open structure signal and methylation distribution signal of the corresponding cell. These signals are regarded as the chromatin environment signals of G4-DNA. First, data obtained for the corresponding cell with the chromatin accessibility assay ATAC-seq are processed to generate the genome-wide chromatin open structure signal. ATAC-seq sequencing data for a specific cell (such as the human chronic myelogenous leukemia cell line K562) are obtained, usually in BedGraph format, which contains the coordinates and openness value of each open region in the form (chrom, chromStart, chromEnd, value); that is, each region is a quadruple consisting of a chromosome, a region start coordinate, a region end coordinate, and an openness value. The method expands this file: entries for all genomic regions that do not appear in the original file (i.e., regions of low openness) are added, and the openness value of each added entry is set to 0. The result is genome-wide chromatin openness information in a BedGraph file. Similarly, data obtained for the corresponding cell with the DNA methylation assay WGBS-seq are processed to form the genome-wide methylation distribution signal. WGBS-seq result data are also stored in BedGraph format, containing the coordinates and methylation level of each hypermethylated region, with each entry in quadruple form. The method expands this file in the same way: entries for all genomic regions that do not appear in the original file (i.e., hypomethylated regions) are added, and the methylation level of each added entry is set to 0. The result is genome-wide DNA methylation information in a BedGraph file.
(4) And establishing a G4-DNA sequence cell-specific chromatin environment characteristic vector. The method is characterized in that the cell-specific chromatin environment information is an important basis for judging whether the cell-specific G4-DNA is formed or not, and the chromatin environment information of the region where a G4-DNA sequence is located is arranged into a corresponding chromatin environment characteristic vector. Specifically, the method determines a chromatin background investigation region of the sequence, and thereby calculates a chromatin environment feature vector of the G4-DNA. The method selects the middle point of each G4-DNA item coordinate as the center, expands respectively upstream and downstream, and finally forms a fixed-length areaThe domain (default length is 6000bp) was considered as the background region of the chromosome environment for each G4-DNA entry. Considering that the length of each G4-DNA sequence chromatin investigation region is larger, a method of calculating the mean value of the region by a sliding window method is adopted to compress data characteristics. Specifically, the method adopts a fixed-length sliding window (the default length is 300bp) to scan the region in a certain step length (the default length is 300bp), and the average value of the chromosome opening degree value/methylation degree value in the window is calculated in each step and is used as the chromosome environment background value of the region contained in the sliding window. If calculated by default, a 20-dimensional sequence of chromosome openness values and a 20-dimensional sequence of methylation values will be obtained. For each G4-DNA sequence, a set of numerical feature entries may be obtained, each represented by a floating-point feature vector of dimension (1, 40): (o)1,o2,…o20,m1,m2,…m20) Wherein o isiAnd miMean values of chromatin patency and methylation in the i-th scan of the window are shown.
(5) Establishing the cell-specific G4-DNA training sample set. If a potential G4-DNA actually forms G4-DNA in a particular cell, it is a positive sample for that cell; conversely, if a potential G4-DNA does not form G4-DNA in that cell, it is a negative sample.
(6) Establishing the cell-specific G4-DNA prediction classifier. The method uses a machine learning classification model (XGBoost by default) as the classifier and uses the cell-specific sample set constructed in steps (4) and (5) as training, validation and test samples. Through data-driven training, a classifier of cell-specific G4-DNA sequences is obtained that is generally applicable to data from various cells. The classifier takes the chromatin environment feature vector of a potential G4-DNA as input and judges whether it will form G4-DNA in the specific cell environment. Accuracy, Precision and Recall are used as the model evaluation criteria. Let TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative samples, respectively; the three indexes are then expressed as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where Accuracy, Precision and Recall denote the accuracy, precision and recall of the classifier, respectively.
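As a small worked example, the three indexes can be computed from the four counts as follows. The sketch assumes the standard definitions of accuracy, precision and recall; the function name and the counts in the usage line are invented for illustration and are not results of the method.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    A ratio with an empty denominator is reported as 0.0 (a convention
    chosen for this sketch).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative counts only.
acc, prec, rec = classification_metrics(tp=8, tn=85, fp=2, fn=5)
print(acc, prec, rec)
```

With strongly imbalanced classes, as in the examples below, accuracy alone can look high while precision stays modest, which is why all three indexes are reported.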
The method first performs five-fold cross-validation on the training set: the training set is randomly divided into five equal parts; in each round four parts are used for training and the remaining part for testing; five rounds of validation are carried out and the evaluation indexes are calculated in each round. After cross-validation, the XGBoost model is trained on the complete training set and tested on the complete test set, the evaluation indexes are calculated, and the final classifier model is obtained.
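The fold split used in five-fold cross-validation can be sketched as below. Only the index bookkeeping is shown; fitting and scoring the classifier (XGBoost in the text) is deliberately left out so that the sketch needs nothing beyond the standard library. The function name and the fixed seed are assumptions made for illustration.

```python
import random


def five_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train_indices, held_out_indices) for each cross-validation round."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        held_out = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, held_out


splits = list(five_fold_indices(10))
for train, held_out in splits:
    # A classifier would be fitted on `train` and scored on `held_out` here.
    print(len(train), len(held_out))
```

Each sample is held out exactly once, so the five sets of evaluation indexes together cover the whole training set.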
(7) Cell-specific G4-DNA prediction. For a cell to be predicted, the feature vector (namely the chromatin openness value sequence and the methylation value sequence) of each potential G4-DNA entry in the set GS in that cell is used as input to the prediction classifier, and the classifier outputs the class of each entry (whether or not it is cell-specific G4-DNA), indicating whether the potential sequence can actually form G4-DNA in that cell. Finally, all G4-DNA predicted to actually exist in the cell is output as a BED file containing the chromosome coordinates of all entries classified as cell-specific G4-DNA.
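The final output step can be sketched as follows, assuming a trained classifier is available as a callable that maps one 40-dimensional feature vector to 1 (cell-specific G4-DNA) or 0. The `toy_predict` rule and the feature values are placeholders invented for illustration; they do not represent the trained model.

```python
from typing import Callable, List, Sequence, Tuple

BedEntry = Tuple[str, int, int]  # (chrom, chromStart, chromEnd)


def predicted_bed_lines(entries: Sequence[BedEntry],
                        features: Sequence[Sequence[float]],
                        predict: Callable[[Sequence[float]], int]) -> List[str]:
    """Return tab-separated BED lines for the entries predicted as positive."""
    return [f"{chrom}\t{start}\t{end}"
            for (chrom, start, end), vec in zip(entries, features)
            if predict(vec) == 1]


def toy_predict(vec: Sequence[float]) -> int:
    """Placeholder standing in for the trained classifier."""
    return 1 if vec[0] > 0.5 else 0


entries = [("chr8", 128748245, 128748495), ("chr4", 671464, 671908)]
features = [[0.9] + [0.0] * 39, [0.1] + [0.0] * 39]  # invented values
bed_lines = predicted_bed_lines(entries, features, toy_predict)
print("\n".join(bed_lines))
```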
Example 1:
A cell-specific G4-DNA prediction model is established for the human chronic myelogenous leukemia cell line K562 and tested in practice. The specific implementation steps are as follows:
Step 1: prepare the candidate G4-DNA sequence set for the K562 cell line. Since K562 is a human cell line, data obtained by in vitro sequencing of human genomic G4-DNA is selected. The complete data set, containing 750,536 candidate sequences, is obtained from the GEO database.
Step 2: collect, from the GEO and ENCODE databases, the chromatin openness ATAC-seq data, the methylation WGBS-seq data and the cell-specific G4-DNA sequencing (G4 ChIP-seq) data of the K562 cell line.
Step 3: preprocess the collected data.
(1) Following the design of the above embodiment, entries shorter than 15 bp are first deleted from the candidate G4-DNA sequence set and from the G4 ChIP-seq data set, leaving 750,481 entries in the candidate G4-DNA data set and 8,883 entries in the G4 ChIP-seq data set. The G4 ChIP-seq entries reflect the G4-DNA actually present in the K562 cell line.
(2) The chromatin accessibility data and the methylation data are preprocessed to remove entries with unknown chromosome positions. For low-openness regions not contained in the ATAC-seq data and low-methylation regions lacking a quantified value in the WGBS-seq data, the method fills all resulting null values with floating-point zeros. After this step, the chromatin openness and methylation information of the whole genome of the K562 cell line is obtained.
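The zero filling can be sketched for a single chromosome as below. The sketch assumes that the BedGraph entries are non-overlapping half-open `(start, end, value)` intervals and that the chromosome length is known; the function name and the toy values are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int, float]  # (chromStart, chromEnd, value)


def fill_gaps(intervals: List[Interval], chrom_length: int) -> List[Interval]:
    """Cover [0, chrom_length) by inserting 0.0-valued entries into every gap."""
    filled = []
    cursor = 0
    for start, end, value in sorted(intervals):
        if start > cursor:
            filled.append((cursor, start, 0.0))  # region missing from the file
        filled.append((start, end, value))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        filled.append((cursor, chrom_length, 0.0))
    return filled


# Toy openness entries for a 500 bp chromosome, invented for illustration.
print(fill_gaps([(100, 200, 1.5), (300, 400, 0.7)], 500))
```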
Step 4: generate the chromatin environment feature vectors. Following the method described in the detailed implementation steps above, the region of each candidate G4-DNA sequence is extended to 6,000 bp and scanned with a sliding window 300 bp wide and a step of 300 bp. For each candidate sequence, a 20-dimensional chromatin openness value sequence and a 20-dimensional methylation value sequence are obtained, and the two are concatenated into a 40-dimensional chromatin environment feature vector.
Step 5: divide the K562 candidate G4-DNA sequence data into positive and negative samples. Following the method described in the implementation steps, a candidate G4-DNA entry whose overlap with a G4 ChIP-seq entry is at least 10% of the length of that G4 ChIP-seq entry is labeled a cell-specific G4-DNA entry (positive sample), and the remaining entries are negative samples. This finally yields 7,767 positive samples and 742,714 negative samples.
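The labeling rule can be sketched as below, assuming half-open BED coordinates and a simple all-pairs comparison. The function names and the toy coordinates are hypothetical; for the 750,481 candidates of this example an indexed interval search would be needed in place of the nested loop.

```python
from typing import List, Tuple

BedEntry = Tuple[str, int, int]  # (chrom, chromStart, chromEnd)


def overlap_len(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_candidates(candidates: List[BedEntry], peaks: List[BedEntry],
                     min_fraction: float = 0.10) -> List[int]:
    """Label a candidate 1 if it covers at least min_fraction of some peak."""
    labels = []
    for chrom, start, end in candidates:
        positive = False
        for p_chrom, p_start, p_end in peaks:
            if p_chrom != chrom or p_end <= p_start:
                continue
            shared = overlap_len(start, end, p_start, p_end)
            if shared > 0 and shared >= min_fraction * (p_end - p_start):
                positive = True
                break
        labels.append(1 if positive else 0)
    return labels


# Toy entries, invented for illustration.
candidates = [("chr8", 100, 400), ("chr8", 800, 1000), ("chr4", 100, 400)]
peaks = [("chr8", 340, 840)]
labels = label_candidates(candidates, peaks)
print(labels)
```

The second candidate shares 40 bp with the 500 bp peak, which is below the 10% threshold, so it is labeled negative.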
Step 6: train and evaluate the model with the processed K562 chromatin environment feature vectors. The K562 candidate G4-DNA data set is divided into two parts, denoted data set A and data set B, where data set A contains 3,883 positive samples and 371,357 negative samples and data set B contains 3,883 positive samples and 371,357 negative samples. The positive samples in data set A are oversampled with the SMOTE method: a positive sample is selected, the positive sample nearest to it is found, and a point is taken at random on the straight line connecting the features of the two samples, which completes the construction of one new positive sample; this process is iterated until the specified number of positive samples is obtained. In this way a data set A' consisting of 371,357 positive samples and 371,357 negative samples is obtained as the model training set. Data set B is oversampled to obtain B', which consists of 7,767 positive samples and 742,714 negative samples and serves as the test set. With XGBoost as the classifier, five-fold cross-validation is performed on the training set to confirm that the model is usable, the model is then trained on all training data, and the evaluation indexes are calculated on the test set. The final performance on the test set is: accuracy 0.991, precision 0.668 and recall 0.965.
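The oversampling procedure described in this step can be sketched in plain Python as follows. The sketch interpolates between a randomly chosen positive sample and its single nearest positive neighbor, as the text describes; common SMOTE implementations choose among several nearest neighbors instead, so this is a simplified variant. The function name, the seed and the two-dimensional toy points are assumptions made for illustration.

```python
import math
import random
from typing import List


def smote_oversample(positives: List[List[float]], n_target: int,
                     seed: int = 0) -> List[List[float]]:
    """Grow the positive set to n_target samples by nearest-neighbor interpolation.

    Requires at least two positive samples.
    """
    rng = random.Random(seed)
    samples = [list(p) for p in positives]
    while len(samples) < n_target:
        base = rng.choice(positives)
        # Nearest other original positive sample (Euclidean distance).
        neighbor = min((p for p in positives if p is not base),
                       key=lambda p: math.dist(base, p))
        t = rng.random()  # random position on the segment base -> neighbor
        samples.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return samples


# Toy 2-D feature vectors; real vectors would be 40-dimensional.
positives = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
augmented = smote_oversample(positives, 10)
print(len(augmented))
```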
Step 7: apply the model in practice. There is experimental evidence that positions 128748245 to 128748495 on chromosome 8 of the human genome contain a G4-DNA pattern and are located on the MYC gene. Sample 1, with coordinates (chr8, 128748245, 128748495), is taken from the candidate G4-DNA sequence set. According to the positive and negative sample division, this candidate sequence overlaps a K562 G4 ChIP-seq peak interval by 10% or more and is therefore considered to form specifically in K562 cells. The environment feature vector of the sequence is used as input to the prediction model, and the output is a positive sample: the model predicts that this G4-DNA sequence exists in K562 cells, which is consistent with the experimental result. On this basis, the exact position of the G4-DNA within the candidate G4-seq signal region can be located by scanning with a G4-DNA pattern algorithm. For example, scanning the candidate G4-DNA sequence with the regular expression $G_{3-5} N_{1-12} G_{3-5} N_{1-12} G_{3-5} N_{1-12} G_{3-5}$ gives the probable G4-DNA sequence GGGGGGCTGCAAACATGGGCAGTCTAAGGGGAAGGGATGGG. This G4-DNA is located on the minus strand of the DNA (its runs of consecutive G bases constitute the G4-DNA), with coordinates (chr8, 128751430, 128751471).
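The pattern scan described above can be sketched with Python's `re` module. The sketch assumes that N stands for any of the four bases and that minus-strand motifs are found by scanning the reverse complement; the function names and the `offset` parameter are hypothetical, and `finditer` reports non-overlapping matches only.

```python
import re
from typing import List, Tuple

# G(3-5) N(1-12) G(3-5) N(1-12) G(3-5) N(1-12) G(3-5), with N taken as any base.
G4_PATTERN = re.compile(r"(?:G{3,5}[ACGT]{1,12}){3}G{3,5}")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def scan_g4(seq: str, offset: int = 0) -> List[Tuple[int, int, str]]:
    """Return (start, end, strand) hits in plus-strand coordinates, half-open."""
    seq = seq.upper()
    n = len(seq)
    hits = [(offset + m.start(), offset + m.end(), "+")
            for m in G4_PATTERN.finditer(seq)]
    # A motif on the minus strand appears as a match in the reverse complement;
    # map its coordinates back onto the plus strand.
    for m in G4_PATTERN.finditer(reverse_complement(seq)):
        hits.append((offset + n - m.end(), offset + n - m.start(), "-"))
    return hits


motif = "GGGGGGCTGCAAACATGGGCAGTCTAAGGGGAAGGGATGGG"
print(scan_g4(motif))
print(scan_g4(reverse_complement(motif), offset=100))
```

Passing the genomic start of the scanned region as `offset` turns the hits into chromosome coordinates.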
Similarly, there is experimental evidence that positions 671464 to 671908 on chromosome 4 of the human genome contain a G4-DNA pattern and are located on the MYL5 gene [1]. Sample 2, with coordinates (chr4, 671464, 671908), is taken from the candidate G4-DNA sequence set. According to the positive and negative sample division, this candidate region overlaps the K562 G4 ChIP-seq entries by less than 10%, so it is considered not to form a G4-DNA structure in K562 cells. The environment feature vector of the sequence is used as input to the model, and the output is a negative sample: the model predicts that this G4-DNA sequence does not exist in K562 cells, which is consistent with the experimental result.
Example 2:
Tumor-associated G4-DNA analysis is performed with the established model. By way of background, G4-DNA may affect biological functions, including gene expression. In this example, data from the human breast cancer cell line MCF7 is selected and the influence of cell-specific G4-DNA on gene expression is analyzed. The specific implementation steps are as follows:
Step 1: prepare the candidate G4-DNA sequence set for the MCF7 cell line. Since MCF7 is a human sample, data obtained by in vitro sequencing of human genomic G4-DNA is selected, and the chromosome 1 subset is taken.
Step 2: collect the chromatin openness ATAC-seq data and the methylation WGBS-seq data of the MCF7 cell line from the GEO and ENCODE databases.
Step 3: preprocess the collected data.
(1) Entries shorter than 15 bp are deleted from the candidate G4-DNA sequence set, leaving 66,441 entries in the candidate data set.
(2) The chromatin accessibility data and the methylation data are preprocessed to remove entries with unknown chromosome positions. For low-openness regions not contained in the ATAC-seq data and low-methylation regions lacking a quantified value in the WGBS-seq data, the method fills all resulting null values with floating-point zeros. After this step, the chromatin openness and methylation information of chromosome 1 of the MCF7 cell line is obtained.
Step 4: generate the chromatin environment features. Following the method described in the detailed implementation steps above, the region of each candidate G4-DNA sequence is extended to 6,000 bp and scanned with a sliding window 300 bp wide and a step of 300 bp. For each candidate sequence, a 20-dimensional chromatin openness value sequence and a 20-dimensional methylation value sequence are obtained, and the two are concatenated into a 40-dimensional chromatin environment feature vector.
Step 5: using the model trained in Example 1, read the chromatin environment features of the samples as input and output the MCF7 cell-specific G4-DNA. The model predictions show that MCF7 has a total of 1,258 cell-specific G4-DNA on chromosome 1.
Step 6: analyze the effect of MCF7 cell-specific G4-DNA on gene expression. Take the CCND1 gene on chromosome 11 of MCF7 as an example: it is located at coordinates (chr11, 69455873, 69469242) and contains a G4-seq signal region (chr11, 69460115, 69460421). Searching this region with the regular expression $G_{3-5} N_{1-12} G_{3-5} N_{1-12} G_{3-5} N_{1-12} G_{3-5}$ finds a G4-DNA sequence, GGGTGGGTCCCGAGGGAGGGGCAGGAGACCAGGGG, which is located at coordinates (chr11, 69460312, 69460347) and is contained in the sense strand of the coding region of the CCND1 gene. A G4-DNA structure located on the sense strand helps keep the DNA currently being transcribed unwound, thereby promoting transcription. In addition, another experiment shows that the CCND1 gene is up-regulated in MCF7 cells compared with normal cells [2]. The cell-specific G4-DNA found by the method is therefore considered to play a regulatory role in the up-regulated expression of the CCND1 gene.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the specific details of these embodiments. Various equivalent modifications may be made to the technical solution within the technical concept of the present invention, and all such modifications fall within the scope of protection of the present invention.
It should be noted that the specific technical features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. The possible combinations are not described further, in order to avoid unnecessary repetition.
In addition, the various embodiments of the present invention may also be combined arbitrarily, and such combinations should likewise be regarded as disclosed by the present invention as long as they do not depart from its spirit.
[1] Chambers V S, Marsico G, Boutell J M, et al. High-throughput sequencing of DNA G-quadruplex structures in the human genome[J]. Nature Biotechnology, 2015, 33(8): 877-881.
[2] Yamaga R, Ikeda K, Horie-Inoue K, et al. RNA sequencing of MCF-7 breast cancer cells identifies novel estrogen-responsive genes with functional estrogen receptor-binding sites in the vicinity of their transcription start sites[J]. Hormones and Cancer, 2013, 4(4): 222-232.

Claims (7)

1. A method for predicting cell-specific G4-DNA, comprising the steps of:
(1) generating the set of all potential G4-DNA sequences of a given species, the potential G4-DNA sequences including regular G4-DNA sequences and irregular G4-DNA sequences;
(2) collecting in vivo cell-specific G4-DNA data of the species obtained by experimental detection: the in vivo cell-specific G4-DNA data is provided by G4 ChIP-seq sequencing experiments; the raw experimental data obtained by applying this technique to different cells is collected to obtain a cell-specific G4-DNA set; sequences shorter than 15 bp are filtered out; the cell-specific G4-DNA set is stored as a BED file whose entries have the form "chrom, chromStart, chromEnd";
(3) collecting cell-specific chromatin open structure data and DNA methylation data:
the cell-specific chromatin open structure data is processed as follows: the sequencing data of the corresponding cell obtained by the chromatin accessibility detection technique ATAC-seq is processed; the data is in BedGraph form and contains the coordinate information and openness value of each open region, expressed as "chrom, chromStart, chromEnd, value", that is, each region is a 4-tuple consisting of the chromosome, the region start coordinate, the region end coordinate and the openness value; all genomic regions that do not appear in the original file are added to the file with their openness value set to 0, yielding genome-wide chromatin openness information in a BedGraph file;
the cell-specific DNA methylation data is processed as follows: the sequencing data of the corresponding cell obtained by the DNA methylation detection technique WGBS-seq is processed; the data is stored in BedGraph form and contains the coordinate information and methylation degree value of each hypermethylated region, expressed as "chrom, chromStart, chromEnd, value", that is, each region is a 4-tuple consisting of the chromosome, the region start coordinate, the region end coordinate and the methylation degree value; all genomic regions that do not appear in the original file are added to the file with their methylation degree value set to 0, yielding genome-wide methylation information in a BedGraph file;
(4) establishing the cell-specific chromatin environment feature vector of each G4-DNA sequence: the midpoint of the coordinates of each G4-DNA entry is taken as the center and the region is extended upstream and downstream to form a fixed-length region, which serves as the chromatin environment background investigation region of that G4-DNA entry, and the data features are compressed by computing region means with a sliding window;
the sliding window calculation is as follows:
the region is scanned with a fixed-length sliding window at a fixed step, and at each step the mean of the chromatin openness values or methylation degree values within the window is calculated and used as the chromatin environment background value of the region covered by the window;
with the default values, a 20-dimensional chromatin openness value sequence and a 20-dimensional methylation value sequence are finally obtained;
for each G4-DNA sequence, one numerical feature entry is obtained, represented by a floating-point feature vector of dimension (1, 40): $(o_1, o_2, \ldots, o_{20}, m_1, m_2, \ldots, m_{20})$, where $o_i$ and $m_i$ respectively denote the region mean of the chromatin openness and the region mean of the methylation degree in the region scanned at the i-th step of the sliding window;
(5) establishing the cell-specific G4-DNA training sample set: if a potential G4-DNA actually forms G4-DNA in a particular cell, it is a positive sample for that cell; conversely, if a potential G4-DNA does not form G4-DNA in that cell, it is a negative sample;
(6) establishing the cell-specific G4-DNA prediction classifier model: the classifier model takes the chromatin environment feature vector of a potential G4-DNA as input and judges whether G4-DNA forms in the specific cell environment; letting TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative samples, respectively, the three indexes are expressed as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where Accuracy, Precision and Recall denote the accuracy, precision and recall of the classifier, respectively;
five-fold cross-validation is performed on the cell-specific G4-DNA training sample set obtained in step (5): the training sample set is randomly divided into five equal parts; in each round four parts are used as the training set and the remaining part as the test set; five rounds of validation are carried out and the evaluation indexes are calculated; after cross-validation, the XGBoost model is trained on the complete training set and tested on the complete test set, the evaluation indexes are calculated, and the cell-specific G4-DNA prediction classifier model is finally obtained;
(7) cell-specific G4-DNA prediction: for a cell to be predicted, the chromatin environment feature vector of each potential G4-DNA entry in the set GS in that cell, namely the chromatin openness value sequence and the methylation value sequence, is used as input to the prediction classifier, and the classifier outputs whether each entry is cell-specific G4-DNA, indicating whether the potential sequence can actually form G4-DNA in that cell; finally, all G4-DNA predicted to actually exist in the cell is output as a BED file containing the chromosome coordinates of all entries classified as cell-specific G4-DNA.
2. The method for predicting cell-specific G4-DNA according to claim 1, wherein in step (1) the regular G4-DNA sequences are obtained as follows: a G4-DNA sequence feature pattern is constructed, expressed as $G_x N_{1-7} G_x N_{1-7} G_x N_{1-7} G_x$, where $G_x$ denotes x consecutive guanines (G), x ≥ 3, and $N_{1-7}$ denotes a continuous sequence segment of 1 to 7 arbitrary bases; the genome is scanned with this pattern to find all regular G4-DNA sequences conforming to it, and the set of these sequences is denoted $GS_1$;
the irregular G4-DNA sequences are obtained as follows: in vitro G4-DNA data of the species obtained by experimental detection is collected, all irregular G4-DNA sequences that do not satisfy the above G4-DNA sequence feature pattern are identified, and the set of these sequences is denoted $GS_2$.
3. The method for predicting cell-specific G4-DNA according to claim 2, wherein $GS = GS_1 \cup GS_2$, so that GS is the set of all potential G4-DNA sequences of the species; the set of candidate sequences is stored as a genome annotation file (BED) containing the genome coordinate information of each potential G4-DNA sequence, and each line in the BED file represents the position of one G4-DNA in the genome, in the form "chrom, chromStart, chromEnd", where chrom, chromStart and chromEnd respectively denote the chromosome on which the G4-DNA sequence is located, its start position on that chromosome and its end position.
4. The method for predicting cell-specific G4-DNA according to claim 1, wherein the default length of the fixed length region in step (4) is 6000 bp.
5. The method for predicting cell-specific G4-DNA according to claim 1, wherein in step (4) the default length of the fixed-length sliding window is 300 bp and the default step is 300 bp.
6. Use of the cell-specific G4-DNA prediction method according to claim 1 in the study of tumor genomes, comprising the steps of:
(S1) predicting the presence of G4-DNA: collecting ATAC-seq and WGBS-seq data of the tumor tissue, and predicting G4-DNA in the corresponding tumor tissue according to signal characteristics of the two types of data;
(S2) analyzing the possible gene regulation function of genomic G4-DNA in tumors: differentially expressed genes in the tumor tissue are identified, the G4-DNA on these gene sequences is analyzed, and the position of each G4-DNA on its gene is determined; the positions on a gene include the promoter upstream of the gene, the exons and introns of the gene, and the termination signal region downstream of the gene. Furthermore, from the position of each G4-DNA on its gene, combined with the expression change of that gene in the tumor tissue, the gene regulation function that each G4-DNA may exert is inferred.
7. Use of the cell-specific G4-DNA prediction method of claim 1 in single cell sorting studies, comprising the steps of:
(W1) predicting the presence of G4-DNA in each cell of a cell population using the cell-specific G4-DNA prediction method of claim 1; collecting chromatin open structure data and DNA methylation data of each cell in a cell population by using a single cell biological detection technology, and predicting G4-DNA in each cell according to signal characteristics of the two types of data;
(W2) classifying the cell population according to the distribution of G4-DNA on the genome and identifying the similarities and differences between cells in terms of G4-DNA; for each cell in the cell population, the G4-DNA in the cell is mapped to specific genes and its position on each gene is determined, the positions including the promoter region upstream of the gene, the exon regions of the gene, the intron regions of the gene and the transcription termination region downstream of the gene, so as to form the G4-DNA distribution feature vector of the cell; the cell population is then clustered according to these feature vectors, so that cells with similar G4-DNA distributions are grouped together in the clustering space, while cells in different clusters differ markedly in their G4-DNA distribution.
CN202110030502.9A 2021-01-11 2021-01-11 Prediction method of cell-specific genome G-quadruplex Active CN113160877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110030502.9A CN113160877B (en) 2021-01-11 2021-01-11 Prediction method of cell-specific genome G-quadruplex

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110030502.9A CN113160877B (en) 2021-01-11 2021-01-11 Prediction method of cell-specific genome G-quadruplex

Publications (2)

Publication Number Publication Date
CN113160877A true CN113160877A (en) 2021-07-23
CN113160877B CN113160877B (en) 2022-11-25

Family

ID=76878306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110030502.9A Active CN113160877B (en) 2021-01-11 2021-01-11 Prediction method of cell-specific genome G-quadruplex

Country Status (1)

Country Link
CN (1) CN113160877B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113930430A (en) * 2021-12-14 2022-01-14 中国农业大学 Regulation of functional nucleic acid sites corresponding to genes that activate thermogenesis
CN114464261A (en) * 2022-04-12 2022-05-10 天津诺禾致源生物信息科技有限公司 Method and apparatus for assembling elongated sex chromosomes
CN114842914A (en) * 2022-04-24 2022-08-02 山东大学 Chromatin loop prediction method and system based on deep learning
CN116110493A (en) * 2023-03-20 2023-05-12 电子科技大学长三角研究院(衢州) Data set construction method for G-quadruplex prediction model and prediction method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103667455A (en) * 2013-11-19 2014-03-26 眭维国 Analyzing method for expression difference of histone H3K9me3 of organ gene and gene model
CN110544509A (en) * 2019-08-20 2019-12-06 广州基迪奥生物科技有限公司 single-cell ATAC-seq data analysis method
CN111312329A (en) * 2020-02-25 2020-06-19 成都信息工程大学 Transcription factor binding site prediction method based on deep convolution automatic encoder

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113930430A (en) * 2021-12-14 2022-01-14 中国农业大学 Regulation of functional nucleic acid sites corresponding to genes that activate thermogenesis
CN113930430B (en) * 2021-12-14 2022-03-18 中国农业大学 Regulation of functional nucleic acid sites corresponding to genes that activate thermogenesis
CN114464261A (en) * 2022-04-12 2022-05-10 天津诺禾致源生物信息科技有限公司 Method and apparatus for assembling elongated sex chromosomes
CN114464261B (en) * 2022-04-12 2022-07-01 天津诺禾致源生物信息科技有限公司 Method and apparatus for assembling extended sex chromosomes
CN114842914A (en) * 2022-04-24 2022-08-02 山东大学 Chromatin loop prediction method and system based on deep learning
CN114842914B (en) * 2022-04-24 2024-04-05 山东大学 Deep learning-based chromatin ring prediction method and system
CN116110493A (en) * 2023-03-20 2023-05-12 电子科技大学长三角研究院(衢州) Data set construction method for G-quadruplex prediction model and prediction method thereof
CN116110493B (en) * 2023-03-20 2023-06-20 电子科技大学长三角研究院(衢州) Data set construction method for G-quadruplex prediction model and prediction method thereof

Also Published As

Publication number Publication date
CN113160877B (en) 2022-11-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant