CN117095744A - Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data - Google Patents
- Publication number
- CN117095744A (application number CN202311056237.7A)
- Authority
- CN
- China
- Prior art keywords
- copy number
- number variation
- data
- gene
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data. The genome sequencing data are first aligned to a human reference genome, and the number of sequencing fragments (reads) on each gene is counted to obtain an expression matrix; the expression matrix is then input into a detection model to obtain a copy number variation detection result. The detection model is obtained by training a deep neural network model on a data set of preprocessed samples from a known database. The detection method gives a better understanding of the genome structure and variation of a single individual. At lower cost and in less time, it provides gene copy number information in addition to gene expression level information, which helps in understanding the functions and regulatory mechanisms of the genome and the influence of copy number variation on gene expression, and is of significance for research on hereditary diseases, rare diseases and complex diseases.
Description
Technical Field
The invention relates to the technical field of bioinformatics analysis, in particular to a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data.
Background
Copy number variation (CNV) refers to a change in the copy number of a region on a chromosome, i.e., an increase or decrease in the number of copies of the DNA sequence in that region. CNVs are common in the human genome and have been associated with the occurrence and progression of some human diseases.
CNVs can involve large genomic fragments, even whole genes or multiple genes, affecting gene expression and function. They may have a significant impact on susceptibility to human disease, disease risk and clinical manifestations. Some CNVs are closely related to the occurrence of genetic diseases, such as certain hereditary cancers, neurological disorders (e.g., autism and intellectual disability), and certain congenital heart diseases.
In addition, CNVs can affect drug response and an individual's responsiveness to drug treatment. Some CNVs change the number or function of drug-metabolizing enzymes or drug targets, thereby affecting the metabolism and efficacy of the drug in vivo.
It is therefore important to detect disease-associated CNVs and to further understand the contribution of these variations to disease and the mechanisms involved. Such studies help to deepen understanding of disease, develop personalized medicine, and improve methods of disease prevention and treatment.
There are three main sequencing strategies for CNV analysis, namely whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted sequencing, and detection involves a variety of algorithms and methods. The following are some common algorithms:
1. Read depth analysis: this method infers copy number variation from the distribution density of sequencing reads across the genome. By comparing the read depths of the sample and a reference, regions of increased or decreased copy number can be identified. However, read depth analysis is limited in detecting smaller CNVs and is susceptible to factors such as sequencing depth and regional GC content.
2. Breakpoint analysis (split reads): this method detects CNVs by analyzing the breakpoint positions of copy number variations. It can use paired-end or long-read sequencing data to find breakpoint regions and infer the location and size of copy number variations. However, breakpoint analysis requires high-quality sequencing data and accurate breakpoint positioning, and is challenging for complex structural variations.
3. Segment comparison (segment-based methods): these methods divide the genome into consecutive segments and compare the read depth or other characteristics of each segment. By detecting copy number differences between segments, the presence of CNVs can be determined. However, segment comparison methods may produce false positives or false negatives when identifying small CNVs and complex structural variations.
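As an illustration of the read depth idea described above, the following minimal sketch computes a per-bin log2 ratio between sample and reference depth and labels each bin. It is not taken from the patent: the function names, the pseudocount and the thresholds are assumptions chosen for illustration, and the depths are assumed to be already normalized for library size.

```python
import math

def log2_depth_ratio(sample_depth, reference_depth, pseudocount=1.0):
    """Per-bin log2 ratio of sample depth to reference depth.

    The pseudocount (an assumption) avoids division by zero in empty bins.
    """
    return [
        math.log2((s + pseudocount) / (r + pseudocount))
        for s, r in zip(sample_depth, reference_depth)
    ]

def call_bins(ratios, gain_threshold=0.4, loss_threshold=-0.6):
    """Label each bin from its log2 ratio; the thresholds are illustrative only."""
    labels = []
    for ratio in ratios:
        if ratio >= gain_threshold:
            labels.append("gain")
        elif ratio <= loss_threshold:
            labels.append("loss")
        else:
            labels.append("normal")
    return labels

# Toy depths for six bins; the reference is flat at 100.
sample = [100, 98, 150, 152, 49, 101]
reference = [100, 100, 100, 100, 100, 100]
ratios = log2_depth_ratio(sample, reference)
print(call_bins(ratios))
```

A sketch like this also shows why the approach depends on a reference: the ratio is only meaningful when the reference depth reflects a normal copy number under comparable experimental conditions.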
In contrast, there are few schemes for detecting CNVs from transcriptome data, probably for the following reasons:
1) Transcriptome sequencing data contain noise and technical errors, such as sequencing errors, alignment errors and expression estimation errors. These errors may affect the results of CNV detection and require appropriate data correction, so conventional CNV analysis methods may not be applicable;
2) The gene expression signal and the CNV signal in the data can interfere with each other, so the detection of copy number variation is confounded by gene expression;
3) The coverage of the genome by RNA-seq is relatively sparse and uneven: some regions may have high coverage while other regions are under-covered. Such uneven coverage affects the accuracy and sensitivity of CNV detection, making it highly challenging, if not impossible, to detect accurate CNV breakpoints and small CNV fragments with a depth-comparison approach.
4) Current sequencing-based CNV analysis generally relies on the construction of a reference baseline. A reference baseline is built by sequencing and analyzing a large number of individuals to determine the genomic variation in the normal population. By comparison with the reference baseline, variations in an individual genome can be determined, and the presence and copy number of CNVs can be inferred. However, the construction of a reference baseline also has limitations, such as:
i) Sample number and diversity limitations: the quality and representativeness of the reference baseline depend on the number and type of samples it contains. If the number of reference baseline samples is limited or the samples are not sufficiently diverse, some population-specific CNVs may not be captured accurately.
ii) Detection of rare and individual-specific CNVs: reference baselines are primarily concerned with common CNVs. Because different samples and data sets may differ, a baseline may not provide accurate information for rare or individual-specific CNVs, even though these variations may have a significant impact on the disease susceptibility and phenotypic characteristics of an individual.
iii) Changes in the experimental environment: these may include changes in temperature, humidity, illumination, etc., or changes in experimental equipment or reagent lots. Such changes lead to inconsistent experimental conditions, resulting in large differences between the samples used to construct the baseline and the sample actually being analyzed, and can have a large impact on the results of the analysis.
Therefore, current transcriptome sequencing data are mostly used to measure the expression levels of genes and transcripts in order to estimate gene activity, or to identify single nucleotide polymorphisms (SNPs) and short indels. However, the data contain a large amount of information about genomic variation in the sample that is not fully utilized. Among these variations, copy number variations (CNVs) are very important for cancer research, as they are the primary genetic driver of cancer. Identifying CNVs from RNA-seq data is nevertheless very challenging, because the dynamic and highly heterogeneous coverage of the genome by the RNA-seq signal makes it difficult to distinguish deletion and amplification events from dynamic changes in gene expression levels.
Thus, conventional CNV analysis methods that rely solely on reference baselines and read depth have limitations when applied to transcriptome data, and there is a strong need for a flexible and accurate method of detecting CNVs.
Disclosure of Invention
The invention aims to provide a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data, which overcomes the limitations of the traditional CNV analysis method and realizes flexible and accurate detection of CNV.
In view of this, the scheme of the invention is as follows:
A copy number variation detection method based on single-sample high-throughput transcriptome sequencing data comprises the following steps:
aligning the genome sequencing data to a human reference genome, and counting the number of sequencing fragments on each gene to obtain an expression matrix;
inputting the expression matrix into a detection model to obtain a copy number variation detection result;
wherein the detection model is obtained by training a deep neural network model on a data set of preprocessed samples from a known database; the preprocessing comprises standardizing the gene expression levels of the database samples and classifying the copy number variation types.
Further, before alignment, the genome sequencing data are preprocessed to remove low-quality sequences and trim consecutive low-quality bases.
Further, before training, the copy number types of the database samples are converted to numerical values that can be used by the deep learning algorithm.
Further, the deep neural network model initializes the weights and biases of the neural network by random initialization.
Further, the hidden layers are activated during training of the detection model: the sample gene expression level x is mapped to max(0, x), i.e., x is output when x is greater than 0, and 0 is output otherwise.
Further, the output layer is activated during training of the detection model: the input vector is normalized and each element is converted to a probability value.
Further, a cross-entropy loss function is used during training of the detection model and is minimized.
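The three operations named above can be written compactly as follows. The notation is an assumption introduced here and does not appear in the source: z denotes the vector of output-layer inputs, K the number of copy number classes, p_k the predicted probability of class k, and y the one-hot label of the sample.

```latex
\[
\mathrm{ReLU}(x) = \max(0, x), \qquad
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad
\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k
\]
```

Because y is one-hot, the loss reduces to the negative log-probability that the model assigns to the true class, so minimizing it pushes that probability toward 1.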
The invention also provides a copy number variation detection system based on single-sample high-throughput transcriptome sequencing data, which comprises:
an alignment and calculation module: aligning the genome sequencing data to a human reference genome, and counting the number of sequencing fragments on each gene to obtain an expression matrix;
a detection module: inputting the expression matrix into a detection model to obtain a copy number variation detection result;
wherein the detection model is obtained by training a deep neural network model on a data set of preprocessed samples from a known database; the preprocessing comprises standardizing the gene expression levels of the database samples and classifying the copy number variation types.
The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above-described detection method when the processor executes the computer program.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the detection method described above.
Compared with the prior art, the beneficial effects of the invention include, but are not limited to:
The detection method provided by the invention performs CNV analysis on a single sample, so that the genome structure and variation of a single individual can be better understood. At lower cost and in less time, it provides gene copy number information in addition to gene expression level information, so that the functions and regulatory mechanisms of the genome and the influence of copy number variation on gene expression can be better understood. The detection method can identify potential disease-related CNVs, thereby aiding in the diagnosis and prognosis of disease. This is of great importance for the study of genetic, rare and complex diseases.
Drawings
FIG. 1 is a flow chart of a method for detecting copy number variation of single sample high throughput transcriptome sequencing data according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantageous technical effects of the present invention more apparent, the present invention will be described in further detail with reference to the following detailed description. It should be understood that the detailed description is intended to illustrate the invention, and not to limit the invention.
CNV analysis using a single sample can address several pain points and limitations, including:
Sample collection is difficult: conventional CNV analysis typically requires a large number of samples to produce meaningful results. However, in some cases, it may be difficult or expensive to obtain a sufficient number of samples. CNV analysis using a single sample avoids this sample collection problem, making the analysis more convenient and feasible.
The cost and time of constructing the reference baseline are high: conventional CNV analysis typically requires long sample collection times and high assay costs. Analyzing a single sample saves the time and cost of collecting samples and allows results to be obtained faster.
In order to solve the cost problem caused by the need to construct a baseline for detecting copy number variation in transcriptome data, and the problem that traditional DNA-based CNV detection methods cannot be applied to RNA-seq data, one embodiment provides a copy number variation detection method based on single-sample high-throughput genome capture sequencing data. The flowchart is shown in FIG. 1, and the method specifically includes the following steps:
1. The patient's genome sequencing data are quality-processed using fastp software: low-quality sequences are removed and consecutive low-quality bases are trimmed. The high-quality sequences are then aligned to the human reference genome hg19.
2. A gene annotation file (e.g., a GTF or GFF file), which provides the positions of the transcripts and exons of each gene, is then used to determine the location of each gene. Reads are assigned to genes according to the alignment results, and the number of reads aligned to each gene is counted to obtain an expression matrix.
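The counting in step 2 can be sketched as a simple interval-overlap count. This is a hedged illustration only: the source does not name a counting tool, and the function name, the tuple layout, the 0-based half-open coordinates and the gene identifiers below are all hypothetical. A read overlapping several genes is counted once for each, which is a simplification.

```python
def count_reads_per_gene(alignments, genes):
    """Count aligned reads overlapping each gene.

    alignments: iterable of (chrom, start, end), 0-based half-open (assumed).
    genes: dict mapping gene_id -> (chrom, start, end), same convention.
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chrom, start, end in alignments:
        for gene_id, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == g_chrom and start < g_end and end > g_start:
                counts[gene_id] += 1
    return counts

# Hypothetical gene coordinates, as would be parsed from a GTF/GFF file.
genes = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 300, 400),
}
# Hypothetical aligned reads.
alignments = [
    ("chr1", 90, 140),   # overlaps GENE_A
    ("chr1", 150, 250),  # overlaps GENE_A
    ("chr1", 350, 380),  # overlaps GENE_B
    ("chr2", 100, 150),  # different chromosome
    ("chr1", 200, 300),  # falls exactly between the two genes
]
counts = count_reads_per_gene(alignments, genes)
print(counts)
```

The nested loop is quadratic and only suitable for a toy example; a real pipeline would use an indexed structure or a dedicated read-counting program.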
3. Sample data are then downloaded from a public database for constructing the model; each sample has gene expression levels and the corresponding copy number variation types. The specific method is as follows:
1) Data preprocessing: the average value (μ) and standard deviation (σ) of each gene expression amount (x) were calculated, and for each sample gene, normalization was performed using the following formula:
z=(x-μ)/σ;
2) Copy number types are classified as normal (copy number equal to 2), deletion (copy number less than 2), or gain (copy number greater than 2), and one-hot encoding is used to convert the copy number types into a numerical representation that can be used by the deep learning algorithm.
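The two preprocessing operations of step 3 can be sketched as follows. The function names, the class labels and their order are assumptions made for illustration, and the population standard deviation is used because the source does not state which estimator is meant.

```python
import statistics

def zscore_by_gene(expression):
    """Apply z = (x - mu) / sigma to each gene across samples.

    expression: dict mapping gene_id -> list of expression values, one per sample.
    """
    normalized = {}
    for gene_id, values in expression.items():
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)  # population SD (assumption)
        if sigma == 0:
            # A constant gene carries no signal; map it to zeros instead of dividing by 0.
            normalized[gene_id] = [0.0 for _ in values]
        else:
            normalized[gene_id] = [(x - mu) / sigma for x in values]
    return normalized

# Class order is an arbitrary choice for this sketch.
CLASSES = ("deletion", "normal", "gain")

def copy_number_class(copy_number):
    """Classify a copy number relative to the diploid value of 2."""
    if copy_number < 2:
        return "deletion"
    if copy_number > 2:
        return "gain"
    return "normal"

def one_hot(label):
    """Encode a class label as a 0/1 vector in the order given by CLASSES."""
    return [1 if label == c else 0 for c in CLASSES]

z = zscore_by_gene({"GENE_A": [2.0, 4.0, 6.0]})
print(z["GENE_A"])
print(one_hot(copy_number_class(3)))
```

After normalization each gene has mean 0 and unit variance across the database samples, which puts genes with very different expression ranges on a comparable scale.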
4. The gene expression and copy number data obtained above are then divided into a training set and a test set, and the model is built and trained as follows:
1) Initializing network parameters: the weights and biases of the neural network are initialized randomly.
2) Defining the network structure: an input layer (with 3 features), two hidden layers (each with 16 neurons) and an output layer (with 3 neurons) are provided.
3) Activating the hidden layers: the hidden layers are activated using the ReLU activation function, which maps the input x to max(0, x), i.e., x is output when x is greater than 0, and 0 is output otherwise.
4) Activating the output layer: the output layer is activated using the Softmax activation function. The input vector is normalized so that the value of each element is converted to a probability between 0 and 1 and the sum of all elements is 1.
5) Parameter optimization: a cross-entropy loss function is used during training and is minimized using the Adam optimizer to improve the accuracy of the model.
6) Model verification: the test set is used as input to the model to verify its accuracy; the results are shown in Table 1.
Table 1:
Gene | Prediction accuracy |
---|---|
ENSG00000000457.14 | 0.9122 |
ENSG00000000460.17 | 0.9298 |
ENSG00000000938.13 | 0.9298 |
ENSG00000000971.16 | 0.9122 |
ENSG00000001460.18 | 0.9298 |
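The network described in step 4 (3 input features, two hidden layers of 16 neurons, 3 output neurons, ReLU, Softmax, cross-entropy) can be sketched in plain Python as a forward pass plus loss evaluation. This is a minimal illustration and not the patented implementation: the Gaussian initialization scale, the seed and the toy input are assumptions, the source does not say what the 3 input features are, and the Adam parameter updates are omitted because a deep learning framework would normally supply them.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Randomly initialize the weights and biases of one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0.0, 0.1) for _ in range(n_out)]
    return weights, biases

def dense(x, layer):
    """Affine transform: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Normalize a vector into probabilities that sum to 1."""
    m = max(v)  # subtracting the maximum improves numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot_target):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)

def forward(x, layers):
    """ReLU on every hidden layer, Softmax on the output layer."""
    h = x
    for layer in layers[:-1]:
        h = relu(dense(h, layer))
    return softmax(dense(h, layers[-1]))

rng = random.Random(0)    # fixed seed so the sketch is reproducible
sizes = [3, 16, 16, 3]    # input, two hidden layers, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

probs = forward([0.5, -1.2, 0.3], layers)  # toy normalized input
loss = cross_entropy(probs, [0, 1, 0])     # toy one-hot label
print(probs, loss)
```

Training would repeat this forward pass over the training set and adjust the weights and biases to reduce the loss; accuracy figures such as those in Table 1 would then come from comparing the most probable class with the known label on the test set.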
5. Finally, the expression matrix of the sample is input into the model to obtain the copy number prediction result for each gene.
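Turning the model output into a per-gene call can be sketched as picking the most probable class for each gene. The label names, the class order, the function name and the gene identifiers below are assumptions for illustration; the probabilities are made-up values, not results from the patent.

```python
# Assumed class order of the three output neurons.
CLASS_LABELS = ("Deletion", "Normal", "Gain")

def decode_predictions(gene_ids, probabilities):
    """Map each gene's class probabilities to its most probable label."""
    results = {}
    for gene_id, probs in zip(gene_ids, probabilities):
        best = max(range(len(CLASS_LABELS)), key=lambda k: probs[k])
        results[gene_id] = CLASS_LABELS[best]
    return results

# Hypothetical Softmax outputs for two hypothetical genes.
calls = decode_predictions(
    ["GENE_A", "GENE_B"],
    [[0.05, 0.15, 0.80], [0.10, 0.85, 0.05]],
)
print(calls)
```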
In the above embodiment, the detection method is used to detect whether copy number variation is present in a region of a chromosome in a single sample; it cannot directly diagnose a disease. The single-sample CNV results serve only to give a better understanding of the functions and regulatory mechanisms of the genome and of the influence of copy number variation on gene expression, which is of significance for research on genetic diseases, rare diseases and complex diseases.
The following is a CNV detection example using a particular set of genome sequencing data:
(1) Sequencing data pretreatment
FASTQ data were obtained; the statistics are shown in Table 2.
Table 2:
Sample | Total reads | Total bases (bp) | Q20 (%) | Q30 (%) |
---|---|---|---|---|
read1 | 54593963 | 8189094450 | 97.52 | 93.27 |
read2 | 54593963 | 8189094450 | 97.52 | 93.27 |
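Q20 and Q30 are the percentages of bases whose Phred quality score is at least 20 and at least 30, respectively. The sketch below shows how such statistics can be computed from FASTQ quality strings; it assumes the common Phred+33 encoding, and the function name and the toy quality strings are hypothetical.

```python
def quality_fractions(quality_strings, offset=33):
    """Return (total bases, Q20 %, Q30 %) for Phred-encoded quality strings."""
    total = q20 = q30 = 0
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset  # decode the ASCII character to a Phred score
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        return 0, 0.0, 0.0
    return total, 100.0 * q20 / total, 100.0 * q30 / total

# Two toy reads of five bases each: 'I' = Q40, '5' = Q20, '?' = Q30, '#' = Q2, '+' = Q10.
stats = quality_fractions(["IIII5", "II?#+"])
print(stats)
```

As a consistency check on Table 2, 54593963 reads of 150 bases each give 8189094450 bases, which matches the reported total.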
(2) Fastq data processing
After quality control, high quality sequences were obtained and the data statistics are shown in Table 3.
Table 3:
(3) Sequence to reference genome alignment
The alignment of the sequence data with the human reference genome hg19 is shown in table 4.
Table 4:
(4) The number of reads per gene was calculated as shown in Table 5.
Table 5:
(5) The expression matrix was input into the detection model to obtain the copy number variation detection results, as shown in Table 6.
Table 6:
Gene | Variation type |
---|---|
ENSG00000007908.16 | Gain |
ENSG00000007923.16 | Gain |
ENSG00000007933.13 | Gain |
ENSG00000007968.7 | Normal |
ENSG00000008118.10 | Normal |
ENSG00000008128.23 | Normal |
ENSG00000008130.15 | Normal |
ENSG00000009307.16 | Normal |
The present invention is not limited to the details and embodiments described herein. Further advantages and modifications may readily be achieved by those skilled in the art, so the invention is not limited to the specific details, representative solutions and examples described herein, provided they do not depart from the spirit and scope of the general concept defined by the claims and their equivalents.
Claims (10)
1. A copy number variation detection method based on single-sample high-throughput transcriptome sequencing data, comprising the following steps:
aligning the genome sequencing data to a human reference genome, and counting the number of sequencing fragments on each gene to obtain an expression matrix;
inputting the expression matrix into a detection model to obtain a copy number variation detection result;
wherein the detection model is obtained by training a deep neural network model on a data set of preprocessed samples from a known database; the preprocessing comprises standardizing the gene expression levels of the database samples and classifying the copy number variation types.
2. The method of claim 1, wherein, before alignment, the genome sequencing data are preprocessed to remove low-quality sequences and trim consecutive low-quality bases.
3. The method of claim 1, wherein, before training, the copy number types of the database samples are converted to numerical values that can be used by a deep learning algorithm.
4. The method of claim 1, wherein the deep neural network model initializes the weights and biases of the neural network by random initialization.
5. The method according to claim 1, wherein the hidden layers are activated during training of the detection model, and the sample gene expression level x is mapped to max(0, x), i.e., x is output when x is greater than 0, and 0 is output otherwise.
6. The method according to claim 1, wherein the output layer is activated during training of the detection model, the input vector is normalized, and each element is converted to a probability value.
7. The method according to claim 1, wherein a cross-entropy loss function is used during training of the detection model and is minimized.
8. A copy number variation detection system based on single-sample high-throughput transcriptome sequencing data, comprising:
an alignment and calculation module: aligning the genome sequencing data to a human reference genome, and counting the number of sequencing fragments on each gene to obtain an expression matrix;
a detection module: inputting the expression matrix into a detection model to obtain a copy number variation detection result;
wherein the detection model is obtained by training a deep neural network model on a data set of preprocessed samples from a known database; the preprocessing comprises standardizing the gene expression levels of the database samples and classifying the copy number variation types.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311056237.7A CN117095744A (en) | 2023-08-21 | 2023-08-21 | Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117095744A (en) | 2023-11-21 |
Family
ID=88771058
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311056237.7A Pending CN117095744A (en) | 2023-08-21 | 2023-08-21 | Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117095744A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110648721A (en) * | 2019-09-19 | 2020-01-03 | 北京市儿科研究所 | Method and device for detecting copy number variation by aiming at exon capture technology |
CN111210873A (en) * | 2020-01-14 | 2020-05-29 | 西安交通大学 | Exon sequencing data-based copy number variation detection method and system, terminal and storage medium |
CN111276187A (en) * | 2020-01-12 | 2020-06-12 | 湖南大学 | Gene expression profile feature learning method based on autoencoder |
CN111599407A (en) * | 2020-05-13 | 2020-08-28 | 北京橡鑫生物科技有限公司 | Method and device for detecting copy number variation |
CN112634987A (en) * | 2020-12-25 | 2021-04-09 | 北京吉因加医学检验实验室有限公司 | Method and device for detecting copy number variation of single-sample tumor DNA |
CN113903395A (en) * | 2021-10-28 | 2022-01-07 | 聊城大学 | BP neural network copy number variation detection method and system for improving particle swarm optimization |
CN114566209A (en) * | 2022-03-03 | 2022-05-31 | 四川大学 | Training method and application of mycobacterium tuberculosis drug resistance prediction model based on hierarchical attention neural network |
CN115171779A (en) * | 2022-07-13 | 2022-10-11 | 浙江大学 | Cancer driver gene prediction device based on graph attention network and multi-omics fusion |
CN115249513A (en) * | 2021-12-14 | 2022-10-28 | 聊城大学 | Neural network copy number variation detection method and system based on Adaboost integration idea |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Barfield et al. | Transcriptome‐wide association studies accounting for colocalization using Egger regression | |
Hsu et al. | Denoising array-based comparative genomic hybridization data using wavelets | |
US7454293B2 (en) | Methods for enhanced detection and analysis of differentially expressed genes using gene chip microarrays | |
Taylor | Implementation and accuracy of genomic selection | |
CN103201744B (en) | For estimating the method that full-length genome copies number variation | |
US7937225B2 (en) | Systems, methods and software arrangements for detection of genome copy number variation | |
CN109887546B (en) | Single-gene or multi-gene copy number detection system and method based on next-generation sequencing | |
CN107408163B (en) | Method and apparatus for analyzing gene | |
JP2018522531A5 (en) | ||
Nevado et al. | Resequencing studies of nonmodel organisms using closely related reference genomes: optimal experimental designs and bioinformatics approaches for population genomics | |
CN110648721B (en) | Method and device for detecting copy number variation by aiming at exon capture technology | |
CN114049914B (en) | Method and device for integrally detecting CNV, uniparental disomy, triploid and ROH | |
CN112634987A (en) | Method and device for detecting copy number variation of single-sample tumor DNA | |
CN111210873B (en) | Exon sequencing data-based copy number variation detection method and system, terminal and storage medium | |
Eichner et al. | Support vector machines-based identification of alternative splicing in Arabidopsis thaliana from whole-genome tiling arrays | |
CN109461473B (en) | Method and device for acquiring concentration of free DNA of fetus | |
Gong et al. | MethCP: differentially methylated region detection with change point models | |
Mezey et al. | Coordinated evolution of co-expressed gene clusters in the Drosophila transcriptome | |
CN117095744A (en) | Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data | |
WO2023196928A2 (en) | True variant identification via multianalyte and multisample correlation | |
CN113284558B (en) | Method for distinguishing gene expression difference and long copy number variation in RNA sequencing data | |
Schrider et al. | Detecting highly differentiated copy-number variants from pooled population sequencing | |
CN114694752B (en) | Method, computing device and medium for predicting homologous recombination repair defects | |
US20150094223A1 (en) | Methods and apparatuses for diagnosing cancer by using genetic information | |
CN116508105A (en) | Genomic marker interpolation based on haplotype blocks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||