CN117095744A - Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Publication number
CN117095744A
Authority
CN
China
Prior art keywords
copy number
data
number variation
detection
genome
Prior art date
Legal status
Pending
Application number
CN202311056237.7A
Other languages
Chinese (zh)
Inventor
钟建伟
柳佳琦
Current Assignee
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Original Assignee
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority date
2023-08-21
Filing date
2023-08-21
Publication date
2023-11-21
Application filed by Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority to CN202311056237.7A
Publication of CN117095744A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data. The method first aligns the genome sequencing data to the human reference genome and counts the sequenced fragments aligned to each gene to obtain an expression matrix. The expression matrix is then input into a detection model to obtain the copy number variation detection result. The detection model is based on a deep neural network model and is trained on a data set of preprocessed samples from a known database. The method gives a better understanding of the genome structure and variation of a single individual. At lower cost and in less time, it provides both gene expression levels and gene copy number information, which helps in understanding genome function and regulatory mechanisms and the effect of copy number variation on gene expression, and is of great significance for research on genetic diseases, rare diseases and complex diseases.

Description

A copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Technical field

The invention relates to the technical field of bioinformatics analysis, and in particular to a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data.

Background

Copy number variation (CNV) is a change in the copy number of a region of a chromosome, that is, an increase or decrease in the number of times the DNA sequence of that region is repeated. CNVs are common in the human genome and have been associated with the onset and progression of several human diseases.

CNVs can involve large genomic segments, an entire gene or even multiple genes, and thereby affect gene expression and function. They may have an important influence on susceptibility to disease, disease risk and clinical presentation. Some CNVs are closely linked to genetic diseases, such as certain hereditary cancers, neurodevelopmental disorders (for example autism and intellectual disability) and certain congenital heart diseases.

CNVs can also affect drug response and an individual's responsiveness to drug treatment. Some CNVs change the quantity or function of drug-metabolizing enzymes or drug targets, and thereby alter how a drug is metabolized and how well it works in the body.

It is therefore important to detect disease-associated CNVs and to understand how these variants contribute to disease and through which mechanisms. Such research improves the understanding of disease, supports personalized medicine and improves methods for disease prevention and treatment.

There are three main strategies for CNV analysis: whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted sequencing. Detection involves a variety of algorithms and methods. Some common ones are:

1. Read depth analysis: this method infers copy number variation from the density with which sequencing reads are distributed along the genome. By comparing the read depth of the sample with that of the reference genome, regions with copy number gain or loss can be identified. However, read depth analysis is limited in detecting small CNVs and is easily affected by factors such as sequencing depth and regional GC content.

2. Breakpoint analysis (split reads): this method detects CNVs by analyzing the positions of their breakpoints. It can use paired-end or long-read sequencing data to locate breakpoint regions and infer the position and size of a copy number variation. However, it requires high-quality sequencing data and precise breakpoint localization, and complex structural variants remain challenging.

3. Segmentation-based methods: these methods divide the genome into contiguous segments and compare the read depth or other features of each segment. CNVs are identified by detecting copy number differences between segments. However, these methods may produce false positives or false negatives when identifying small CNVs and complex structural variants.
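The read depth idea in item 1 can be illustrated with a small sketch. It is not taken from the patent: the median scaling, the pseudocount and the cutoffs of ±0.4 are assumptions chosen only for illustration, and the depth values are made up.

```python
import math
from statistics import median


def depth_log2_ratios(sample, reference, pseudocount=1.0):
    """Per-bin log2(sample / reference) after scaling each profile by its median depth."""
    s_med, r_med = median(sample), median(reference)
    return [
        math.log2(
            ((s + pseudocount) / (s_med + pseudocount))
            / ((r + pseudocount) / (r_med + pseudocount))
        )
        for s, r in zip(sample, reference)
    ]


def classify(ratio, gain_cutoff=0.4, loss_cutoff=-0.4):
    """Label a bin from its log2 ratio; the cutoffs are illustrative assumptions."""
    if ratio >= gain_cutoff:
        return "gain"
    if ratio <= loss_cutoff:
        return "loss"
    return "normal"


# Made-up read depths for six consecutive genomic bins.
sample = [100, 98, 152, 149, 51, 102]
reference = [100, 100, 100, 100, 100, 100]
calls = [classify(r) for r in depth_log2_ratios(sample, reference)]
print(calls)
```

Bins whose depth is roughly 1.5 times the reference are labeled as gains and the bin at roughly half the reference depth as a loss. On RNA-seq data the same ratio also moves with expression level, which is the confounding described in the following paragraphs.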

By contrast, few methods address CNV detection in transcriptome data. Possible reasons are as follows:

1) Transcriptome sequencing data contain noise and technical errors, such as sequencing errors, alignment errors and expression estimation errors. These errors can affect CNV detection results and call for appropriate data correction, so traditional CNV analysis methods may not be applicable.

2) Gene expression signals and CNV signals in the data may interfere with each other, so that copy number variation detection is confounded by gene expression.

3) RNA-seq coverage of the genome is relatively sparse and uneven: some regions have high coverage while others are insufficiently covered. This uneven coverage reduces the accuracy and sensitivity of CNV detection, making it highly challenging, or even impossible, for depth-comparison methods to detect precise CNV breakpoints and small CNV segments.

4) Current sequencing-based CNV analysis usually relies on building a reference baseline. A reference baseline is obtained by sequencing and analyzing a large number of individuals to characterize genomic variation in the normal population. By comparison with the baseline, variants in an individual's genome can be identified and the presence and copy number of CNVs inferred. However, building a reference baseline has limitations, such as:

i) Limited sample number and diversity: the quality and representativeness of a reference baseline depend on the number and variety of samples it contains. If the baseline has few samples or lacks diversity, some population-specific CNVs may not be captured accurately.

ii) Detection of rare and individual-specific CNVs: reference baselines mainly cover common CNVs. Because samples and data sets differ, a baseline may not provide accurate information for rare or individual-specific CNVs, even though these variants may have an important influence on an individual's disease susceptibility and phenotype.

iii) Changes in the experimental environment: these may include changes in temperature, humidity, lighting and other factors, or replacement of laboratory equipment and reagent batches. Such changes can make experimental conditions inconsistent, so that the samples used to build the baseline differ substantially from the samples actually analyzed, which can strongly affect the results.

As a result, most transcriptome sequencing data are currently used to estimate gene activity from gene and transcript expression levels, or to identify single nucleotide polymorphisms (SNPs) and short insertions and deletions. Yet these data contain a large amount of information about genomic variation in the sample that is not fully exploited. Among these variants, CNVs are very important for cancer research because they are major genetic drivers of cancer. However, identifying CNVs from RNA-seq data is very challenging: the dynamic and highly uneven coverage of the genome by RNA-seq signal makes it difficult to distinguish deletion and amplification events from dynamic changes in gene expression levels.

Traditional CNV analysis methods that rely only on a reference baseline and on read depth are therefore of limited use on transcriptome data, and a flexible and accurate method for detecting CNVs is urgently needed.

Summary of the invention

The purpose of the invention is to provide a copy number variation detection method based on single-sample high-throughput transcriptome sequencing data that overcomes the limitations of traditional CNV analysis methods and detects CNVs flexibly and accurately.

To this end, the solution of the invention is as follows:

A copy number variation detection method based on single-sample high-throughput transcriptome sequencing data, comprising the following steps:

aligning the genome sequencing data to the human reference genome, and counting the sequenced fragments aligned to each gene to obtain an expression matrix;

inputting the expression matrix into a detection model to obtain the copy number variation detection result;

wherein the detection model is based on a deep neural network model and is trained on a data set of preprocessed samples from a known database, and the preprocessing comprises standardizing the gene expression levels of the database samples and classifying copy number variation types.

Further, before alignment, the genome sequencing data are preprocessed to remove low-quality sequences and trim runs of consecutive low-quality bases.

Further, before training, the copy number types of the database samples are converted into numerical values usable by a deep learning algorithm.

Further, the weights and biases of the deep neural network model are randomly initialized.

Further, during training of the detection model, the hidden layers are activated by mapping the sample gene expression level x to max(0, x), that is, outputting x when x is greater than 0 and 0 otherwise.

Further, during training of the detection model, the output layer is activated by normalizing the input vector so that each element becomes a probability value.

Further, a cross-entropy loss function is used during training of the detection model and is minimized.
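For reference, the three operations named in the preceding paragraphs can be written in their standard textbook form. The symbols are assumptions introduced here and do not appear in the source: z_k is the k-th input to the output layer, K is the number of copy number classes (three in the embodiment described later), p_k is the predicted probability of class k, and y_k is the one-hot label.

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad
\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k
```

Because exactly one y_k equals 1, the loss reduces to the negative log of the probability assigned to the true class, so minimizing it pushes that probability toward 1.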

The invention also provides a copy number variation detection system based on single-sample high-throughput transcriptome sequencing data, comprising:

an alignment and counting module, which aligns the genome sequencing data to the human reference genome and counts the sequenced fragments aligned to each gene to obtain an expression matrix;

a detection module, which inputs the expression matrix into a detection model to obtain the copy number variation detection result;

wherein the detection model is based on a deep neural network model and is trained on a data set of preprocessed samples from a known database, and the preprocessing comprises standardizing the gene expression levels of the database samples and classifying copy number variation types.

The invention also provides a computer device comprising a memory and a processor. The memory stores a computer program, and the processor implements the steps of the above detection method when executing the computer program.

The invention also provides a computer-readable storage medium on which a computer program is stored. The computer program implements the steps of the above detection method when executed by a processor.

Compared with the prior art, the beneficial effects of the invention include but are not limited to:

The proposed detection method performs CNV analysis on a single sample, which gives a better understanding of the genome structure and variation of a single individual. At lower cost and in less time, it provides both gene expression levels and gene copy number information, which helps in understanding genome function and regulatory mechanisms and the effect of copy number variation on gene expression. The method can identify potential disease-associated CNVs and thereby assist disease diagnosis and prediction, which is of great significance for research on genetic diseases, rare diseases and complex diseases.

Brief description of the drawings

Figure 1 is a flow chart of the copy number variation detection method for single-sample high-throughput transcriptome sequencing data according to the invention.

Detailed description

To make the purpose, technical solutions and beneficial technical effects of the invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described in this specification serve only to explain the invention and are not intended to limit it.

Using a single sample for CNV analysis addresses several pain points and limitations, including:

Difficulty of sample collection: traditional CNV analysis usually requires a large number of samples to produce meaningful results, but obtaining enough samples can be difficult or expensive. Single-sample CNV analysis removes this sample collection problem and makes the analysis more convenient and feasible.

High cost and long time needed to build a reference baseline: traditional CNV analysis usually involves long sample collection times and high analysis costs. Single-sample analysis saves the time and cost of collecting samples and yields results faster.

To address the cost of sequencing the many samples needed to build a baseline for detecting copy number variation in transcriptome data, and the fact that traditional DNA CNV detection methods cannot be applied to RNA-seq data, one embodiment provides a copy number variation detection method based on single-sample high-throughput genome capture sequencing data. Its flow chart is shown in Figure 1, and it comprises the following steps:

1. The patient's genome sequencing data are quality-processed with fastp to remove low-quality sequences and trim runs of consecutive low-quality bases, and the resulting high-quality sequences are then aligned to the human reference genome hg19.
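As an illustration of the quality-processing part of step 1, the sketch below assembles a fastp command line for paired-end reads. The patent does not state which fastp options were used, so the quality threshold, the minimum length, the use of tail trimming and the file names are all assumptions. The command is only built and printed, not executed, and the alignment step is left out because the text does not name an aligner.

```python
import shlex


def build_fastp_command(read1, read2, out1, out2, min_base_quality=20, min_length=36):
    """Assemble a fastp invocation for paired-end reads (option values are assumptions)."""
    return [
        "fastp",
        "--in1", read1,
        "--in2", read2,
        "--out1", out1,
        "--out2", out2,
        # Bases below this Phred quality count as unqualified when filtering reads.
        "--qualified_quality_phred", str(min_base_quality),
        # Sliding-window trimming of low-quality bases from the 3' end.
        "--cut_tail",
        # Discard reads that end up shorter than this after trimming.
        "--length_required", str(min_length),
    ]


# Hypothetical file names, used only to show the shape of the command.
cmd = build_fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "clean_R1.fastq.gz", "clean_R2.fastq.gz",
)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments means it could later be passed to `subprocess.run` without shell quoting issues.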

2. A gene annotation file (such as a GTF or GFF file), which provides the positions of the transcripts and exons of each gene, is then used to determine the location of each gene. Reads are assigned to genes according to the alignment results, and the number of reads aligned to each gene is counted to obtain an expression matrix.
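The counting in step 2 can be sketched as follows. This is a toy under stated assumptions, not the patent's implementation: the two-gene annotation, the read coordinates and the rule that a read is counted only when it overlaps exactly one gene are all made up for illustration. A real pipeline would read alignments from a BAM file with a dedicated counting tool.

```python
# Toy GTF content: tab-separated columns, gene records only (hypothetical genes).
GTF_TEXT = """\
chr1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "GENE_A";
chr1\tdemo\tgene\t800\t1200\t.\t-\t.\tgene_id "GENE_B";
"""


def parse_genes(gtf_text):
    """Return (chrom, start, end, gene_id) for every 'gene' record in GTF text."""
    genes = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, _src, feature, start, end, _score, _strand, _frame, attrs = line.split("\t")
        if feature != "gene":
            continue
        gene_id = attrs.split('gene_id "')[1].split('"')[0]
        genes.append((chrom, int(start), int(end), gene_id))
    return genes


def count_reads(genes, reads):
    """Count reads per gene; a read is counted only if it overlaps exactly one gene."""
    counts = {gene_id: 0 for _c, _s, _e, gene_id in genes}
    for chrom, r_start, r_end in reads:
        hits = [
            gene_id
            for c, g_start, g_end, gene_id in genes
            if c == chrom and r_start <= g_end and r_end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Made-up aligned read intervals (chromosome, start, end).
reads = [
    ("chr1", 120, 170),
    ("chr1", 480, 530),
    ("chr1", 600, 650),   # falls between the two genes, not counted
    ("chr1", 900, 950),
    ("chr1", 1000, 1050),
]
counts = count_reads(parse_genes(GTF_TEXT), reads)
print(counts)
```

For a single sample, the resulting gene-to-count mapping is one row of the expression matrix described in the text.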

3. Sample data are then downloaded from a public database. Each sample has gene expression levels and the corresponding copy number variation types, which are used to build the model. Specifically:

1) Data preprocessing: compute the mean (μ) and standard deviation (σ) of the expression level (x) of each gene, and standardize each gene in each sample with the following formula:

z = (x - μ) / σ

2) Classify the copy number type as normal (copy number equal to 2), loss (copy number less than 2) or gain (copy number greater than 2), and use one-hot encoding to convert the copy number type into a numerical representation usable by a deep learning algorithm.
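The two preprocessing operations above can be sketched in a few lines. Several details are assumptions because the text does not fix them: the population standard deviation is used, genes with zero variance are mapped to 0, and the class order (loss, normal, gain) in the one-hot vector is arbitrary. The expression values are made up.

```python
from statistics import mean, pstdev

CLASSES = ("loss", "normal", "gain")  # order of the one-hot positions (assumption)


def zscore_per_gene(matrix):
    """matrix[s][g] is the expression of gene g in sample s; standardize each gene column."""
    n_genes = len(matrix[0])
    columns = [[row[g] for row in matrix] for g in range(n_genes)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(row[g] - mu) / sigma if sigma > 0 else 0.0 for g, (mu, sigma) in enumerate(stats)]
        for row in matrix
    ]


def copy_number_class(copy_number):
    """Map a copy number to the three classes defined in the text."""
    if copy_number < 2:
        return "loss"
    if copy_number > 2:
        return "gain"
    return "normal"


def one_hot(label):
    """Encode a class label as a 0/1 vector over CLASSES."""
    return [1 if label == c else 0 for c in CLASSES]


# Made-up expression matrix: 3 samples x 2 genes.
expr = [[10.0, 5.0], [20.0, 5.0], [30.0, 5.0]]
z = zscore_per_gene(expr)
labels = [one_hot(copy_number_class(cn)) for cn in (1, 2, 3)]
print(z)
print(labels)
```

The second gene has identical values in every sample, so its standard deviation is zero; returning 0.0 in that case avoids a division by zero, which the formula in the text leaves undefined.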

4. The gene expression and copy number data obtained above are then divided into a training set and a test set, and the model is built and trained as follows:

1) Parameter initialization: the weights and biases of the neural network are randomly initialized.

2) Network structure: set up an input layer (3 features), two hidden layers (16 neurons each) and an output layer (3 neurons).

3) Hidden layer activation: the hidden layers use the ReLU activation function, which maps the input x to max(0, x), that is, it outputs x when x is greater than 0 and 0 otherwise.

4) Output layer activation: the output layer uses the Softmax activation function, which normalizes the input vector so that each element becomes a probability value between 0 and 1 and all elements sum to 1.

5) Parameter optimization: a cross-entropy loss function is used during training, and the Adam optimizer minimizes it to improve the accuracy of the model.

6) Model validation: the test set is used as input to the model to verify its accuracy, as shown in Table 1.

Table 1:

Gene                  Prediction accuracy
ENSG00000000457.14    0.9122
ENSG00000000460.17    0.9298
ENSG00000000938.13    0.9298
ENSG00000000971.16    0.9122
ENSG00000001460.18    0.9298
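The network described in step 4 (3 inputs, two hidden layers of 16 ReLU units, 3 Softmax outputs, cross-entropy loss) can be sketched as a plain-Python forward pass. This is an illustration only: the initialization range, the random seed and the three input values are assumptions, the text does not say what the three input features are, and training with the Adam optimizer is omitted, so the printed probabilities come from untrained random weights.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and biases for one fully connected layer (range is an assumption)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def dense(x, layer):
    """Affine transform: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot_target):
    """Negative log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)


def forward(x, layers):
    h = x
    for layer in layers[:-1]:
        h = relu(dense(h, layer))          # hidden layers use ReLU
    return softmax(dense(h, layers[-1]))   # output layer uses Softmax


rng = random.Random(0)
sizes = [3, 16, 16, 3]  # input, two hidden layers, output, as stated in the text
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

probs = forward([0.3, -1.2, 0.8], layers)   # placeholder standardized features
loss = cross_entropy(probs, [0, 0, 1])      # placeholder one-hot label
print(probs, loss)
```

In practice a deep learning framework would supply the gradient computation and the Adam update; the sketch only shows how the layer sizes and activations in the text fit together.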

5. Finally, the expression matrix of the sample is input into the above model to obtain the predicted copy number result for each gene.
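Turning the model's three output probabilities into a per-gene call can be sketched as taking the most probable class. The class order, the gene names and the probability values below are hypothetical; the text only states that a prediction is obtained for each gene.

```python
CLASSES = ("Loss", "Normal", "Gain")  # order of the model outputs (assumption)


def call_copy_number(probabilities):
    """Return the most probable class and its probability."""
    best = max(range(len(CLASSES)), key=lambda k: probabilities[k])
    return CLASSES[best], probabilities[best]


# Hypothetical Softmax outputs for two genes.
per_gene_probs = {
    "ENSG_demo_1": [0.05, 0.15, 0.80],
    "ENSG_demo_2": [0.10, 0.85, 0.05],
}
calls = {gene: call_copy_number(p)[0] for gene, p in per_gene_probs.items()}
print(calls)
```

Keeping the winning probability alongside the label would let a downstream user filter out low-confidence calls, although the patent text does not describe such a threshold.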

In the above embodiment, the detection method can be used to detect the presence or absence of copy number variation in a chromosomal region in a single sample; it cannot directly diagnose any disease. The output is only a single-sample CNV result, intended to support a better understanding of genome function and regulatory mechanisms and of the effect of copy number variation on gene expression, which is of great significance for research on genetic diseases, rare diseases and complex diseases.

The following is a CNV detection example using one set of genome sequencing data:

(1) Sequencing data preprocessing

The fastq data were obtained; statistics are shown in Table 2.

Table 2:

Samples    Total reads    Total bases (bp)    Q20 (%)    Q30 (%)
read1      54593963       8189094450          97.52      93.27
read2      54593963       8189094450          97.52      93.27

(2) Fastq data processing

After quality control, high-quality sequences were obtained; statistics are shown in Table 3.

Table 3:

(3) Alignment of sequences to the reference genome

The alignment of the sequence data to the human reference genome hg19 is summarized in Table 4.

Table 4:

(4) Count the number of reads for each gene, as shown in Table 5.

Table 5:

(5) Input the expression matrix into the detection model to obtain the copy number variation detection results, as shown in Table 6.

Table 6:

Gene                  Variation type
ENSG00000007908.16    Gain
ENSG00000007923.16    Gain
ENSG00000007933.13    Gain
ENSG00000007968.7     Normal
ENSG00000008118.10    Normal
ENSG00000008128.23    Normal
ENSG00000008130.15    Normal
ENSG00000009307.16    Normal

The invention is not limited to what is described in the specification and embodiments. Further advantages and modifications will readily occur to those skilled in the art, so the invention is not limited to the specific details, representative schemes and examples described here, provided the spirit and scope of the general concept defined by the claims and their equivalents are not departed from.

Claims (10)

1. A copy number variation detection method based on single sample high throughput transcriptome sequencing data, comprising the steps of:
comparing the genome sequencing data to a human reference genome, and calculating the data quantity of the sequencing fragments on each gene to obtain an expression matrix;
inputting the expression matrix into a detection model to obtain a copy number variation detection result;
the detection model is obtained by training a data set by using a preprocessed known database sample based on a deep neural network model; the pretreatment comprises the standardization of the gene expression quantity of the database sample and the division of copy number variation types.
2. The method of claim 1, wherein the data alignment is preceded by pretreatment of genomic sequencing data to remove low quality sequences and excision of consecutive low quality bases.
3. The method of claim 1, wherein the database sample converts the copy number type to a value that can be used by a deep learning algorithm prior to training.
4. The method of claim 1, wherein the deep network model initializes weights and biases of the neural network in a random initialization manner.
5. The method according to claim 1, wherein the hidden layer is activated during the training of the detection model, and the sample gene expression level is mapped to max (0, x), i.e., x is output when x is greater than 0, otherwise 0 is output.
6. The method according to claim 1, wherein the output layer is activated during the training process of the detection model, the input vector is normalized, and the probability value of each element is taken.
7. The method according to claim 1, wherein the cross entropy loss function is used in the training process of the detection model and is subjected to a minimization process.
8. A copy number variation detection system based on single-sample high-throughput transcriptome sequencing data, comprising:
an alignment and calculation module: aligning the genome sequencing data to a human reference genome, and calculating the number of sequencing fragments on each gene to obtain an expression matrix;
a detection module: inputting the expression matrix into a detection model to obtain a copy number variation detection result;
the detection model is a deep neural network model trained on a data set built from preprocessed samples of a known database; the preprocessing comprises standardizing the gene expression levels of the database samples and classifying the copy number variation types.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
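
The training elements recited in claims 3 to 7 can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation. The z-score standardization, the three class names in CNV_CLASSES, the single hidden layer, the initialization scale, and the learning rate are all assumptions chosen for illustration; the claims state only that expression levels are standardized, copy number types are converted to numerical values, weights and biases are initialized randomly, the hidden layer computes max(0, x), the output layer normalizes to probabilities, and a cross-entropy loss is minimized.

```python
import numpy as np

# Hypothetical label encoding; the claims only require that copy number
# types become numerical values.
CNV_CLASSES = {"loss": 0, "neutral": 1, "gain": 2}


def standardize(expr):
    """Z-score each gene (column) across samples; constant genes map to 0."""
    expr = np.asarray(expr, dtype=float)
    std = expr.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    return (expr - expr.mean(axis=0)) / std


def encode_labels(names):
    """Convert copy number type names to integer class indices (claim 3)."""
    return np.array([CNV_CLASSES[n] for n in names])


def init_params(n_genes, n_hidden, n_classes, rng):
    """Randomly initialize weights and biases (claim 4)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_genes, n_hidden)),
        "b1": rng.normal(0.0, 0.1, n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_classes)),
        "b2": rng.normal(0.0, 0.1, n_classes),
    }


def forward(params, x):
    """Return hidden activations and class probabilities."""
    # Hidden layer activation max(0, x), as in claim 5.
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    # Output layer normalizes each row to probabilities, as in claim 6.
    return hidden, exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class (claim 7)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def train_step(params, x, labels, lr=0.1):
    """One full-batch gradient descent step; returns the loss before the update."""
    hidden, probs = forward(params, x)
    loss = cross_entropy(probs, labels)
    # Gradient of the mean cross-entropy with respect to the logits.
    d_logits = probs.copy()
    d_logits[np.arange(len(labels)), labels] -= 1.0
    d_logits /= len(labels)
    # Back-propagate through the max(0, x) activation.
    d_hidden = (d_logits @ params["W2"].T) * (hidden > 0)
    params["W2"] -= lr * hidden.T @ d_logits
    params["b2"] -= lr * d_logits.sum(axis=0)
    params["W1"] -= lr * x.T @ d_hidden
    params["b1"] -= lr * d_hidden.sum(axis=0)
    return loss
```

In use, a samples-by-genes expression matrix would be passed through standardize, the known copy number types through encode_labels, and train_step called repeatedly on parameters from init_params(n_genes, n_hidden, n_classes, np.random.default_rng(seed)) until the returned loss stops decreasing. Plain gradient descent is used here only to make the minimization step concrete; the claims do not name an optimizer.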
CN202311056237.7A 2023-08-21 2023-08-21 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data Pending CN117095744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311056237.7A CN117095744A (en) 2023-08-21 2023-08-21 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311056237.7A CN117095744A (en) 2023-08-21 2023-08-21 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Publications (1)

Publication Number Publication Date
CN117095744A (en) 2023-11-21

Family

ID=88771058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311056237.7A Pending CN117095744A (en) 2023-08-21 2023-08-21 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Country Status (1)

Country Link
CN (1) CN117095744A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648721A (en) * 2019-09-19 2020-01-03 北京市儿科研究所 Method and device for detecting copy number variation by aiming at exon capture technology
CN111210873A (en) * 2020-01-14 2020-05-29 西安交通大学 Exon sequencing data-based copy number variation detection method and system, terminal and storage medium
CN111276187A (en) * 2020-01-12 2020-06-12 湖南大学 Gene expression profile feature learning method based on self-encoder
CN111599407A (en) * 2020-05-13 2020-08-28 北京橡鑫生物科技有限公司 Method and device for detecting copy number variation
CN112634987A (en) * 2020-12-25 2021-04-09 北京吉因加医学检验实验室有限公司 Method and device for detecting copy number variation of single-sample tumor DNA
CN113903395A (en) * 2021-10-28 2022-01-07 聊城大学 An improved particle swarm optimization-based BP neural network copy number variation detection method and system
CN114566209A (en) * 2022-03-03 2022-05-31 四川大学 Training method and application of mycobacterium tuberculosis drug resistance prediction model based on hierarchical attention neural network
CN115171779A (en) * 2022-07-13 2022-10-11 浙江大学 Cancer driver gene prediction device based on graph attention network and multigroup chemical fusion
CN115249513A (en) * 2021-12-14 2022-10-28 聊城大学 A neural network copy number variation detection method and system based on Adaboost integration idea

Similar Documents

Publication Publication Date Title
Feng et al. Leveraging expression from multiple tissues using sparse canonical correlation analysis and aggregate tests improves the power of transcriptome-wide association studies
US20230222311A1 (en) Generating machine learning models using genetic data
Zhang et al. A new algorithm for analysis of oligonucleotide arrays: application to expression profiling in mouse brain regions
CN107408163B (en) Method and apparatus for analyzing gene
WO2022170909A1 (en) Drug sensitivity prediction method, electronic device and computer-readable storage medium
CN109887546B (en) Single-gene or multi-gene copy number detection system and method based on next-generation sequencing
US20230348980A1 (en) Systems and methods of detecting a risk of alzheimer's disease using a circulating-free mrna profiling assay
Wang et al. PHARP: a pig haplotype reference panel for genotype imputation
CN116486913B (en) System, apparatus and medium for de novo predictive regulatory mutations based on single cell sequencing
US20220277811A1 (en) Detecting False Positive Variant Calls In Next-Generation Sequencing
CN115836349A (en) System and method for evaluating longitudinal biometric data
CA3046660A1 (en) Methods and systems for determining paralogs
Srivastava et al. Heritability estimation approaches utilizing genome‐wide data
CN116525108A (en) SNP data-based prediction method, device, equipment and storage medium
WO2007061770A2 (en) Method and system for analysis of time-series molecular quantities
CN111210873A (en) Exon sequencing data-based copy number variation detection method and system, terminal and storage medium
CN113348512A (en) Method for predicting genotype by using single nucleotide polymorphism data
CN117095744A (en) Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data
US20240120096A1 (en) Computational Method And System For Diagnostic And Therapeutic Prediction From Multimodal Data
WO2025039433A1 (en) Copy number variation detection method based on high-throughput transcriptome sequencing data of single sample
US20150094223A1 (en) Methods and apparatuses for diagnosing cancer by using genetic information
Khodayari Moez et al. Longitudinal linear combination test for gene set analysis
Wang et al. An automated quality control pipeline for eQTL analysis with RNA-seq data
Maguluri et al. Big Data Solutions For Mapping Genetic Markers Associated With Lifestyle Diseases
Qi et al. Computational methods for allele-specific expression in single cells

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20231121