WO2022011855A1 - False-positive structural variation filtering method, storage medium and computing device - Google Patents

False-positive structural variation filtering method, storage medium and computing device

Info

Publication number
WO2022011855A1
WO2022011855A1 PCT/CN2020/120315 CN2020120315W WO2022011855A1 WO 2022011855 A1 WO2022011855 A1 WO 2022011855A1 CN 2020120315 W CN2020120315 W CN 2020120315W WO 2022011855 A1 WO2022011855 A1 WO 2022011855A1
Authority
WO
WIPO (PCT)
Prior art keywords
purity
data
structural variation
feature
samples
Prior art date
Application number
PCT/CN2020/120315
Other languages
English (en)
French (fr)
Inventor
郑田
王嘉寅
张选平
刘涛
朱晓燕
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学
Publication of WO2022011855A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • The invention belongs to the technical field of data science, and in particular relates to a false-positive structural variation filtering method that accounts for diluted sequencing signals, a storage medium and a computing device.
  • Genomic structural variations (SVs) are changes in gene structure, a class of complex and directly carcinogenic chromosomal variation. Tumors arise in normal tissue cells through the accumulation of variation in the genome.
  • next-generation sequencing (NGS)
  • Structural variations are identified by comparing and analyzing an individual's sequencing results against a reference sequence.
  • Existing structural variation detection methods and software can accurately detect different types of structural variation and determine the size, position and other properties of each variant. Accurate identification of structural variations not only accelerates research into genetic mechanisms but also plays a very important role in revealing the mechanisms of complex diseases.
  • Methods of the first type use features as benchmarks for filtering false positives: every structural variation that fails a feature threshold is filtered out as a false positive. An unsuitable threshold therefore easily causes misclassification, and with such one-size-fits-all benchmarks it is difficult to find a threshold setting that cleanly separates false positives without deleting low-frequency variants by mistake. On low-purity samples the accuracy is very low;
  • Machine learning filtering methods use samples of a fixed purity as the training set. They treat false-positive filtering as a classification problem and use different features as classification criteria. Although they filter well, the classification baseline obtained by training suits only the fixed-purity features it was trained on. On low-purity samples that differ from the training samples, the baseline is no longer accurate, classification accuracy drops markedly, and the false-positive rate becomes very high.
  • The technical problem to be solved by the present invention is, in view of the above deficiencies of the prior art, to provide a false-positive structural variation filtering method that accounts for diluted sequencing signals, together with a storage medium and device. It targets cases in which structural variation detection is affected by tumor purity and clonal structure and the diluted sequencing signal produces a large number of false positives, and it uses a transfer learning strategy to filter those false positives.
  • the present invention adopts the following technical solutions to realize:
  • a false-positive structural variant filtering method that considers diluted sequencing signals, including the following steps:
  • Transferring the feature datasets of different purities yields two dimension-reduced transformation matrices with 23 column vectors. Each column vector is taken as a feature, giving the new set of all structural variation features Θ'.
  • The transformation matrix W is taken as the feature dataset and the original label set Y_p as the corresponding label set; each candidate structural variation is represented by a row vector x′ of 23 features with its original label y, and a classification model is trained based on the extremely randomized trees model to predict true- and false-positive structural variations;
  • In the predicted label set Y'_p, true-positive structural variations are labeled 1 and false-positive ones 0; structural variations labeled 0 are filtered out and those classified as true positives are output as the final result, completing the false-positive structural variation filtering.
  • step S2 is specifically:
  • step S3 is specifically:
  • transfer component analysis uses the maximum mean discrepancy to measure the distance between the distributions of the two domains
  • step S301 the target domain data set D t is specifically:
  • n 2 represents the number of samples in the target domain, is the feature space and label of the target domain, p is the sample purity of the target domain, and P is the set of samples of different purity;
  • the source domain dataset D s specifically:
  • n 1 represents the number of samples in the source domain
  • p j is the sample purity of the source domain
  • step S302 the maximum mean difference distance DISTANCE (D s , D t ) is calculated as follows:
  • x i is the data in the source domain
  • x j is the data in the target domain
  • n 1 represents the number of samples in the source domain
  • n 2 represents the number of samples in the target domain
  • step S303 is specifically:
  • the maximum mean difference distance matrix L is calculated, and the calculation method of each element l ij is:
  • the center matrix H is:
  • x i is the data in the source domain
  • x j is the data in the target domain
  • n 1 represents the number of samples in the source domain
  • n 2 represents the number of samples in the target domain
  • K_{s,s} and K_{t,t} are the Gram matrices defined on the source domain and target domain data in the embedding space, respectively
  • K_{s,t} is the Gram matrix defined on the cross-domain data
  • K_{t,s} = K_{s,t}^T.
  • step S4 is specifically:
  • The test set of each purity corresponds to training sets of multiple purities other than its own
  • Each model trained on one of these training sets classifies true and false structural variations on the test set, yielding the label collection γ' for all purity samples, which contains m-1 label sets.
  • step S5 the final predicted label set Y' p is:
  • n is the number of samples with different purity.
  • Another technical solution of the present invention is a computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a computing device, cause the computing device to execute any of the methods described.
  • Another technical solution of the present invention is a filtering device, comprising:
  • one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the described methods.
  • the present invention has the following beneficial effects:
  • The present invention is a false-positive filtering method for structural variation detection that accounts for diluted sequencing signals and is based on a transfer learning strategy. It first transfers data using the transfer learning strategy and then classifies with a machine learning model. This addresses the feature selection problem of existing methods and the false positives in samples whose sequencing signal is diluted by tumor purity and clonal structure. It does not require the exact value of the sample purity, applies to samples of different purities, and shows good performance.
  • The feature data of different sample purities serve as source and target domains respectively; transfer component analysis (TCA) performs the data transfer, the optimal parameters of the method are found through repeated experiments, and the feature transformation matrices of the two domains are finally obtained;
  • The source domain feature transformation matrices of the different sample purities are each fed into extremely randomized trees (Extra Trees, ET) for training, the best model parameters are found by grid search, and several trained Extra Trees models are finally obtained.
  • The target domain feature transformation matrix of the fixed sample purity is fed into each Extra Trees model as the test set, and majority voting over the predictions of all models decides the final predicted labels;
  • Structural variations labeled as false positives are filtered out, and the true-positive results are output.
  • The present invention extracts initial features from the result files of structural variation detection and combines transfer component analysis with the Extra Trees model, so that the same model adapts well to structural variation detection samples whose sequencing signals are diluted to different degrees, with higher and more stable filtering accuracy.
  • Fig. 1 is the flow chart of the present invention
  • Figure 2 is a graph of the comparison results of a small number of samples in the simulation data set, where (a) is the accuracy, (b) is the recall rate, (c) is the F1 value, and (d) is the precision;
  • Figure 3 shows the comparison result of the wrongly labeled samples in the simulation data set, in which (a) is the accuracy, (b) is the recall rate, (c) is the F1 value, and (d) is the precision;
  • Figure 4 is a comparison chart of the experimental results in the real data set.
  • When a layer/element is referred to as being "on" another layer/element, it may be directly on that layer/element, or intervening layers/elements may be present between them.
  • If a layer/element is "on" another layer/element in one orientation, it may be "under" the other layer/element when the orientation is reversed.
  • Whatever the purity of the samples used to train the model, a transfer learning strategy can judge samples of every purity without distinction, remove false positives, and improve the accuracy of low-frequency mutation detection.
  • Transfer learning extracts meaningful latent representations from a pre-trained model for use on a new, similar objective. It can "transfer" knowledge from one domain (the source) to another (the target). In this way, the knowledge in a false-positive filtering machine learning model for one sample purity can be used to rebuild models for other sample purities.
  • The invention provides FPTLfilter (Filtering False Positive structural variants based on Transfer Learning), a false-positive structural variation filtering method that accounts for diluted sequencing signals. Its input is the feature data of structural variation candidate sets extracted from the result files of existing structural variation detection tools, and its output is the set of structural variations remaining after false positives are filtered out.
  • the present invention is based on the general consensus of the following academic circles:
  • Tumor purity and clonal structure will cause the signal of structural variation to be detected to be diluted, the data information will change, the classification baseline obtained by training on fixed samples is no longer applicable, and lower sample purity can lead to false positive variant identification.
  • a method for filtering false positive structural variants considering diluted sequencing signals of the present invention includes the following steps:
  • Existing structural variation detection tools are run on data of different sample purities. To ensure that the detected candidate sets are broad enough to bring in a large number of false-positive samples, and thus to give the classification model training and test sets with balanced labels, the filtering thresholds of the detection tool are set to their lowest values, yielding structural variation candidate sets of different purities.
  • the result file generated after the paired-end sequencing data generated by the second-generation sequencing technology is aligned with the reference genome sequence contains the alignment information of each read data, such as alignment position, alignment quality, sequence fragment and other information. This information is also included in the VCF (Variant Call Format) file of the structural variation detection result. If a certain information can reflect a certain attribute of the structural variation from some aspects, this information can be extracted as an effective feature for classification. Extracting features from the result file includes the following steps:
  • n is the number of instances.
  • the feature dataset corresponds to a corresponding label set representation category, where 1 represents the true positive structural variant class, 0 represents the false positive structural variant class, and the structural variant sample label dataset with purity p is represented as Y p , specifically:
  • The present invention uses a transfer model based on the transfer learning method transfer component analysis to transfer the structural variation feature datasets of different purities, pulling their data distributions closer together. This comprises the following steps:
  • the structural variation feature set with a fixed purity p in the purity space is used as the target domain dataset D t , specifically:
  • n 2 represents the number of samples in the target domain
  • p is the sample purity of the target domain
  • P is the set of samples of different purity.
  • n 1 represents the number of samples in the source domain
  • p j is the sample purity of the source domain
  • Transfer component analysis uses the maximum mean discrepancy (MMD) to measure the distance between the distributions of the two domains;
  • The maximum mean discrepancy distance DISTANCE(D_s, D_t) is calculated as follows:
  • x i is the data in the source domain
  • x j is the data in the target domain
  • n 1 represents the number of samples in the source domain
  • n 2 represents the number of samples in the target domain
  • K_{s,s} and K_{t,t} are the Gram matrices defined on the source domain and target domain data in the embedding space, respectively
  • K_{s,t} is the Gram matrix defined on the cross-domain data
  • K_{t,s} = K_{s,t}^T.
  • Z′ i is the set of all purity vectors for each new feature
  • The transformation matrix W is taken as the feature dataset and the original label set Y_p as the corresponding label set; each candidate structural variation is represented by a row vector x′ of 23 features with its original label y.
  • The present invention trains a classification model based on the extremely randomized trees model to predict true- and false-positive structural variations, including the following steps:
  • The test set of each purity corresponds to training sets of multiple purities other than its own
  • Each model trained on one of these training sets classifies true and false structural variations on the test set, yielding the label collection for all purity samples, which contains m-1 label sets.
  • Every per-purity predicted label set in the collection γ' is valid data, and no single one can serve as the final classification result.
  • Majority voting is applied to the predicted labels from the m-1 purities; the label with the most votes across all predicted label sets is taken as the final predicted label set for classifying true- and false-positive structural variations, as follows:
  • n is the number of samples with different purity.
  • In the predicted label set Y'_p, true-positive structural variations are labeled 1 and false-positive ones 0; structural variations labeled 0 are filtered out and those classified as true positives are output as the final result.
  • To verify the effectiveness of the invention, the necessity of transfer learning is tested first by applying the feature datasets before and after data transfer to the Extra Trees classification model; to verify feasibility, cases with few candidate samples and with erroneous labels in the label set are then tested.
  • the four metrics of accuracy, precision, recall and F1 value are used to measure the performance of the model.
  • Structural variation candidate samples were obtained at six sample purities: 5%, 10%, 15%, 20%, 25% and 30%.
  • The present invention is novel in using transfer learning to transfer data between samples of different purities, so we first test whether transfer learning is necessary.
  • Each purity structural variant candidate set is a balanced dataset containing 4000 samples, and the ratio of true positive and false positive class samples is 1:1.
  • TCA denotes classification on the transformation matrices obtained by transfer component analysis
  • BASE denotes classification on the extracted feature data directly. The true/false-positive classification results are shown in Table 1.
  • Table 1: Classification results on feature data before and after transfer component analysis
  • In Figure 2, datasize100 (200, 300) denotes the number of samples per class in the three settings, the x-axis is sample purity and the y-axis is the metric value; in Figure 3, proportion10% (20%, 30%) denotes the label error rate in the three settings, with the same axes.
  • FPTLfilter can accurately identify false-positive structural variants, adapt well at different purities, can significantly reduce false positives, and is very efficient and stable in low-purity samples.
  • The present invention is a false-positive structural variation filtering method that accounts for diluted sequencing signals, and it solves the problem that existing algorithms do not apply well to samples whose sequencing signals are diluted to different degrees. Because transfer component analysis is used to transfer data between tumor samples of different purities, the invention overcomes the gap between sample feature data distributions caused by dilution of the sequencing signal, and so performs well across sample purities.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

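The invention discloses a false-positive structural variation filtering method, storage medium and device. A candidate set of structural variations is first obtained and features are extracted; feature data from samples of different purities are then transferred, classified with an extremely randomized trees (Extra Trees) model, and the classification results are combined into a final prediction, thereby filtering out false-positive structural variations. The invention extracts initial features from the result files of structural variation detection and combines transfer component analysis with the Extra Trees model, so that the same model adapts well to structural variation detection samples whose sequencing signals are diluted to different degrees, with higher and more stable filtering accuracy.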

Description

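False-positive structural variation filtering method, storage medium and computing device
[Technical Field]
The invention belongs to the technical field of data science, and specifically relates to a false-positive structural variation filtering method that accounts for diluted sequencing signals, a storage medium and a computing device.
[Background Art]
Genomic structural variations (SVs) are changes in gene structure. They are a class of complex, directly carcinogenic chromosomal variation and result from the combined influence of the external natural environment and metabolism within the organism; tumors arise in normal tissue cells precisely through the accumulation of variation in the genome. In recent years, the development of next-generation sequencing (NGS) has made it possible to analyze genes much faster, to identify different types of structural variation at base-level resolution, and thereby to trace the causes of disease. Structural variations are identified by comparing and analyzing an individual's sequencing results against a reference sequence. Existing structural variation detection methods and software can accurately detect different types of structural variation and determine the size, position and other properties of each variant. Accurate identification of structural variations not only accelerates research into genetic mechanisms but also plays a very important role in revealing the mechanisms of complex diseases.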
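Many mature variant detection methods exist, and essentially all of them detect variants and filter false positives based on variant features. However, we find that two factors can dilute the sequencing signal and couple the features, namely:
1) tumor purity, and 2) clonal structure and clonal proportion. A diluted sequencing signal causes low-frequency variants to be missed, so detection methods lower their filtering thresholds; this, however, introduces a large number of false positives. Sample purity, that is, the proportion of the target material in the total sample, is used to measure how strongly the signal is diluted. When sample purity falls below 50%, the precision of variant calling drops rapidly (even below 25%). It has been reported that every 2% decrease in sample purity can introduce 166 false positives per megabase. As sample purity falls from 30% to 5%, the false-positive rate of structural variation detection rises from 19.375% to 38.125%. False positives seriously degrade the accuracy of structural variation detection and interfere with follow-up research on disease mechanisms. To address this, many computational techniques have been developed to filter these false positives, and they fall into two types. The first, represented by GATK [13], manually sets thresholds on one or more biological indicators, including sequencing depth, number of supporting reads and base quality, and filters out every variant site that fails them; the second classifies true and false positives with a pre-trained deep learning model.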
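However, existing methods have the following problems:
1) Methods of the first type use features as benchmarks for filtering false positives: every structural variation that fails a preset feature threshold is filtered out as a false positive. An unsuitable threshold therefore easily causes misclassification. These one-size-fits-all benchmarks remove the low-frequency variants one wants to detect along with the false positives, and it is difficult to find a threshold setting that cleanly separates false positives without discarding low-frequency variants; on low-purity samples the accuracy is very low.
2) No existing method considers the dilution of sequencing signals caused by tumor purity or clonal structure, let alone the fact that a classification baseline stops being applicable when samples are diluted to different degrees. Machine learning filtering methods use samples of a fixed purity as the training set; they treat false-positive filtering as a classification problem and use different features as classification criteria. Although they filter well, the classification baseline obtained by training suits only the fixed-purity features it was trained on. When they process low-purity samples that differ from the training samples, the baseline is no longer accurate, classification accuracy drops markedly, and the false-positive rate becomes very high.
In addition, purity is a continuous variable and cannot be treated as a discrete one simply by adding a few more training sets. Training a separate classification model for samples of each tumor purity or clonal structure is too costly and computationally heavy to achieve the desired effect; training a model for every sample is impractical and extremely expensive.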
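[Summary of the Invention]
The technical problem to be solved by the invention is, in view of the above deficiencies of the prior art, to provide a false-positive structural variation filtering method that accounts for diluted sequencing signals, together with a storage medium and device. It targets second-generation sequencing data in which genomic structural variation detection is affected by tumor purity and clonal structure and the diluted sequencing signal produces a large number of false positives, and it uses a transfer learning strategy to filter those false positives.
To achieve the above purpose, the invention adopts the following technical solution:
A false-positive structural variation filtering method that accounts for diluted sequencing signals, comprising the following steps:
S1. Run an existing structural variation detection tool on data of different sample purities, with the tool's filtering thresholds set to their lowest values, to obtain candidate sets of structural variations;
S2. Extract features from the result files, taking information that reflects properties of a structural variation as effective classification features;
S3. Store each feature vector as one row, as an instance representing its candidate structural variation; denote the feature dataset of structural variation samples of purity p as X_p and the corresponding label dataset as Y_p; combining features and labels, denote the set of all structural variation candidate sets in the purity space as Η; use a transfer model based on the transfer learning method transfer component analysis to transfer the feature datasets of different purities, pulling the data distributions of different purities closer together;
S4. Transferring the feature datasets of different purities yields two dimension-reduced transformation matrices with 23 column vectors; take each column vector as a feature to obtain the new set of all structural variation features Θ'; take the transformation matrix W as the feature dataset and the original label set Y_p as the corresponding label set, so that each candidate structural variation is represented by a row vector x′ of 23 features with its original label y; train a classification model based on the extremely randomized trees model and predict true- and false-positive structural variations;
S5. Apply majority voting to the predicted labels from the m-1 purities; the voting result is the label with the most votes across all predicted label sets, and it is taken as the final predicted label set Y'_p for classifying true- and false-positive structural variations;
S6. In the predicted label set Y'_p, true-positive structural variations are labeled 1 and false-positive ones 0; filter out the structural variations labeled 0 and output those classified as true positives as the final result, completing the false-positive structural variation filtering.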
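Specifically, step S2 is as follows:
S201. Denote the purity space, the set of all purities, as P, and extract all read-related information from the structural variation detection result files of the different purities;
S202. For each candidate structural variation, extract 26 features from this information, and denote the set of all features as Θ.
Specifically, step S3 is as follows:
S301. Take the structural variation feature set of a fixed purity p in the purity space as the target domain dataset D_t, and the feature sets of the other purities p_j as source domain datasets D_s;
S302. Transfer component analysis uses the maximum mean discrepancy to measure the distance between the distributions of the two domains;
S303. Solve for the maximum mean discrepancy distance by borrowing the kernel function idea of support vector machines;
S304. Compute the eigendecomposition of (KLK + μI)^(-1) KLK and take the first M eigenvectors to construct the feature data transformation matrix W from purity p_j to purity p.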
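Further, in step S301, the target domain dataset D_t is specifically:
Figure PCTCN2020120315-appb-000001
where n_2 denotes the number of samples in the target domain,
Figure PCTCN2020120315-appb-000002
is the feature space and labels of the target domain, p is the target domain sample purity, and P is the set of sample purities;
the source domain dataset D_s is specifically:
Figure PCTCN2020120315-appb-000003
where n_1 denotes the number of samples in the source domain,
Figure PCTCN2020120315-appb-000004
is the feature space and labels of the source domain data, and p_j is the source domain sample purity.
Further, in step S302, the maximum mean discrepancy distance DISTANCE(D_s, D_t) is computed as:
Figure PCTCN2020120315-appb-000005
where x_i is a data point in the source domain, x_j is a data point in the target domain,
Figure PCTCN2020120315-appb-000006
is the data distribution mapping of the source domain,
Figure PCTCN2020120315-appb-000007
is the data distribution mapping of the target domain, n_1 denotes the number of samples in the source domain, and n_2 denotes the number of samples in the target domain.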
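Further, step S303 is specifically:
First compute the maximum mean discrepancy distance matrix L, each element l_ij of which is given by:
Figure PCTCN2020120315-appb-000008
The centering matrix H is:
Figure PCTCN2020120315-appb-000009
where x_i is a data point in the source domain, x_j is a data point in the target domain,
Figure PCTCN2020120315-appb-000010
Figure PCTCN2020120315-appb-000011
identity matrix, n_1 denotes the number of samples in the source domain, and n_2 denotes the number of samples in the target domain;
Then map the datasets with the linear kernel function k(x, y) = x^T y
Figure PCTCN2020120315-appb-000012
Figure PCTCN2020120315-appb-000013
and construct the kernel matrix K as:
Figure PCTCN2020120315-appb-000014
where K_{s,s} and K_{t,t} are the Gram matrices defined on the source domain and target domain data in the embedding space, K_{s,t} is the Gram matrix defined on the cross-domain data, and K_{t,s} = K_{s,t}^T.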
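Specifically, step S4 is as follows:
S401. Select the target domain transformation matrix of purity p as the test set
Figure PCTCN2020120315-appb-000015
S402. Set the number of iterations to K; following the CART decision tree algorithm, use all training set samples
Figure PCTCN2020120315-appb-000016
to train each base classifier, iterating K times to generate K decision trees, which together form the extremely randomized trees model;
S403. Apply the generated extremely randomized trees model to the test set samples
Figure PCTCN2020120315-appb-000017
to generate predictions; tally the predictions of all base classifiers and use voting to produce the classification result for the training set of purity p_j, giving the label set
Figure PCTCN2020120315-appb-000018
S404. The test set of each purity
Figure PCTCN2020120315-appb-000019
corresponds to training sets of multiple purities other than its own
Figure PCTCN2020120315-appb-000020
Each model trained on one of these training sets classifies true and false structural variations on the test set, yielding the label collection γ' for all purity samples, which contains m-1 label sets.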
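Specifically, in step S5, the final predicted label set Y'_p is:
Figure PCTCN2020120315-appb-000021
where
Figure PCTCN2020120315-appb-000022
is the predicted label of sample i, p is the sample purity, P is the set of sample purities, and n is the number of samples of different purities.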
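Another technical solution of the invention is a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions that, when executed by a computing device, cause the computing device to perform any of the methods described.
Another technical solution of the invention is a filtering device, comprising:
one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described.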
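Compared with the prior art, the invention has the following beneficial effects:
The invention is a false-positive filtering method for structural variation detection that accounts for diluted sequencing signals and is based on a transfer learning strategy. It first transfers data using the transfer learning strategy and then classifies with a machine learning model. This addresses the feature selection problem of existing methods and the false positives in samples whose sequencing signal is diluted by tumor purity and clonal structure. It does not require the exact value of the sample purity, applies to samples of different purities, and shows good performance.
Further, an existing structural variation detection tool is run on samples whose signals are diluted to different degrees (sample purity defines the degree of dilution) to obtain result files of structural variation candidate sets, and relevant features are extracted from the read information in those files;
Further, the feature data of different sample purities serve as source and target domains respectively; transfer component analysis (TCA) performs the data transfer, the optimal parameters of the method are found through repeated experiments, and the feature transformation matrices of the two domains are finally obtained;
Further, the source domain feature transformation matrices of the different sample purities are each fed into extremely randomized trees (Extra Trees, ET) for training, the best model parameters are found by grid search, and several trained Extra Trees models are finally obtained.
Further, the target domain feature transformation matrix of the fixed sample purity is fed into each Extra Trees model as the test set, and majority voting over the predictions of all models decides the final predicted labels;
Further, according to the label set obtained by classification, structural variations labeled as false positives are filtered out and the true-positive results are output.
In summary, the invention extracts initial features from the result files of structural variation detection and combines transfer component analysis with the Extra Trees model, so that the same model adapts well to structural variation detection samples whose sequencing signals are diluted to different degrees, with higher and more stable filtering accuracy.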
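[Brief Description of the Drawings]
Fig. 1 is a flow chart of the invention;
Fig. 2 shows comparison results for small sample sizes in the simulated dataset, where (a) is accuracy, (b) is recall, (c) is the F1 score and (d) is precision;
Fig. 3 shows comparison results for samples with erroneous labels in the simulated dataset, where (a) is accuracy, (b) is recall, (c) is the F1 score and (d) is precision;
Fig. 4 compares the experimental results on the real dataset.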
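[Detailed Description]
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention and are not intended to limit the scope of the disclosure. In the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the disclosure. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
The drawings show various schematic structural diagrams according to embodiments of the disclosure. The drawings are not to scale; some details are enlarged for clarity and some may be omitted. The shapes of the regions and layers shown, and their relative sizes and positions, are only exemplary and may deviate in practice because of manufacturing tolerances or technical limitations; those skilled in the art may design regions/layers with different shapes, sizes and relative positions as needed.
In the context of the disclosure, when a layer/element is referred to as being "on" another layer/element, it may be directly on that layer/element, or intervening layers/elements may be present between them. Moreover, if a layer/element is "on" another layer/element in one orientation, it may be "under" the other layer/element when the orientation is reversed.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings distinguish similar objects and do not necessarily describe a particular order or sequence. Data so used are interchangeable where appropriate, so that the embodiments described here can be carried out in orders other than those illustrated or described. Furthermore, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion: a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units not expressly listed or inherent to the process, method, product or device.
The invention is described in further detail below with reference to the drawings: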
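Whatever the purity of the samples used to train the model, a transfer learning strategy can judge samples of every purity without distinction, remove false positives, and improve the accuracy of low-frequency mutation detection. Transfer learning extracts meaningful latent representations from a pre-trained model for use on a new, similar objective. It can "transfer" knowledge from one domain (the source) to another (the target). In this way, the knowledge in a false-positive filtering machine learning model for one sample purity can be used to rebuild models for other sample purities. The technical problems to be solved are:
1. the tedious and complex nature of feature selection;
2. the dilution of sequencing signals caused by tumor purity and clonal structure;
3. the invention is not affected by the sequencing software or detection tool used.
The invention provides FPTLfilter (Filtering False Positive structural variants based on Transfer Learning), a false-positive structural variation filtering method that accounts for diluted sequencing signals. Its input is the feature data of structural variation candidate sets extracted from the result files of existing structural variation detection tools, and its output is the set of structural variations remaining after false positives are filtered out.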
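The invention is based on the following points of general consensus in the academic community:
1. Commonly used detection algorithms determine the type, size, position and other properties of structural variations from the read information obtained by aligning read pairs produced by second-generation sequencing against a reference sequence;
2. Tumor purity and clonal structure dilute the signal of the structural variations to be detected and change the data, so that a classification baseline trained on fixed samples no longer applies; lower sample purity can produce false-positive variant calls.
Referring to Fig. 1, the false-positive structural variation filtering method of the invention, which accounts for diluted sequencing signals, comprises the following steps: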
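S1. Obtain the structural variation candidate sets
Existing structural variation detection tools are run on data of different sample purities. To ensure that the detected candidate sets are broad enough to bring in a large number of false-positive samples, and thus to give the classification model training and test sets with balanced labels, the filtering thresholds of the detection tool are set to their lowest values, yielding structural variation candidate sets of different purities.
S2. Feature extraction
The result file produced by aligning paired-end data from second-generation sequencing to the reference genome contains alignment information for every read, such as alignment position, alignment quality and sequence fragment. The VCF (Variant Call Format) file of structural variation detection results contains this information as well. If a piece of information reflects some property of a structural variation, it can be extracted as an effective feature for classification. Extracting features from the result file comprises the following steps:
S201. Denote the purity space, the set of all purities, as P = {p_i, i = 1, 2, …, m}, where p_i is a sample purity and m is the number of purities, which is also the number of tumor samples of different purities. Extract all read-related information from the structural variation detection result files of the different purities.
S202. Because the consistency and completeness of read alignments, and other alignment attributes, present differently for different structural variations, 26 features in total are extracted from this information for each candidate structural variation. Denote the set of all features as Θ = {Z_i, i = 1, 2, …, 26}, where Z_i is the collection of vectors of feature i over all purities. Different structural variation detection software writes different fields to its result files, so the extracted features differ as well; the features extracted in this step are therefore not fixed and can be extended.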
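As an illustration of step S2, the following minimal sketch turns one VCF data line into a numeric feature row. It is not the patent's extraction code: the text does not list the 26 features, so the INFO keys used here (SVLEN, SU, PE, SR, IMPRECISE) and the example record are hypothetical, and the choice to map missing or non-numeric fields to 0.0 is an assumption.

```python
def parse_info(info_field):
    """Split a VCF INFO column ("KEY=VAL;FLAG;...") into a dict."""
    info = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # bare keys are boolean flags
    return info


def extract_features(vcf_line, keys):
    """Return one numeric feature row for a candidate structural variation.

    The row is QUAL followed by the requested INFO fields; missing or
    non-numeric fields become 0.0 and flags become 1.0 (an assumption).
    """
    columns = vcf_line.rstrip("\n").split("\t")
    qual = columns[5]                 # VCF column 6: QUAL
    info = parse_info(columns[7])     # VCF column 8: INFO
    row = [0.0 if qual == "." else float(qual)]
    for key in keys:
        raw = info.get(key, 0.0)
        if raw is True:
            row.append(1.0)
            continue
        try:
            row.append(float(raw))
        except (TypeError, ValueError):
            row.append(0.0)
    return row


# Hypothetical record and INFO keys, for illustration only.
record = "1\t10500\tsv1\tACGT\tA\t37.5\t.\tSVTYPE=DEL;SVLEN=-3;SU=9;PE=6;SR=3;IMPRECISE"
features = extract_features(record, ["SVLEN", "SU", "PE", "SR", "IMPRECISE"])
print(features)
```

Stacking one such row per candidate gives a feature matrix of the kind the next step denotes X_p; which fields are worth extracting depends on the detection tool that wrote the file.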
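S3. Transfer of feature data across purities
Each feature vector is stored as one row, as an instance representing its candidate structural variation. The feature dataset of structural variation samples of purity p is denoted X_p, specifically:
Figure PCTCN2020120315-appb-000023
where
Figure PCTCN2020120315-appb-000024
is a 26-dimensional row vector and n is the number of instances.
Each feature dataset has a corresponding label set giving the classes, where 1 denotes the true-positive structural variation class and 0 the false-positive class. The label dataset of structural variation samples of purity p is denoted Y_p, specifically:
Figure PCTCN2020120315-appb-000025
where
Figure PCTCN2020120315-appb-000026
is the label corresponding to each feature vector.
Combining the features and labels above, the set of all structural variation candidate sets in the purity space is denoted Η = {(X_p, Y_p), p ∈ P}.
The invention uses a transfer model based on the transfer learning method transfer component analysis to transfer the structural variation feature datasets of different purities, pulling their data distributions closer together. This comprises the following steps: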
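S301. Select the source domain and target domain datasets;
The structural variation feature set of a fixed purity p in the purity space is taken as the target domain dataset D_t, specifically:
Figure PCTCN2020120315-appb-000027
where n_2 denotes the number of samples in the target domain,
Figure PCTCN2020120315-appb-000028
is the feature space and labels of the target domain, p is the target domain sample purity, and P is the set of sample purities.
The structural variation feature sets of the other purities p_j in the purity space are taken as source domain datasets D_s, specifically:
Figure PCTCN2020120315-appb-000029
where n_1 denotes the number of samples in the source domain,
Figure PCTCN2020120315-appb-000030
is the feature space and labels of the source domain data, and p_j is the source domain sample purity.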
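S302. Transfer component analysis uses the maximum mean discrepancy (MMD) to measure the distance between the distributions of the two domains;
The maximum mean discrepancy distance DISTANCE(D_s, D_t) is computed as:
Figure PCTCN2020120315-appb-000031
where x_i is a data point in the source domain, x_j is a data point in the target domain,
Figure PCTCN2020120315-appb-000032
is the data distribution mapping of the source domain, and
Figure PCTCN2020120315-appb-000033
is the data distribution mapping of the target domain.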
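To make the distance in S302 concrete, here is a small sketch of the squared maximum mean discrepancy for the special case of a linear kernel, where the feature mapping is the identity and the distance reduces to the squared Euclidean distance between the two domain means. The function names and toy data are illustrative assumptions, not taken from the patent.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def linear_mmd_squared(source, target):
    """Squared MMD between two samples under the linear kernel k(x, y) = x^T y.

    With the identity feature map this is ||mean(source) - mean(target)||^2.
    """
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy feature rows standing in for two purities (illustrative values).
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 2.0], [3.0, 6.0]]
distance = linear_mmd_squared(source, target)
print(distance)
```

A value of zero means the two samples have identical means in feature space; a nonlinear kernel would also compare higher-order structure, at the cost of working with the full kernel matrix.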
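S303. Solve for the maximum mean discrepancy distance by borrowing the kernel function idea of support vector machines;
First compute the maximum mean discrepancy distance matrix L, each element l_ij of which is given by:
Figure PCTCN2020120315-appb-000034
and the centering matrix H:
Figure PCTCN2020120315-appb-000035
where x_i is a data point in the source domain, x_j is a data point in the target domain,
Figure PCTCN2020120315-appb-000036
Figure PCTCN2020120315-appb-000037
identity matrix, n_1 denotes the number of samples in the source domain, and n_2 denotes the number of samples in the target domain;
Then map the datasets with the linear kernel function k(x, y) = x^T y
Figure PCTCN2020120315-appb-000038
Figure PCTCN2020120315-appb-000039
and construct the kernel matrix K:
Figure PCTCN2020120315-appb-000040
where K_{s,s} and K_{t,t} are the Gram matrices defined on the source domain and target domain data in the embedding space, K_{s,t} is the Gram matrix defined on the cross-domain data, and K_{t,s} = K_{s,t}^T.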
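The formula images for L, H and K are not reproduced in this text. For orientation, the transfer component analysis literature usually defines these matrices as below, using the symbols n_1, n_2, D_s and D_t introduced above. This is offered as the standard formulation, on the assumption that the patent follows it; it is not a transcription of the patent's figures.

```latex
\[
l_{ij} =
\begin{cases}
\dfrac{1}{n_1^{2}}, & x_i, x_j \in D_s,\\[4pt]
\dfrac{1}{n_2^{2}}, & x_i, x_j \in D_t,\\[4pt]
-\dfrac{1}{n_1 n_2}, & \text{otherwise},
\end{cases}
\qquad
H = I_{n_1+n_2} - \frac{1}{n_1+n_2}\,\mathbf{1}\mathbf{1}^{\mathsf T},
\qquad
K = \begin{bmatrix} K_{s,s} & K_{s,t}\\ K_{t,s} & K_{t,t} \end{bmatrix}
\]
\[
\mathrm{MMD}^{2}(D_s, D_t) = \operatorname{tr}(KL)
\]
```

Under these definitions the trace identity expands to the mean within-source kernel value, plus the mean within-target value, minus twice the mean cross-domain value, which is the kernel form of the squared distance between the two domain means.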
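S304. Compute the eigendecomposition of (KLK + μI)^(-1) KLK and take the first M eigenvectors to construct the feature data transformation matrix W from purity p_j to purity p, as follows:
Figure PCTCN2020120315-appb-000041
where
Figure PCTCN2020120315-appb-000042
is the dimension-reduced source domain transformation matrix and
Figure PCTCN2020120315-appb-000043
is the dimension-reduced target domain transformation matrix. The optimal feature dimension was found through repeated experiments, and M is set to 23.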
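Steps S301 to S304 can be sketched with NumPy as follows. This is a rough illustration under stated assumptions rather than the patent's implementation: it follows the commonly published TCA formulation, whose right-hand factor is KHK (with the centering matrix H), whereas the text above writes KLK; the function name `tca_linear`, the default μ = 1.0 and the random toy data are all hypothetical.

```python
import numpy as np


def tca_linear(Xs, Xt, dim, mu=1.0):
    """Project source and target features onto `dim` transfer components.

    Uses the linear kernel k(x, y) = x^T y and the leading eigenvectors of
    (K L K + mu I)^(-1) K H K (published TCA form; an assumption here).
    """
    Xs = np.asarray(Xs, dtype=float)
    Xt = np.asarray(Xt, dtype=float)
    n1, n2 = len(Xs), len(Xt)
    n = n1 + n2
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                  # kernel matrix over both domains
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    L = np.outer(e, e)                           # MMD coefficient matrix
    H = np.eye(n) - np.full((n, n), 1.0 / n)     # centering matrix
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)[:dim]      # largest eigenvalues first
    W = eigvecs[:, order].real
    Z = K @ W                                    # embedded samples, one row each
    return Z[:n1], Z[n1:]


# Toy data: 8 source rows and 6 shifted target rows with 5 features each.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(8, 5))
Xt = rng.normal(loc=1.0, size=(6, 5))
Zs, Zt = tca_linear(Xs, Xt, dim=3)
print(Zs.shape, Zt.shape)
```

In the setting described above, `Xs` and `Xt` would hold the 26-feature rows of a source purity p_j and the target purity p, and `dim` would be 23; the returned source rows would then train a classifier and the target rows would be its test set.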
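S4. Classification with the extremely randomized trees model
Transferring the structural variation feature datasets of different purities yields two dimension-reduced transformation matrices with 23 column vectors. Taking each column vector as a feature gives the new set of all structural variation features Θ':
Θ' = {Z′_i, i = 1, 2, …, 23}
where Z′_i is the collection of vectors of new feature i over all purities;
The transformation matrix W is taken as the feature dataset and the original label set Y_p as the corresponding label set, so that each candidate structural variation is represented by a row vector x′ of 23 features with its original label y. The invention trains a classification model based on the extremely randomized trees model to predict true- and false-positive structural variations, comprising the following steps:
S401. Select the target domain transformation matrix of purity p as the test set
Figure PCTCN2020120315-appb-000044
and take the source domain transformation matrices of every other purity p_j as training sets
Figure PCTCN2020120315-appb-000045
S402. Set the number of iterations to K; following the CART decision tree algorithm, use all training set samples
Figure PCTCN2020120315-appb-000046
to train each base classifier, iterating K times to generate K decision trees, which together form the extremely randomized trees model.
S403. Apply the generated extremely randomized trees model to the test set samples
Figure PCTCN2020120315-appb-000047
to generate predictions; tally the predictions of all base classifiers and use voting to produce the classification result for the training set of purity p_j, giving the following label set:
Figure PCTCN2020120315-appb-000048
where
Figure PCTCN2020120315-appb-000049
is the predicted label.
S404. The test set of each purity
Figure PCTCN2020120315-appb-000050
corresponds to training sets of multiple purities other than its own
Figure PCTCN2020120315-appb-000051
Each model trained on one of these training sets classifies true and false structural variations on the test set, yielding the label collection for all purity samples
Figure PCTCN2020120315-appb-000052
which contains m-1 label sets.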
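Steps S401 to S404 amount to training one Extra Trees model per source purity and letting each predict the same target set. The sketch below shows that loop with scikit-learn's `ExtraTreesClassifier`, which is an assumed stand-in for whatever implementation the authors used; the helper name, the tiny one-feature datasets and the fixed hyperparameters are illustrative, and the grid search mentioned elsewhere in the text is omitted.

```python
from sklearn.ensemble import ExtraTreesClassifier


def predict_from_each_source(source_sets, X_target, n_trees=100, seed=0):
    """Train one Extra Trees model per source purity and predict the target.

    `source_sets` maps a source purity to (X_source, y_source); the result
    maps each source purity to its list of 0/1 predictions for X_target.
    """
    predictions = {}
    for purity, (X_source, y_source) in source_sets.items():
        model = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed)
        model.fit(X_source, y_source)
        predictions[purity] = [int(label) for label in model.predict(X_target)]
    return predictions


# Toy one-feature training sets for two source purities (illustrative only).
source_sets = {
    10: ([[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]),
    20: ([[0.5], [1.5], [8.5], [9.5]], [0, 0, 1, 1]),
}
label_sets = predict_from_each_source(source_sets, [[1.0], [9.0]], n_trees=10)
print(label_sets)
```

With m purities in total, the dictionary returned for a given target purity holds m-1 label sets, one per source purity, which is the input the voting step expects.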
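S5. Prediction of the classification result
Every purity-specific predicted label set in the collection Υ' is valid data, so no single label set can serve as the final classification result. Majority voting is applied to the predicted labels of the m-1 purities, the result for each sample being the label that receives the most votes across all predicted label sets; this result is taken as the final predicted label set for true/false-positive structural variation classification, as follows:
Figure PCTCN2020120315-appb-000053
where
Figure PCTCN2020120315-appb-000054
is the predicted label of sample i, p is the sample purity, P is the set of sample purities, and n is the number of samples of different purities.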
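The majority vote of S5 can be sketched as a per-sample count over the m-1 predicted label sets. The tie rule (a tie resolves to 0, that is, false positive) is an assumption of this sketch; the source does not state one, and with the six purities used in the experiments there are five voters, so ties cannot occur. The function name is hypothetical.

```python
def majority_vote(label_sets):
    """Combine several 0/1 label lists into one by per-sample majority.

    label_sets: list of equal-length lists, one per source purity.
    Ties resolve to 0 (assumption; not specified in the source).
    """
    n_voters = len(label_sets)
    n_samples = len(label_sets[0])
    final = []
    for i in range(n_samples):
        votes_for_true = sum(labels[i] for labels in label_sets)
        final.append(1 if 2 * votes_for_true > n_voters else 0)
    return final

votes = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(majority_vote(votes))  # [1, 0, 1]
```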
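S6. Filtering of false-positive structural variations
In the predicted label set Y'_p, true-positive structural variations are labeled 1 and false-positive structural variations are labeled 0. Structural variations labeled 0 are filtered out, and those classified as true positives form the final output.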
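The filtering rule of S6 amounts to keeping the candidates whose final label is 1. A minimal sketch, with hypothetical candidate identifiers:

```python
def filter_false_positives(candidates, labels):
    """Keep only candidates whose predicted label is 1 (true positive)."""
    return [cand for cand, label in zip(candidates, labels) if label == 1]

# Hypothetical candidate identifiers, for illustration only.
kept = filter_false_positives(["sv_a", "sv_b", "sv_c"], [1, 0, 1])
print(kept)  # ['sv_a', 'sv_c']
```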
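To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, not all, of the embodiments of the present invention. The components of the embodiments generally described and shown in the drawings herein may be arranged and designed in a variety of configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.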
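To verify the effectiveness of the present invention, the necessity of transfer learning is tested first by applying the feature data sets before and after data transfer separately to the extremely randomized trees classification model. Then, to verify the feasibility of the present invention, two cases are tested: a structural variation candidate set with few samples, and a label set containing erroneous labels. Four metrics, accuracy, precision, recall and F1-score, are used to measure model performance.
Basic counts: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
Accuracy is defined as Accuracy = (TP + TN)/(TP + TN + FN + FP);
Precision is defined as Precision = TP/(FP + TP);
Recall is defined as Recall = TP/(TP + FN);
F1-score is defined as F1-score = (2·Precision·Recall)/(Precision + Recall).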
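The four definitions above translate directly into code. In the sketch below the example counts (TP = 49, FP = 0, TN = 50, FN = 1) are not reported in the source; they are a hypothetical confusion matrix chosen because, for a balanced set of 50 true-positive and 50 false-positive samples, they reproduce the values listed for data set 8 in Table 4. Returning 0.0 when a denominator is zero is also an assumption of the sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=49, fp=0, tn=50, fn=1)
print({name: round(100 * value, 2) for name, value in metrics.items()})
# {'accuracy': 99.0, 'precision': 100.0, 'recall': 98.0, 'f1': 98.99}
```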
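Tests were carried out on simulated data sets. The existing structural variation detection software Speedseq was used to obtain structural variation candidate-set samples at six different sample purities, P = {5, 10, 15, 20, 25, 30} (that is, sample purities of 5%, 10%, 15%, 20%, 25% and 30%). Because no existing algorithm considers the false positives caused by sample purity, and the present invention innovatively applies transfer learning to data transfer between samples of different purities, the necessity of transfer learning is tested first. The candidate set at each purity is a balanced data set of 4000 samples, with a 1:1 ratio of true-positive to false-positive samples. "TCA" denotes the classification results obtained with the transformation matrix produced by transfer component analysis, and "BASE" denotes the classification results obtained with the extracted feature data directly; the true/false-positive classification results are shown in Table 1.
Table 1: Classification results of feature data before and after transfer component analysis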
Figure PCTCN2020120315-appb-000055
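As can be seen, passing the feature data through transfer component analysis before classification clearly improves accuracy, precision, recall and F1-score at every purity. This confirms that using transfer learning to transfer structural variation feature data between purities substantially improves the overall performance of the classification model, and that the improvement is greater for low-purity samples.
To verify the feasibility of the present invention, experiments were carried out with few samples in the structural variation candidate set and with erroneous labels in the label set. For the few-sample case, 200, 400 and 600 samples were selected for testing, with equal numbers of true-positive and false-positive samples. For the erroneous-label case, the feature data set of 4000 samples was used, with label error rates of 10%, 20% and 30%; the error rate applies to all samples in the label set, which makes the classes imbalanced. The results for the few-sample and erroneous-label cases are shown in Table 2 and Table 3, and are compared in FIG. 2 and FIG. 3. In FIG. 2, datasize100 (200, 300) denote the number of samples per class in the three sample sets, the x-axis is the sample purity and the y-axis is the metric value. In FIG. 3, proportion10% (20%, 30%) denote the label error rates of the three sample sets, the x-axis is the sample purity and the y-axis is the metric value.
Table 2: Experimental results with few samples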
Figure PCTCN2020120315-appb-000056
Figure PCTCN2020120315-appb-000057
Table 3: Experimental results with erroneous labels
Figure PCTCN2020120315-appb-000058
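The erroneous-label setting can be simulated as in the sketch below. It assumes that errors are introduced by flipping a uniformly random subset of all labels, which is one plausible reading of an error rate that applies to the whole label set; the source does not describe the exact procedure, and the function name and seed are hypothetical.

```python
import random

def flip_labels(labels, error_rate, seed=0):
    """Return a copy of 0/1 labels with a random fraction flipped.

    Assumption: errors are drawn uniformly over all samples, so the
    noisy label set is generally no longer class-balanced.
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * error_rate)
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]

clean = [1] * 2000 + [0] * 2000          # balanced set of 4000 labels
noisy = flip_labels(clean, error_rate=0.10)
n_changed = sum(a != b for a, b in zip(clean, noisy))  # 400 labels differ
```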
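To further verify the ability of the present invention to separate true-positive from false-positive structural variations, 4 groups of lung cancer data and 4 groups of breast cancer data were obtained from the Gene+ public database to test performance on real data. Tumor purity in these two cancer types can be very low, which seriously affects the accuracy of structural variation detection. The raw sequencing reads were then mapped through a pipeline of BWA-0.7.5a and GATK MUTect, and CNVkit was used to detect the true structural variation information. For each group, 50 true-positive and 50 false-positive samples were randomly selected to form a balanced structural variation candidate set of 100 samples. The model was applied to the 8 data sets to identify true-positive and false-positive structural variations, with labels annotated by comparison against the standard results in the public database. The classification results are shown in Table 4 and plotted in FIG. 4, where the x-axis is the index of the real data set, the y-axis is the metric value, and the four metrics shown are accuracy, recall, F1-score and precision.
Table 4: Experimental results on real data sets (all values in %)
Dataset 1 2 3 4 5 6 7 8
Accuracy 73.00 81.00 81.00 87.00 84.00 90.00 86.00 99.00
Recall 88.00 90.00 78.00 90.00 96.00 94.00 90.00 98.00
F1-score 76.52 82.57 80.41 87.38 85.71 90.38 86.54 98.99
Precision 67.69 76.27 82.98 84.91 77.42 87.04 83.33 100.00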
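As with the simulated data sets, FPTLfilter accurately identifies false-positive structural variations, adapts well across different purities, significantly reduces false positives, and remains highly effective and stable on low-purity samples.
In summary, the present invention provides a false-positive structural variation filtering method that accounts for diluted sequencing signals, solving the problem that existing algorithms do not apply well to samples whose sequencing signals are diluted to different degrees. Because transfer component analysis is used to transfer data between tumor samples of different purities, the present invention overcomes the gap between sample feature data distributions caused by dilution of the sequencing signal, so that it performs well at different sample purities.
The above merely illustrates the technical idea of the present invention and does not limit its scope of protection; any modification made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the claims of the present invention.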

Claims (10)

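  1. A false-positive structural variation filtering method, characterized by comprising the following steps:
    S1. running an existing structural variation detection tool on data of different sample purities to detect structural variations, with the filter thresholds of the detection tool set to their minimum, to obtain a structural variation candidate set;
    S2. taking attributes that reflect structural variation as the effective classification features, and extracting features from the result files;
    S3. storing each feature vector as one row, as an instance representing its corresponding candidate structural variation; denoting the feature data set of structural variation samples of purity p as X_p and the label data set of structural variation samples of purity p as Y_p; combining the above features and labels, denoting all structural variation candidate sets in the purity space as Η; and using a transfer model based on the transfer learning method transfer component analysis to transfer data between the structural variation feature data sets of different purities, reducing the distance between the data distributions of different purities and thereby achieving feature data transfer across purities;
    S4. after the structural variation feature data sets of different purities are transferred, obtaining two dimension-reduced transformation matrices, each containing 23 column vectors; taking each column vector as one feature to obtain the new set of all structural variation features Θ'; using the transformation matrix W as the feature data set, with the original label set Y_p as the corresponding label set, each candidate structural variation being represented by a row vector x′ of 23 features and its label being the original label y; and training a classification model based on the extremely randomized trees model to predict true-positive and false-positive structural variations;
    S5. applying majority voting to the predicted labels of the m-1 purities, the result being the label that receives the most votes across all predicted label sets, and taking this result as the final predicted label set Y'_p for true/false-positive structural variation classification;
    S6. in the predicted label set Y'_p, true-positive structural variations being labeled 1 and false-positive structural variations being labeled 0, filtering out the structural variations labeled 0 and taking those classified as true positives as the final output, thereby completing the filtering of false-positive structural variations.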
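  2. The method according to claim 1, characterized in that step S2 specifically comprises:
    S201. denoting the set of all purities, the purity space, as P, and extracting all read-related information from the structural variation detection result files of different purities;
    S202. for each candidate structural variation, extracting 26 features from all the information, and denoting the set of all features as Θ.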
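  3. The method according to claim 1, characterized in that step S3 specifically comprises:
    S301. taking the structural variation feature set of a fixed purity p in the purity space as the target-domain data set D_t, and the structural variation feature sets of the other purities p_j in the purity space as the source-domain data sets D_s;
    S302. using, in transfer component analysis, the maximum mean discrepancy to measure the distance between the distributions of the two domains;
    S303. solving the maximum mean discrepancy distance by borrowing the kernel-function idea of support vector machines;
    S304. computing the eigendecomposition of (KLK + μI)^{-1}KLK, and taking the first M eigenvectors to construct the feature-data transformation matrix W from purity p_j to purity p.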
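  4. The method according to claim 3, characterized in that, in step S301, the target-domain data set D_t is specifically:
    Figure PCTCN2020120315-appb-100001
    where n_2 denotes the number of target-domain samples,
    Figure PCTCN2020120315-appb-100002
    are the feature space and labels of the target domain, p is the target-domain sample purity, and P is the set of samples of different purities;
    and the source-domain data set D_s is specifically:
    Figure PCTCN2020120315-appb-100003
    where n_1 denotes the number of source-domain samples,
    Figure PCTCN2020120315-appb-100004
    are the feature space and labels of the source-domain data, and p_j is the source-domain sample purity.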
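  5. The method according to claim 3, characterized in that, in step S302, the maximum mean discrepancy distance DISTANCE(D_s, D_t) is computed as follows:
    Figure PCTCN2020120315-appb-100005
    where x_i is a source-domain sample, x_j is a target-domain sample,
    Figure PCTCN2020120315-appb-100006
    is the data-distribution mapping of the source domain,
    Figure PCTCN2020120315-appb-100007
    is the data-distribution mapping of the target domain, n_1 denotes the number of source-domain samples, and n_2 denotes the number of target-domain samples.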
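  6. The method according to claim 3, characterized in that step S303 specifically comprises:
    first computing the maximum mean discrepancy matrix L, each element l_ij of which is given by:
    Figure PCTCN2020120315-appb-100008
    the centering matrix H being:
    Figure PCTCN2020120315-appb-100009
    where x_i is a source-domain sample, x_j is a target-domain sample,
    Figure PCTCN2020120315-appb-100010
    Figure PCTCN2020120315-appb-100011
    is the identity matrix of the indicated size, n_1 denotes the number of source-domain samples, and n_2 denotes the number of target-domain samples;
    then using the linear kernel function k(x, y) = x^T y to map the data sets
    Figure PCTCN2020120315-appb-100012
    Figure PCTCN2020120315-appb-100013
    and constructing the kernel matrix K as:
    Figure PCTCN2020120315-appb-100014
    where K_{s,s} and K_{t,t} are the Gram matrices defined on the source-domain and target-domain data in the embedded space, K_{s,t} is the Gram matrix defined on the cross-domain data, and K_{t,s} = K_{s,t}^T.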
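  7. The method according to claim 1, characterized in that step S4 specifically comprises:
    S401. selecting the target-domain transformation matrix of purity p as the test set
    Figure PCTCN2020120315-appb-100015
    S402. setting the number of iterations to K and, following the CART decision tree algorithm, using all of the training-set samples
    Figure PCTCN2020120315-appb-100016
    to train each base classifier over K iterations, generating K decision trees that together form the extremely randomized trees model;
    S403. applying the generated extremely randomized trees model to the test-set samples
    Figure PCTCN2020120315-appb-100017
    to generate predictions, tallying the predictions of all base classifiers, and using voting to produce the classification result of the model trained on the training set of purity p_j, giving the label set
    Figure PCTCN2020120315-appb-100018
    S404. the test set of each purity
    Figure PCTCN2020120315-appb-100019
    corresponding to the training sets of the multiple purities other than its own
    Figure PCTCN2020120315-appb-100020
    using the model trained on each training set to classify true and false structural variations in the test set, and obtaining the label collection Υ' over all purity samples, which contains m-1 label sets.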
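  8. The method according to claim 1, characterized in that, in step S5, the final predicted label set Y'_p is:
    Figure PCTCN2020120315-appb-100021
    where
    Figure PCTCN2020120315-appb-100022
    is the predicted label of sample i, p is the sample purity, P is the set of sample purities, and n is the number of samples of different purities.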
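  9. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs comprise instructions which, when executed by a computing device, cause the computing device to perform any one of the methods according to claims 1 to 8.
  10. A computing device, characterized by comprising:
    one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any one of the methods according to claims 1 to 8.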
PCT/CN2020/120315 2020-07-15 2020-10-12 False-positive structural variation filtering method, storage medium and computing device WO2022011855A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010681632.4 2020-07-15
CN202010681632.4A CN111863135B (zh) 2020-07-15 2020-07-15 False-positive structural variation filtering method, storage medium and computing device

Publications (1)

Publication Number Publication Date
WO2022011855A1 true WO2022011855A1 (zh) 2022-01-20

Family

ID=72984289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120315 WO2022011855A1 (zh) 2020-07-15 2020-10-12 False-positive structural variation filtering method, storage medium and computing device

Country Status (2)

Country Link
CN (1) CN111863135B (zh)
WO (1) WO2022011855A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117096070A (zh) * 2023-10-19 2023-11-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Domain-adaptation-based method for detecting anomalies in semiconductor processing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927753A (zh) * 2021-02-22 2021-06-08 中南大学 Transfer-learning-based method for identifying hot-spot residues at protein-RNA complex interfaces

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
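CN109658983A (zh) * 2018-12-20 2019-04-19 深圳市海普洛斯生物科技有限公司 Method and device for identifying and eliminating false positives in nucleic acid variation detection
CN109903815A (zh) * 2019-02-28 2019-06-18 北京化工大学 Feature-mining-based method for detecting gene inversion variation
CN110084314A (zh) * 2019-05-06 2019-08-02 西安交通大学 False-positive gene mutation filtering method for targeted-capture gene sequencing data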
US20200105373A1 (en) * 2018-09-28 2020-04-02 10X Genomics, Inc. Systems and methods for cellular analysis using nucleic acid sequencing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012034251A2 (zh) * 2010-09-14 2012-03-22 深圳华大基因科技有限公司 Method and system for detecting genomic structural variation
AU2017100960A4 (en) * 2017-07-13 2017-08-10 Macau University Of Science And Technology Method of identifying a gene associated with a disease or pathological condition of the disease
CN109280702A (zh) * 2017-07-21 2019-01-29 深圳华大基因研究院 Method and system for determining chromosomal structural abnormalities in an individual
CN111326212B (zh) * 2020-02-18 2023-06-23 福建和瑞基因科技有限公司 Method for detecting structural variation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
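CN117096070A (zh) * 2023-10-19 2023-11-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Domain-adaptation-based method for detecting anomalies in semiconductor processing
CN117096070B (zh) * 2023-10-19 2024-01-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Domain-adaptation-based method for detecting anomalies in semiconductor processing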

Also Published As

Publication number Publication date
CN111863135B (zh) 2022-06-07
CN111863135A (zh) 2020-10-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944873

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20944873

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/08/2023)
