WO2019047109A1 - 一种hpv精确分型的生物信息学分析方法及系统 - Google Patents

一种hpv精确分型的生物信息学分析方法及系统 Download PDF

Info

Publication number
WO2019047109A1
WO2019047109A1 PCT/CN2017/100927 CN2017100927W WO2019047109A1 WO 2019047109 A1 WO2019047109 A1 WO 2019047109A1 CN 2017100927 W CN2017100927 W CN 2017100927W WO 2019047109 A1 WO2019047109 A1 WO 2019047109A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
reads
hpv
sample
seq
Prior art date
Application number
PCT/CN2017/100927
Other languages
English (en)
French (fr)
Inventor
柴相花
王书元
刘强
袁玉英
张红云
刘娜
尹烨
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司 filed Critical 深圳华大基因股份有限公司
Priority to PCT/CN2017/100927 priority Critical patent/WO2019047109A1/zh
Priority to CN201780093704.XA priority patent/CN111032885B/zh
Publication of WO2019047109A1 publication Critical patent/WO2019047109A1/zh

Links

Images

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/70Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage

Definitions

  • the invention belongs to the field of bioinformatics and relates to a bioinformatics analysis method and system for accurate classification of HPV.
  • Human papillomavirus is an epithelial virus belonging to the genus Papillomavirus A of the papovavirus family. It is a spherical DNA virus that can cause squamous epithelial proliferation of human skin mucosa.
  • HPV isolated which can be divided into four categories according to the invading tissue parts and the disease-causing strength: (1) low-risk skin types (such as HPV2, 3, 7, 10, etc.) can cause Skin sputum; (2) high-risk skin type (such as HPV5, 20, 38, etc.) can cause benign skin spasm, actinic keratosis, non-melanoma skin cancer, etc.; (3) low-risk mucosa (such as HPV6, 11 13,13, etc.) can cause benign genital warts; (4) high-risk mucosa (such as HPV16, 18, 31, 33, etc.) can cause malignant tumors, the number of cancers induced accounts for 5% of all human cancers, equivalent to viruses One third of all cancers induced.
  • low-risk skin types such as HPV2, 3, 7, 10, etc.
  • high-risk skin type such as HPV5, 20, 38, etc.
  • benign skin spasm such as HPV6, 11 13,13, etc.
  • benign genital warts such as HP
  • HPV16 has the highest degree of malignancy, and about 50% of cervical cancers worldwide are caused by HPV16. Cervical cancer is the most common gynecological tumor and the second most common malignant tumor that threatens women's lives. In 2012, there were approximately 528,000 cases of cervical cancer, with a death toll of 266,000. About 70% of cervical cancers occur in developing countries. According to statistics, about 70% of cervical cancer is caused by HPV16 and HPV18 infection. Therefore, accurate and efficient HPV type identification is an important means to prevent cervical cancer, and it is also an important measure to reduce female mortality.
  • the detection methods for HPV genotyping are mainly molecular biological methods, and generally include three types: (1) nucleic acid hybridization detection methods, including Southern blotting, in situ hybridization and spot blot hybridization, wherein Southern blotting is HPV.
  • nucleic acid hybridization detection methods including Southern blotting, in situ hybridization and spot blot hybridization, wherein Southern blotting is HPV.
  • the gold standard for genotyping, and the presence of HPV can be linked to morphology, but this method is low in sensitivity, time consuming, large in the amount of purified DNA, and not suitable for the detection of easily degradable DNA
  • Signal amplification detection method including HPV and HC2
  • this method can be used for quantitative detection of HPV, and is also an FDA-approved test method with low false positive rate and high sensitivity.
  • this method is limited by patent and requires permission to use, and is not suitable for HPV specific type.
  • nucleic acid amplification assays including microarray analysis, PCR, PCR-RFLP, Real-time PCR, Abbott Real-time PCR, HPV genome sequencing, etc. This method is flexible in terms of viral load and genotype, has very high sensitivity, and can perform multi-sample detection, but The amplification signal for certain specific types of HPV is low, and previously amplified material contamination may result in false positives.
  • 201080070484.7 discloses a method and system for biological analysis of HPV precise typing, which groups the sequencing fragments obtained by high-throughput sequencing, and compares with the reference genome sequence to determine the HPV type or negative of the sequence fragment.
  • the sequence fragments of the determined type are combined according to the samples, and the number and proportion of the sequence fragments of the determined type are selected to determine whether the HPV type of each sample is determined to be negative.
  • the method utilizes bioinformatics analysis methods and technical means to quickly detect a large number of samples and quickly complete the detection of infected HPV types.
  • the sequencing amount of the library is the average sequencing amount in an ideal case, and then it is judged whether or not the type is infected according to whether the ratio of the number of sequence fragments supporting the HPV type to the total number of sequence fragments reaches a predetermined threshold, and not only the modification is modified in the process.
  • the type discriminant criterion of the method adopts the number of absolute sequence fragments, which is greatly influenced by the absolute data amount of the sample, and the false positive rate is high.
  • HPV typing detection technology has become an urgent problem to be solved in the field.
  • the present invention provides a bioinformatics analysis method and system for HPV accurate typing to overcome the shortcomings of the prior art that the accuracy is poor, the sensitivity is low, the specificity is poor, the false negative rate and the false positive rate are high.
  • the invention provides a bioinformatics analysis method for HPV accurate typing, comprising the following steps:
  • NGS high-throughput sequencing technology
  • the HPV type is determined by determining the HPV type of reads sequence using the LDA model, and finally confirming the HPV type of each read sequence.
  • the Bayesian classifier can be expressed as follows: for a sample with a given eigenvalue x, if the posterior probability of positive
  • the above assumptions cannot be strictly established, and the overall mean ⁇ 0,1 and the covariance matrix ⁇ are unknown, and thus the above Bayesian classifier is not available.
  • the mean can be estimated from the sample.
  • Covariance matrix The above formula is still used for classification, this is the LDA model.
  • the threshold C can be adjusted according to needs, for example, it is more important to reduce the false negative rate when performing related detection than to reduce the false positive rate, and a value of C ⁇ 0.5 should be selected.
  • models that can be considered in the present invention include a logistic regression model, an LDA model, a QDA model, and the like.
  • logistic regression has the disadvantage of model instability in the case where the eigenvalues of the two classifications (infected and uninfected) are far apart.
  • the classification boundary of LDA is (high-dimensional) plane, and the classification boundary of QDA is curved surface, the influence of large random fluctuation of eigenvalue on LDA is much less than that of QDA.
  • the present invention selects an LDA model for type determination.
  • the threshold is selected based on the assumption that the false negative rate is less than 5%, and the sum of the false negative rate and the false positive rate is minimized.
  • the sum of the false negative rate and the false positive rate is 7% to 10%, and in one embodiment of the present invention, the sum of the false negative rate and the false positive rate is 10%.
  • the analysis method further comprises the step of pre-processing
  • the pre-processing step comprises: filtering the sequence fragments obtained by the high-throughput sequencing technology to remove the unqualified sequence, further reducing the influence of the unqualified sequence, and further improving the accuracy of the detection analysis, thereby obtaining “ Clean "sequence.
  • the filtering specifically comprises the following steps:
  • step c) when the sequencing quality of the bases in the reads sequence is lower than the sequencing quality threshold and lower than the sequencing quality
  • the sequence of the reads is determined as a failed sequence and filtered; otherwise, the process proceeds to step c);
  • step d) when the number of undetermined bases in the sequencing result of the reads sequence exceeds 10% of the number of bases of the entire sequence, the reads sequence is determined as a failed sequence and filtered; otherwise, proceeds to step d);
  • Step 2 when the sequencing result of the reads sequence is aligned with the linker sequence library, if the sequencing linker sequence exists in the reads sequence, the reads sequence is determined as a failed sequence and filtered; otherwise, the qualified reads sequence is determined.
  • step clustering of step 2) specifically includes:
  • the HPV reference sequence set of step 3) comprises a HBB (ie hemoglobin subunit beta) sequence set and a HPV type sequence set for negative control; HBB as internal quality control, mainly To identify false negatives due to insufficient DNA or PCR amplification failure.
  • HBB hemoglobin subunit beta
  • the statistic of step 3) is that the comparison result is performed according to one row of each sample, and each type of column is counted to obtain a reads distribution matrix file; the statistical result file may also be outputted as shown in Table 1.
  • the total number of reads is 3327, where the number of reads on the alignment is 1115, and the number of reads on the unmatched is 2212.
  • the reads on the comparison with HBB the reads on the comparison with HBB.
  • the number is 1110, the number of reads on the HPV16, HPV18, HPV31, and HPV35 is 0, and the number of reads on the HPV33 and HPV45 is 1.
  • step 3 the clustered reads sequence is compared with the HPV reference sequence set, preferably by BWA (V0.6.2-r126) software, and the compared files are output; any other option may be selected.
  • the software is applied, and the present invention is not specifically limited.
  • the HPV typing of step 4) comprises the following steps:
  • each sample is judged to be negative or positive, if it is negative, the result is output; if it is positive, proceed to step h);
  • Negative or positive one-by-one HPV type is judged, that is, the type of HPV infected with each sample is judged.
  • said determining whether each sample is negative or positive in total comprises the following steps:
  • the score calculation formula is:
  • the preset threshold C ranges from 0.4 to 0.6, and may be, for example, 0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56, 0.58 or 0.6 and all points between them, limited to the length The restrictions are not listed here, more preferably 0.5;
  • the calculating and analyzing by the training set specifically includes: calculating a parameter by using the following formula based on the training set sample with Where N 0 is the negative sample size and N 1 is the positive sample size:
  • the scores of various types in the training set are obtained, and the obtained scores are combined with the pathological analysis results to adjust the preset threshold C.
  • the negative or positive determination of the HPV type by:
  • the number of positive samples of the HPV type is ⁇ 9
  • an LDA model is established for the HPV type, and the number of reads on the total alignment, the number of HBB reads, and the number of HPV type reads are characteristic values to the HPV.
  • the total number of negative samples and the total number of positive samples are the corresponding variables, and it is judged that each of the reads sequence samples is negative or positive;
  • the relative number of reads, the number of HBB reads, and the number of HPV type reads on the total alignment are relative values
  • the determining whether each of the reads sequence samples is negative or positive further includes:
  • the present invention provides a bioinformatics analysis system for performing HPV accurate typing according to the analysis method of the first aspect, comprising:
  • Receiver module used to receive the sequencing fragments obtained by high-throughput sequencing technology, and obtain each sample Read sequence
  • a clustering module connected to the receiving module, configured to group the reads sequence according to the label sequence and the primer sequence to obtain a clustered reads sequence;
  • the comparison statistics module is connected to the clustering module, and is used for comparing and screening the clustered reads sequence with the HPV reference sequence set, determining the HPV type or negative of the read reads sequence, and performing statistics;
  • HPV typing module used to determine the HPV type of reads sequence using the LDA model for HPV typing, and finally confirm that the HPV type of each reads sequence is negative.
  • the system further comprises a pre-processing module, which is mainly used for filtering each read sequence, removing the unqualified sequence, and obtaining a "clean" read sequence.
  • a pre-processing module which is mainly used for filtering each read sequence, removing the unqualified sequence, and obtaining a "clean" read sequence.
  • the "clean" reads sequence is a sequence that satisfies one of the following conditions:
  • the number of "N" bases in the sequence is less than 10% of the number of bases in the entire sequence
  • the present invention has at least the following beneficial effects:
  • the bioinformatics typing method and system for accurate classification of HPV provided by the invention overcomes the shortcomings of poor precision, low sensitivity, poor specificity, false negative rate and high false positive rate in the prior art, and provides accurate classification of HPV type. It provides accurate classification results for HPV common screening and clinical trials, and provides protection for cervical cancer, oral cancer and prostate cancer.
  • FIG. 1 is a schematic flow chart of a bioinformatics analysis method for HPV accurate typing of the present invention
  • Figure 2 is a graph of the performance evaluation ROC analysis results.
  • FIG. 1 The entire flow chart of the HPV accurate typing bioinformatics analysis method is shown in Figure 1, which includes sequencing, sample preprocessing, group clustering, alignment and statistics, establishing an LDA model and testing for each HPV typing.
  • the result of the set is determined as follows:
  • This example is based on 3331 samples of pathological analysis on the Miseq platform SE150 (ie, these samples are known to be negative or HPV type), and 65 types of HPV types are validated, including 16 major types (including 14 High-risk type and 2 low-risk types) and 49 secondary types.
  • the number of the sample has been randomized.
  • NGS high-throughput sequencing technology
  • a sequencing quality threshold of a predetermined unqualified base for example, if the average quality of the sequencing is less than 20, it is considered to be a non-conforming sequence
  • step c) when the sequencing quality of the base in the reads sequence of the sample is lower than the sequencing quality threshold, specifically, when the average base quality value of the sequence is less than 15, the reads sequence of the sample is determined as a failed sequence and filtered; Otherwise, proceed to step c);
  • the reads sequence of the sample is determined as not Pass the sequence and filter it; otherwise, proceed to step d);
  • Original sequence number 534036 Clean sequence number and its ratio to the original sequence 499902 93.61 Number of joint contamination sequences and their ratio to clean sequences 21459 4.29 Number of library contamination sequences and their ratio to clean sequences 12424 2.49 Low-quality sequence numbers and their ratio to clean sequences 25 0.01 Number of sequences containing N bases and their ratio to clean sequences 226 0.05
  • Tag identification Tag sequence Tag identification Tag sequence
  • MGIP-001 SEQ No. 1
  • TACGCTGTAC MGIP-086 SEQ No. 86
  • MGIP-002 (SEQ No. 2) TATGTGTACT MGIP-087 (SEQ No. 87) TATCGTCGTC MGIP-003 (SEQ No. 3) TGACTCAGAC MGIP-088 (SEQ No. 88) TCATCGAGCT MGIP-004 (SEQ No. 4) CTAGATGTCA MGIP-089 (SEQ No. 89) ACTATCGCTA MGIP-005 (SEQ No. 5) GATGACTCTC MGIP-090 (SEQ No. 90) GCTACTGATG MGIP-006 (SEQ No. 6) TGTAGTGAGT MGIP-091 (SEQ No.
  • GCTATAGTCA MGIP-107 (SEQ No. 107) CAGAGTCATG MGIP-023 (SEQ No. 23) CGTCTCATGC MGIP-108 (SEQ No. 108) AGTACGATGC MGIP-024 (SEQ No. 24) ACGATGCTAT MGIP-109 (SEQ No. 109) GCTCTCACTG MGIP-025 (SEQ No. 25) GAGTGTACTA MGIP-110 (SEQ No. 110) TAGCTCGCTG MGIP-026 (SEQ No. 26) GTCATACGTG MGIP-141 (SEQ No. 111) GTGAGCTATC MGIP-027 (SEQ No.
  • ATCTGAGTAC MGIP-142 (SEQ No. 112) CAGTCTGATA MGIP-028 (SEQ No. 28) CGATAGCATC MGIP-143 (SEQ No. 113) TACATGCTCT MGIP-029 (SEQ No. 29) ACTGATCTCA MGIP-144 (SEQ No. 114) TAGTCTCGCT MGIP-030 (SEQ No. 30) CTCGATACTA MGIP-145 (SEQ No. 115) CGCTACGACT MGIP-031 (SEQ No. 31) CATGTGACTG MGIP-146 (SEQ No. 116) TCGATCTGTA MGIP-032 (SEQ No.
  • MGIP-046 (SEQ No. 46) GTAGTGCTCT MGIP-161 (SEQ No. 131) TGTCGCATAT MGIP-047 (SEQ No. 47) CTGACGAGCT MGIP-162 (SEQ No. 132) ACACTGCTCA MGIP-048 (SEQ No. 48) ACACGCACTA MGIP-163 (SEQ No. 133) ATACTGTGAC MGIP-049 (SEQ No. 49) CTCGCACTAC MGIP-164 (SEQ No. 134) CTACGCATCA MGIP-050 (SEQ No. 50) AGATCTCACT MGIP-165 (SEQ No. 135) ACGAGCTAGA MGIP-051 (SEQ No.
  • ATACTAGTGT MGIP-166 (SEQ No. 136) GTCGATGAGA MGIP-052 (SEQ No. 52) ATATCTCGTA MGIP-167 (SEQ No. 137) CGCTGTGATC MGIP-053 (SEQ No. 53) TGACTGCGTA MGIP-168 (SEQ No. 138) TCGTCACTAT MGIP-054 (SEQ No. 54) TGTAGACGTA MGIP-169 (SEQ No. 139) CTCTGTATGC MGIP-055 (SEQ No. 55) AGAGACTATG MGIP-170 (SEQ No. 140) ACTATGAGCT MGIP-056 (SEQ No.
  • CATGAGTAGA MGIP-171 (SEQ No. 141) CACTGCTCTC MGIP-057 (SEQ No. 57) TGACAGCTAC MGIP-172 (SEQ No. 142) ACTGAGCATC MGIP-058 (SEQ No. 58) CGCTAGACAT MGIP-173 (SEQ No. 143) TCTATGATAC MGIP-059 (SEQ No. 59) CGTAGATATG MGIP-174 (SEQ No. 144) CTCACTATCA MGIP-060 (SEQ No. 60) TGAGTCTGCT MGIP-175 (SEQ No. 145) TCGACGCACT MGIP-061 (SEQ No.
  • CATCACGCAC MGIP-191 (SEQ No. 161) CGTGTCGCTC MGIP-077 (SEQ No. 77) AGCATGTGAT MGIP-192 (SEQ No. 162) ATCGCATCGT MGIP-078 (SEQ No. 78) GCTATGTAGT MGIP-193 (SEQ No. 163) GCTGATGTAC MGIP-079 (SEQ No. 79) AGACGTAGCT MGIP-194 (SEQ No. 164) TGCGACGTGC MGIP-080 (SEQ No. 80) CAGACATAGA MGIP-195 (SEQ No. 165) ATCAGATCTC MGIP-081 (SEQ No.
  • MGPA (SEQ No. 171) TTTGTTACTGTGGTGGATACTAC MGPB (SEQ No. 172) TTTGTTACCGTTGTTGATACTAC MGPC (SEQNo.173) TTTGTTACTAAGGTAGATACCACTC MGPD (SEQ No. 174) TTTGTTACTGTTGTGGATACAAC MGP31 (SEQ No. 175) TTTGTTACTATGGTAGATACCACAC MGPG (SEQ No. 176) GAAAAATAAACTGTAAATCATATTCCT MGPH (SEQ No. 177) GAAAAATAAATTGTAAATCATACTC MGPI (SEQ No. 178) GAAATATAAATTGTAAATCAAATTC MGPJ (SEQ No.
  • GAAAAATAAACTGTAAATCATATTC MGP18 (SEQ No. 180) GAAAAATAAACTGCAAATCATATTC GP5+ (SEQ No. 181) TTTGTTACTGTGGTAGATACTAC MS3 (SEQNo. 182) AATATATGTGTGCTTATTTG MS10 (SEQ No. 183) AGATTAGGGAAAGTATTAGA MGP58 (SEQ No. 184) TTTGTTACTGTAGTTGATACCACTC MGP52 (SEQ No. 185) TTTGTCACAGTTGTGGATACCACTC
  • the clustered reads sequence is compared to the HPV reference sequence set, and the compared statistical result file StatMap.txt and Reads distribution matrix file RDisMat.txt are obtained.
  • the StatMap.txt file counts the total number of reads, the number of reads on the alignment, and the number of reads on the unmatched in each sample (one total of 3331 samples) per line per sample; In the upper reads, the number of reads on the HBB and the number of reads in each HPV type are shown in Table 4.
  • Example 5 establishes an LDA model for each HPV classification
  • 60% of the 3331 samples in the SAM file are randomly divided into 1999 samples as the training set, and the remaining 40%, ie, 1332 samples, are used as the test set.
  • the training set is used to establish the LDA model and threshold C for each HPV classification.
  • N 0 is a negative sample amount
  • N 1 is a positive sample amount
  • the training parameters with Bring into the score calculation formula, get the score of each sample population in the training set, compare the score with C (0.5), if it is greater than C (0.5), it is judged as positive, otherwise it is judged as negative;
  • a) will parameters with Bring into the following formula, calculate the score of each sample in the test set, compare the score with the threshold C (0.5); if it is greater than C (0.5), judge it as positive, and enter step b); otherwise, judge negative;
  • the performance evaluation mainly compares the method and system of the present invention with the old NGS-based HPV detection technology, and the performance evaluation strategy mainly uses the receiver operating characteristic curve (ROC) analysis, and the results are shown in FIG. 2 .
  • the method and system of the present invention i.e., HPV-AGM
  • the specificity and sensitivity are superior to the old model. type.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Engineering & Computer Science (AREA)
  • Immunology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Virology (AREA)
  • Biotechnology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

本发明公开了一种HPV精确分型的生物信息学分型方法及系统,所述方法包括:接收高通量测序技术得到的测序片段,得到每个样本的reads序列;将所有样本的reads序列进行分组聚类,将聚类后的reads序列与HPV参考序列集进行比对和筛选,确定筛选后的reads序列的匹配结果;对确定HPV型别的reads序列采用LDA模型进行HPV分型,最终确认每个reads序列的HPV型别。

Description

一种HPV精确分型的生物信息学分析方法及系统 技术领域
本发明属于生物信息学领域,涉及一种HPV精确分型的生物信息学分析方法及系统。
背景技术
人乳头瘤病毒(HPV)是一种嗜上皮性病毒,属于乳多空病毒科的乳头瘤空泡病毒A属,是球形DNA病毒,能引起人体皮肤黏膜的鳞状上皮增殖。至今,被分离出的HPV已经有170多种,根据侵犯的组织部位和致病强弱不同可分为四类:(1)皮肤低危型(如HPV2、3、7、10等)可以引起皮肤疣;(2)皮肤高危型(如HPV5、20、38等)可以引起良性皮肤疣,光化性角化病,非黑瘤皮肤癌等;(3)黏膜低危型(如HPV6、11、13、32等)可以引起良性生殖器湿疣;(4)黏膜高危型(如HPV16、18、31、33等)可以引发恶性肿瘤,诱发的癌症数量占人类所有癌症数量的5%,相当于病毒诱发的所有癌症数量的1/3。其中,HPV16恶性程度最高,在世界范围内,约50%的宫颈癌是由HPV16引起的。宫颈癌是最常见的妇科肿瘤,也是威胁女性生命安全的第二大恶性肿瘤。2012年,约有528,000例宫颈癌病例,死亡人数达266,000人。约70%的宫颈癌发生在发展中国家。据统计,约70%的宫颈癌是由HPV16和HPV18感染所致。因此精准高效的进行HPV型别鉴定是有效预防宫颈癌的重要手段,也是降低女性死亡率的一个重要举措。
目前,用于HPV基因分型的检测方法主要是分子生物学方法,大致包括三种:(1)核酸杂交检测法,包括Southern印迹,原位杂交和斑点印记杂交等,其中Southern印迹法是HPV基因分型的金标准,同时HPV的存在可以与形态 学联系起来,但是这种方法灵敏度低,耗时长,纯化DNA的起始量大,并且不适用于容易降解的DNA的检测;(2)信号放大检测法,包括
Figure PCTCN2017100927-appb-000001
HPV和HC2,这种方法可以进行HPV定量检测,也是FDA批准的检测方法,假阳性率低,灵敏度高,但是这种方法受专利的限制,需要得到许可才能使用,同时不适合HPV特定型别的鉴定及多重HPV感染的检测;(3)核酸扩增检测法,包括微阵列分析,
Figure PCTCN2017100927-appb-000002
PCR,PCR-RFLP,Real-time PCR,Abbott Real-time PCR,HPV genome sequencing等,这种方法在病毒载量和基因型方面比较灵活,有非常高的灵敏度,且可以进行多样本检测,但是对某些特定型别的HPV的扩增信号较低,先前放大的材料污染可能导致假阳性。
201080070484.7公开了一种HPV精确分型的生物学分析的方法及系统,该方法将高通量测序获得的测序片段进行分组,与参考基因组序列进行比对后确定序列片段的HPV型别或阴性,对确定型别的序列片段按照样本进行合并,根据确定型别的序列片段的数量和比例进行筛选,最终确定每个样本的HPV型别或者确定为阴性。该方法利用生物信息学的分析方法及技术手段,实现了快速检测大量样本、快速完成对感染HPV型别的检测,然而在型别鉴定过程中,需要将每个样品的序列数量按比例缩放到文库的测序量为理想情况下的平均测序量,然后根据支持HPV型别的序列片段数占总序列片段数的比例是否达到预定阈值来判断是否感染了该型别,该过程中不仅修改了每个样品的总序列片段数,而且仅根据比例是否达到预定阈值来判断是否感染了该型别,判断依据较为单薄无力,因此并不能实现对HPV的精确分型。此外,该方法的型别判别标准采用的是绝对序列片段数,受样本绝对数据量的影响较大,假阳性率较高。
因此,提供一种高精准、高灵敏度、高特异性、低假阴性率和低假阳性率 的HPV分型检测技术成为本领域亟待解决的问题。
发明内容
针对上述问题,本发明提供一种HPV精确分型的生物信息学分析方法及系统,以克服现有技术精确度差、灵敏度低、特异性差、假阴性率和假阳性率高的缺点。
本发明提供一种HPV精确分型的生物信息学分析方法,包括以下步骤:
1)接收高通量测序技术(NGS)得到的测序片段,得到reads序列;
2)将reads序列进行分组聚类,得到每个样本的reads序列;
3)将每个样本的reads序列与HPV参考序列集进行比对和筛选,确定筛选后的reads序列的匹配结果(即每个样本中的总reads数、比对上的reads数、未比对上的reads数;在比对上的reads中,与HBB比对上的reads数和各HPV分型的reads数),并进行统计;
4)对确定HPV型别的reads序列采用LDA模型进行HPV分型,最终确认每个reads序列的HPV型别。
在LDA模型(Linear Discriminant Analysis)的分类分析中,假设在阴性、阳性两个分类(y=0与y=1)中特征值矢量均为正态分布,分别具有均值μ0,1与协方差矩阵∑,且两分类的先验概率为π0,1,则贝叶斯分类器可表示为如下形式:对给定特征值x的某样本,若阳性的后验概率
Figure PCTCN2017100927-appb-000003
则将对象归为阳性(y=1),否则归为阴性(y=0),其中C=0.5为阈值。实际分析时,上述假设不可能严格成立,且全体的均值μ0,1与协方差矩阵∑是 未知量,因而上述贝叶斯分类器是无法获得的。然而,在上述假设近似成立的情况下,可由样本估计均值
Figure PCTCN2017100927-appb-000004
与协方差矩阵
Figure PCTCN2017100927-appb-000005
仍然应用上述公式进行分类,此即LDA模型。此时阈值C可依据需要调节,例如进行相关检测时降低假阴性率比降低假阳性率更重要,则应选用C<0.5的值。
根据需求,在本发明中可考虑的模型有逻辑回归模型、LDA模型、QDA模型等。但结合数据特点,在两个分类(感染与未感染)特征值相差较远的情况下,逻辑回归有模型不稳定的缺点。然而,因为LDA的分类边界为(高维)平面,而QDA的分类边界为曲面,特征值大幅随机波动对LDA的影响要远小于对QDA的影响。鉴于实验上无法消除特征值的大幅随机波动,本发明选择LDA模型进行型别的判定。
在分类问题中,一般假阳性率(FPR)下降则假阴性率(FNR)上升,反之亦然。因此,在本发明中,阈值的选择依据是在保证假阴性率小于5%的前提下,尽量降低假阴性率与假阳性率之和。
优选地,假阴性率和假阳性率之和为7%~10%,本发明的一个实施例中,假阴性率和假阳性率之和为10%。
优选地,所述分析方法还包括预处理的步骤;
优选地,所述预处理步骤具体包括:对高通量测序技术得到的序列片段进行过滤,除去不合格的序列,以进一步降低不合格序列的影响,进一步提高检测分析的准确性,从而得到“干净的”序列。
优选地,所述过滤具体包括以下步骤:
a)预设不合格碱基的测序质量阈值和比例阈值;
b)当reads序列中碱基的测序质量低于所述测序质量阈值,且低于测序质 量阈值的碱基个数占整条序列碱基个数的比例超过所述比例阈值时,将该reads序列判定为不合格序列并加以过滤;否则,进入步骤c);
c)当reads序列的测序结果中不确定的碱基个数超过整条序列碱基个数的10%时,将该reads序列判定为不合格序列并加以过滤;否则,进入步骤d);
d)当reads序列的测序结果与接头序列库进行比对时,如果reads序列中存在测序接头序列,则将该reads序列判定为不合格序列并加以过滤;否则,判定为合格的reads序列,进行步骤2)。
优选地,步骤2)所述分组聚类具体包括:
e)将reads序列按照标签序列和引物序列进行聚类;
f)截取每个reads序列中对应的标签序列和引物序列并进行标识,得到聚类后每个样本的reads序列。
优选地,步骤3)所述HPV参考序列集包括用于阴性质控的HBB(即人类基因组的血红蛋白β亚基,hemoglobin subunit beta)序列集和HPV型别序列集;HBB作为内部质控,主要为了识别由于DNA量不足或PCR扩增失败导致的假阴性。
优选地,步骤3)所述统计为将比对结果按照每个样本一行,每种型别一列进行统计,得到reads分布矩阵文件;统计的结果文件也可以如表1所示的形式输出。
例如,编号为S001的样本,总的reads数为3327,其中比对上的reads数为1115,未比对上的reads数为2212;在比对上的reads中,与HBB比对上的reads数为1110,与HPV16、HPV18、HPV31、HPV35比对上的reads数均为0,与HPV33和HPV45比对上的reads数均为1。
表1
Figure PCTCN2017100927-appb-000006
在本发明中,步骤3)中将聚类后的reads序列与HPV参考序列集进行比对优选运用BWA(V0.6.2-r126)软件进行,并输出比对后的文件;也可选用其他任何适用的软件进行,本发明没有具体限制。
优选地,步骤4)所述HPV分型包括以下步骤:
g)根据步骤3)的reads分布矩阵判断每个样本总体为阴性或阳性,若为阴性,则输出结果;若为阳性,进入步骤h);
h)逐个HPV型别判断阴性或阳性,即判断每个样本感染的HPV的型别。
优选地,所述判断每个样本总体为阴性或阳性包括以下步骤:
a’)预设阈值C,通过训练集计算和分析,调整预设的阈值C;
b’)针对训练集样本观测数据,将训练参数
Figure PCTCN2017100927-appb-000007
Figure PCTCN2017100927-appb-000008
带入到分值计算公式中, 得到每个样本总体的分值,所述分值计算公式为:
Figure PCTCN2017100927-appb-000009
c’)将分值与预设的阈值C进行比较,若大于C,则判定为阳性,否则判定为阴性;
优选地,预设的阈值C的范围为0.4~0.6,例如可以是0.4、0.42、0.44、0.46、0.48、0.5、0.52、0.54、0.56、0.58或0.6及其之间所有的点值,限于篇幅的限制,在此不再一一列举,更优选为0.5;
优选地,所述通过训练集计算和分析具体包括:基于训练集样本,运用下面公式计算出参数
Figure PCTCN2017100927-appb-000010
Figure PCTCN2017100927-appb-000011
其中N0为阴性样本量,N1为阳性样本量:
Figure PCTCN2017100927-appb-000012
再通过公式:
Figure PCTCN2017100927-appb-000013
得到训练集中各种型别的分值,将得到的分值与病理分析结果结合,用于调整预设的阈值C。
优选地,所述逐个HPV型别判断阴性或阳性包括:
若该HPV型别的阳性样本数量≥9,则对该HPV型别建立LDA模型,以总比对上的reads数、HBB reads数和该HPV型别reads数为特征值,以该HPV 型别的阴性样本总数量和阳性样本总数量为相应变量,判断每个reads序列样本为阴性或阳性;
若该HPV型别的阳性样本数量小于9,则将其余所有具有≥9的阳性样本数量的HPV型别建立的LDA模型用于该HPV型别,取平均结果后,判断每个reads序列样本为阴性或阳性;
优选地,所述总比对上的reads数、HBB reads数和该HPV型别reads数均使用相对值;
优选地,所述判断每个reads序列样本为阴性或阳性还具体包括:
d’)针对上述步骤中测试集样本的观测数据,将训练参数
Figure PCTCN2017100927-appb-000014
Figure PCTCN2017100927-appb-000015
带入到分值计算公式中:
Figure PCTCN2017100927-appb-000016
得到测试集中每个样本总体的分值,将该分值与C进行比较,若大于C则判定为阳性,否则判定为阴性;
e’)对测试集中的阳性样本进行分型:依次对每个HPV型别考虑,若在训练集中曾对该HPV型别建立LDA模型,则将该模型应用于测试集中的阳性样本上;若在训练集中不曾对该HPV型别建立LDA模型,则将所有HPV型别上曾建立的LDA模型应用于该HPV型别,取平均结果;
f’)输出每个测试集样本的判定结果。
第二方面,本发明提供一种如第一方面所述的分析方法进行HPV精确分型的生物信息学分析系统,包括:
接收模块:用于接收高通量测序技术得到的测序片段,得到每个样本的 reads序列;
聚类模块:与所述接收模块相连,用于将reads序列根据标签序列和引物序列进行分组聚类,得到聚类后的reads序列;
比对统计模块:与所述聚类模块相连,用于将聚类后的reads序列与HPV参考序列集进行比对和筛选,确定筛选后的reads序列的HPV型别或阴性,并进行统计;
HPV分型模块:用于对确定HPV型别的reads序列采用LDA模型进行HPV分型,最终确认每个reads序列的HPV型别或确定为阴性。
优选地,所述系统还包括预处理模块,所述预处理模块主要用于每个reads序列的过滤,除去不合格的序列,得到“干净”的reads序列。
在本发明中,所述“干净”的reads序列是满足以下条件之一的序列:
1)序列中“N”碱基的个数小于整条序列碱基个数的10%;
2)序列平均碱基质量值大于15;
3)没有接头污染的序列;
4)没有文库污染的序列。
与现有技术相比,本发明至少具有以下有益效果:
本发明提供的HPV精确分型的生物信息学分型方法及系统,克服了现有技术精确度差、灵敏度低、特异性差、假阴性率和假阳性率高的缺点,提供HPV型别的精准分型,为HPV普通筛查和临床实验提供精准的分型结果,为宫颈癌、口腔癌和前列腺癌等的预防提供有利保障。
附图说明
图1是本发明的HPV精确分型的生物信息学分析方法的流程示意图;
图2是性能评估ROC分析结果图。
具体实施方式
下面结合附图和实施例对本发明作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释本发明,而非对本发明的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本发明相关的部分而非全部结构。
所述HPV精确分型的生物信息学分析方法的整个流程图如图1所示,其包括测序、样本预处理、分组聚类、比对和统计、建立每个HPV分型的LDA模型和测试集的结果判定,具体如下:
实施例1测序
本实施例基于Miseq平台SE150的3331份有病理分析结果的样本(即已知这些样本属于阴性或HPV分型),对65种HPV型别进行验证性判定,包括16种主要型别(包含14种高危型别和2种低危型别)和49种次要型别。
在本实施例这,样本的编号已经过随机化处理。
利用高通量测序技术(NGS)进行测序,得到每个样本的reads序列。
实施例2样本预处理
对所有样本的reads序列进行过滤,除去不合格的序列:
a)预设不合格碱基的测序质量阈值;例如,测序平均质量低于20,则认为是不合格序列;
b)当样本的reads序列中碱基的测序质量低于所述测序质量阈值,具体的,序列的平均碱基质量值小于15时,将该样本的reads序列判定为不合格序列并加以过滤;否则,进入步骤c);
c)当样本的reads序列的测序结果中不确定的碱基个数超过整条序列碱基个数(例如Illumina GA测序结果中的N)的10%时,将该样本的reads序列判定为不合格序列并加以过滤;否则,进入步骤d);
d)当样本的reads序列的测序结果与接头序列库进行比对时,如果样本的reads序列中存在测序接头序列,则将该样本的reads序列判定为不合格序列并加以过滤;否则,进入步骤e);
e)当一个文库中出现标签序列污染时,即实验过程中并没有对标签1序列对应的孔上样,但是标签1出现了序列,则认为不合格序列并加以过滤;否则判定为合格的reads序列,进行后续步骤;
d)将预处理统计文件输出为StatRaw.txt。具体内容以单样本为例,如表2所示。
表2预处理统计文件
原始序列数 534036  
干净序列数及其占原始序列比率 499902 93.61
接头污染序列数及其占干净序列的比率 21459 4.29
文库污染序列数及其占干净序列的比率 12424 2.49
低质量的序列数及其占干净序列的比率 25 0.01
含N碱基序列数及其占干净序列的比率 226 0.05
实施例3分组聚类
a)提供标签序列和引物序列文件,具体的序列如表3所示;
表3-1标签序列
标签标识 标签序列 标签标识 标签序列
MGIP-001(SEQ No.1) TACGCTGTAC MGIP-086(SEQ No.86) CTACGATGTA
MGIP-002(SEQ No.2) TATGTGTACT MGIP-087(SEQ No.87) TATCGTCGTC
MGIP-003(SEQ No.3) TGACTCAGAC MGIP-088(SEQ No.88) TCATCGAGCT
MGIP-004(SEQ No.4) CTAGATGTCA MGIP-089(SEQ No.89) ACTATCGCTA
MGIP-005(SEQ No.5) GATGACTCTC MGIP-090(SEQ No.90) GCTACTGATG
MGIP-006(SEQ No.6) TGTAGTGAGT MGIP-091(SEQ No.91) AGCTCGATCA
MGIP-007(SEQ No.7) TCATCGTAGA MGIP-092(SEQ No.92) CACATATCGT
MGIP-008(SEQ No.8) TAGCATCTGT MGIP-093(SEQ No.93) ACGTCGTGAT
MGIP-009(SEQ No.9) CTATACGTGC MGIP-094(SEQ No.94) TACGATGATG
MGIP-010(SEQ No.10) CGACTGTAGA MGIP-095(SEQ No.95) GAGACTGACT
MGIP-011(SEQ No.11) GATGTCATGT MGIP-096(SEQ No.96) AGTGCTAGAT
MGIP-012(SEQ No.12) GTGTAGATAC MGIP-097(SEQ No.97) AGCTGCGTGT
MGIP-013(SEQ No.13) AGCTGACGAT MGIP-098(SEQ No.98) TGATACGCTC
MGIP-014(SEQ No.14) ATGATATAGT MGIP-099(SEQ No.99) TCTCGACTCA
MGIP-015(SEQ No.15) ATGTGCTCTA MGIP-100(SEQ No.100) CTAGAGATAT
MGIP-016(SEQ No.16) CATACGCTCA MGIP-101(SEQ No.101) ATAGACGCAT
MGIP-017(SEQ No.17) CTGATATCTA MGIP-102(SEQ No.102) ACGCACTCAC
MGIP-018(SEQ No.18) GCACTAGATG MGIP-103(SEQ No.103) ATCGTAGATC
MGIP-019(SEQ No.19) AGTACGCATG MGIP-104(SEQ No.104) AGTAGCTGTC
MGIP-020(SEQ No.20) TAGCTCATCT MGIP-105(SEQ No.105) CGATATACTG
MGIP-021(SEQ No.21) AGCATACACT MGIP-106(SEQ No.106) GCTCGATATA
MGIP-022(SEQ No.22) GCTATAGTCA MGIP-107(SEQ No.107) CAGAGTCATG
MGIP-023(SEQ No.23) CGTCTCATGC MGIP-108(SEQ No.108) AGTACGATGC
MGIP-024(SEQ No.24) ACGATGCTAT MGIP-109(SEQ No.109) GCTCTCACTG
MGIP-025(SEQ No.25) GAGTGTACTA MGIP-110(SEQ No.110) TAGCTCGCTG
MGIP-026(SEQ No.26) GTCATACGTG MGIP-141(SEQ No.111) GTGAGCTATC
MGIP-027(SEQ No.27) ATCTGAGTAC MGIP-142(SEQ No.112) CAGTCTGATA
MGIP-028(SEQ No.28) CGATAGCATC MGIP-143(SEQ No.113) TACATGCTCT
MGIP-029(SEQ No.29) ACTGATCTCA MGIP-144(SEQ No.114) TAGTCTCGCT
MGIP-030(SEQ No.30) CTCGATACTA MGIP-145(SEQ No.115) CGCTACGACT
MGIP-031(SEQ No.31) CATGTGACTG MGIP-146(SEQ No.116) TCGATCTGTA
MGIP-032(SEQ No.32) CGCATCACTA MGIP-147(SEQ No.117) ACAGCTATGT
MGIP-033(SEQ No.33) GCATATATCT MGIP-148(SEQ No.118) ATAGTCATGC
MGIP-034(SEQ No.34) CTGATGCGAC MGIP-149(SEQ No.119) AGACTCTCGT
MGIP-035(SEQ No.35) TCTCAGAGTC MGIP-150(SEQ No.120) TATGACGAGT
MGIP-036(SEQ No.36) CAGTGCGAGT MGIP-151(SEQ No.121) TGTGTCTAGA
MGIP-037(SEQ No.37) ATCTCTGATG MGIP-152(SEQ No.122) GAGATGTCTG
MGIP-038(SEQ No.38) CTGTCTGTGT MGIP-153(SEQ No.123) GCGTCATCGT
MGIP-039(SEQ No.39) ATGAGTCGTC MGIP-154(SEQ No.124) ATACAGAGTA
MGIP-040(SEQ No.40) GCATACTGAC MGIP-155(SEQ No.125) GTGCTCGTCA
MGIP-041(SEQ No.41) CTGCTCGCAT MGIP-156(SEQ No.126) GTCATCTGCT
MGIP-042(SEQ No.42) CTCTAGTGCT MGIP-157(SEQ No.127) TACTGACGTG
MGIP-043(SEQ No.43) CGTCGTGCTA MGIP-158(SEQ No.128) CTACACTATC
MGIP-044(SEQ No.44) CGACTACTAT MGIP-159(SEQ No.129) GCGTGCGATA
MGIP-045(SEQ No.45) GCACGTCGAT MGIP-160(SEQ No.130) TGACATGCGT
MGIP-046(SEQ No.46) GTAGTGCTCT MGIP-161(SEQ No.131) TGTCGCATAT
MGIP-047(SEQ No.47) CTGACGAGCT MGIP-162(SEQ No.132) ACACTGCTCA
MGIP-048(SEQ No.48) ACACGCACTA MGIP-163(SEQ No.133) ATACTGTGAC
MGIP-049(SEQ No.49) CTCGCACTAC MGIP-164(SEQ No.134) CTACGCATCA
MGIP-050(SEQ No.50) AGATCTCACT MGIP-165(SEQ No.135) ACGAGCTAGA
MGIP-051(SEQ No.51) ATACTAGTGT MGIP-166(SEQ No.136) GTCGATGAGA
MGIP-052(SEQ No.52) ATATCTCGTA MGIP-167(SEQ No.137) CGCTGTGATC
MGIP-053(SEQ No.53) TGACTGCGTA MGIP-168(SEQ No.138) TCGTCACTAT
MGIP-054(SEQ No.54) TGTAGACGTA MGIP-169(SEQ No.139) CTCTGTATGC
MGIP-055(SEQ No.55) AGAGACTATG MGIP-170(SEQ No.140) ACTATGAGCT
MGIP-056(SEQ No.56) CATGAGTAGA MGIP-171(SEQ No.141) CACTGCTCTC
MGIP-057(SEQ No.57) TGACAGCTAC MGIP-172(SEQ No.142) ACTGAGCATC
MGIP-058(SEQ No.58) CGCTAGACAT MGIP-173(SEQ No.143) TCTATGATAC
MGIP-059(SEQ No.59) CGTAGATATG MGIP-174(SEQ No.144) CTCACTATCA
MGIP-060(SEQ No.60) TGAGTCTGCT MGIP-175(SEQ No.145) TCGACGCACT
MGIP-061(SEQ No.61) TAGTCGTATG MGIP-176(SEQ No.146) TGACGATCTC
MGIP-062(SEQ No.62) CATACACGAC MGIP-177(SEQ No.147) ACGTATGCTC
MGIP-063(SEQ No.63) CGCTCAGAGA MGIP-178(SEQ No.148) CACGTACTCA
MGIP-064(SEQ No.64) GTGAGTCTCA MGIP-179(SEQ No.149) CGCACGTACT
MGIP-065(SEQ No.65) TGTACTACTA MGIP-180(SEQ No.150) AGTACACTAT
MGIP-066(SEQ No.66) GCTGTGCGAC MGIP-181(SEQ No.151) CTGCGACTGC
MGIP-067(SEQ No.67) TGAGATAGTC MGIP-182(SEQ No.152) CATACGACAT
MGIP-068(SEQ No.68) CGATGTATAT MGIP-183(SEQ No.153) TAGCTACGAC
MGIP-069(SEQ No.69) ATATGCTACT MGIP-184(SEQ No.154) ACTCGTGTCT
MGIP-070(SEQ No.70) CACTCGCTGT MGIP-185(SEQ No.155) CTGTGTCACT
MGIP-071(SEQ No.71) TGACGTGATG MGIP-186(SEQ No.156) TCATCTCATG
MGIP-072(SEQ No.72) ACATCATCAC MGIP-187(SEQ No.157) TACTACACTA
MGIP-073(SEQ No.73) CTACATAGAC MGIP-188(SEQ No.158) GTAGTACATA
MGIP-074(SEQ No.74) AGTCTACATA MGIP-189(SEQ No.159) GAGCTAGAGA
MGIP-075(SEQ No.75) AGTCACTGCT MGIP-190(SEQ No.160) TGTATAGTGC
MGIP-076(SEQ No.76) CATCACGCAC MGIP-191(SEQ No.161) CGTGTCGCTC
MGIP-077(SEQ No.77) AGCATGTGAT MGIP-192(SEQ No.162) ATCGCATCGT
MGIP-078(SEQ No.78) GCTATGTAGT MGIP-193(SEQ No.163) GCTGATGTAC
MGIP-079(SEQ No.79) AGACGTAGCT MGIP-194(SEQ No.164) TGCGACGTGC
MGIP-080(SEQ No.80) CAGACATAGA MGIP-195(SEQ No.165) ATCAGATCTC
MGIP-081(SEQ No.81) TGCGTCATCA MGIP-196(SEQ No.166) CGAGCTGTGC
MGIP-082(SEQ No.82) TACATAGCTC MGIP-197(SEQ No.167) ATATGTCTGT
MGIP-083(SEQ No.83) ATGTGAGAGA MGIP-198(SEQ No.168) TACGTATGTA
MGIP-084(SEQ No.84) CGTCGTCTGT MGIP-199(SEQ No.169) GACACTACTC
MGIP-085(SEQ No.85) CGTGTAGACT MGIP-200(SEQ No.170) CGATGACTCA
表3-2引物序列
引物标识 引物序列
MGPA(SEQ No.171) TTTGTTACTGTGGTGGATACTAC
MGPB(SEQ No.172) TTTGTTACCGTTGTTGATACTAC
MGPC(SEQNo.173) TTTGTTACTAAGGTAGATACCACTC
MGPD(SEQ No.174) TTTGTTACTGTTGTGGATACAAC
MGP31(SEQ No.175) TTTGTTACTATGGTAGATACCACAC
MGPG(SEQ No.176) GAAAAATAAACTGTAAATCATATTCCT
MGPH(SEQ No.177) GAAAAATAAATTGTAAATCATACTC
MGPI(SEQ No.178) GAAATATAAATTGTAAATCAAATTC
MGPJ(SEQ No.179) GAAAAATAAACTGTAAATCATATTC
MGP18(SEQ No.180) GAAAAATAAACTGCAAATCATATTC
GP5+(SEQ No.181) TTTGTTACTGTGGTAGATACTAC
MS3(SEQNo.182) AATATATGTGTGCTTATTTG
MS10(SEQ No.183) AGATTAGGGAAAGTATTAGA
MGP58(SEQ No.184) TTTGTTACTGTAGTTGATACCACTC
MGP52(SEQ No.185) TTTGTCACAGTTGTGGATACCACTC
b)按照标签序列和引物序列对预处理后的reads序列进行聚类;
c)截取每个reads序列中对应的标签序列和引物序列,标识标签序列和引物序列到每个reads序列的标识符中;
d)得到聚类之后的reads序列,并将聚类统计文件输出为StatEff.txt。
实施例4比对和统计
运用BWA(V0.6.2-r126)软件,把聚类之后的reads序列比对到HPV参考序列集上,得到比对后的统计结果文件StatMap.txt和Reads分布矩阵文件RDisMat.txt。StatMap.txt文件按照每个样本一行,每种型别一列进行统计每个样本(共3331份样本)中的总reads数、比对上的reads数、未比对上的reads数;在比对上的reads中,与HBB比对上的reads数和各HPV分型的reads数,如表4所示。
表4StatMap文件结果
Figure PCTCN2017100927-appb-000017
Figure PCTCN2017100927-appb-000018
实施例5建立每个HPV分型的LDA模型
在本实施例中,随机划分SAM文件中的3331份样本中的60%即1999份样本作为训练集,其余40%即1332份样本作为测试集。训练集用于建立各HPV分型的LDA模型和阈值C。
a)根据以往经验预设一个阈值C为0.5;
b)基于训练集样本,运用以下公式计算参数
Figure PCTCN2017100927-appb-000019
Figure PCTCN2017100927-appb-000020
Figure PCTCN2017100927-appb-000021
其中,N0为阴性样本量,N1为阳性样本量;
c)根据以下公式,计算出训练集中各HPV型别的分值:
Figure PCTCN2017100927-appb-000022
d)基于阈值C(0.5),针对训练集样本观测数据,将训练参数
Figure PCTCN2017100927-appb-000023
Figure PCTCN2017100927-appb-000024
带入到 分值计算公式中,得到训练集中每个样本总体的分值,将该分值与C(0.5)进行比较,若大于C(0.5)则判定为阳性,否则判定为阴性;
e)以病理分析结果为准,考察模型效果,计算模型的假阴性率与假阳性率。一般而言,两者均与阈值C有关。尝试不同的阈值C(即分别尝试C为0.4、0.42、0.46、0.48、0.52、0.54、0.58和0.6),找到最佳值(C=0.5)使得模型假阴性率不大于5%且假阴性率与假阳性率之和最小;
f)依次考虑训练集中的每个HPV分型:若某一样本中某一HPV型别的阳性样本的数量≥9(例如样本编号S007中,HPV45的样本数量为4052),则对该HPV型别建立LDA模型;忽略某一样本中某一HPV型别的阳性样本的数量<9的型别。
实施例6测试集的结果判定
a)将参数
Figure PCTCN2017100927-appb-000025
Figure PCTCN2017100927-appb-000026
带入到以下公式中,计算测试集中每个样本的分值,将分值与阈值C(0.5)进行比较;若大于C(0.5),则判定为阳性,并进入步骤b);否则判定为阴性;
Figure PCTCN2017100927-appb-000027
b)对测试集中的阳性样本进行分型:依次对每个HPV型别考虑:若在训练集中曾对该HPV型别建立LDA模型,则将该模型应用于测试集中的阳性样本上;若在训练集中不曾对该HPV型别建立LDA模型,则将所有HPV型别上曾建立的LDA模型应用于该HPV型别,取平均结果并将结果输出为HPV-GR.txt。
性能评估
受篇幅限制,展示20个样本的旧系统和实施例六(即HPV-AGM)的结果 如下表5所示。
表5
Figure PCTCN2017100927-appb-000028
性能评估主要将本发明的方法和系统与旧的基于NGS的HPV检测技术相比较,性能评估策略主要采用受试者工作特征曲线(ROC:receiver operating characteristic curve)分析,结果如图2所示。从图2可以看出,本发明的方法和系统(即HPV-AGM)的准确率达到99.7%,并且特异度和灵敏度均优于旧模 型。
注意,上述仅为本发明的较佳实施例及所运用技术原理。本领域技术人员会理解,本发明不限于这里所述的特定实施例,对本领域技术人员来说能够进行各种明显的变化、重新调整和替代而不会脱离本发明的保护范围。因此,虽然通过以上实施例对本发明进行了较为详细的说明,但是本发明不仅仅限于以上实施例,在不脱离本发明构思的情况下,还可以包括更多其他等效实施例,而本发明的范围由所附的权利要求范围决定。

Claims (10)

  1. 一种HPV精确分型的生物信息学分析方法,其特征在于,所述分析方法包括以下步骤:
    1)接收高通量测序技术得到的测序片段,得到reads序列;
    2)将reads序列进行分组聚类,得到每个样本的reads序列;
    3)将每个样本的reads序列与HPV参考序列集进行比对和筛选,确定筛选后的reads序列的匹配结果,并进行统计;
    4)对确定HPV型别的reads序列采用LDA模型进行HPV分型,最终确认每个reads序列的HPV型别。
  2. 根据权利要求1所述的分析方法,其特征在于,步骤2)所述分组聚类具体包括:
    e)将reads序列按照标签序列和引物序列进行聚类;
    f)截取每个reads序列中对应的标签序列和引物序列并进行标识,得到聚类后每个样本的reads序列。
  3. 根据权利要求1-2中任一项所述的分析方法,其特征在于,步骤3)所述HPV参考序列集包括HBB序列集和HPV型别序列集。
  4. 根据权利要求1-3中任一项所述的分析方法,其特征在于,步骤3)所述统计为将比对结果按照每个样本一行,每种型别一列进行统计,得到reads分布矩阵文件。
  5. 根据权利要求1-4中任一项所述的分析方法,其特征在于,步骤4)所述HPV分型包括以下步骤:
    g)根据步骤3)的reads分布矩阵文件判断每个样本总体为阴性或阳性,若为阴性,则输出结果;若为阳性,进入步骤h);
    h)逐个HPV型别判断阴性或阳性,即判断每个样本感染的HPV的型别。
  6. 根据权利要求5所述的分析方法,其特征在于,所述判断每个样本总体为阴性或阳性包括以下步骤:
    a’)预设阈值C,通过训练集计算和分析,调整预设的阈值C;优选地,预设的阈值C的范围为0.4~0.6,更优选为0.5;优选地,所述通过训练集计算和分析具体包括:基于训练集样本,运用下面公式计算出参数
    Figure PCTCN2017100927-appb-100001
    Figure PCTCN2017100927-appb-100002
    其中N0为阴性样本量,N1为阳性样本量:
    Figure PCTCN2017100927-appb-100003
    b’)针对训练集样本观测数据,将训练参数
    Figure PCTCN2017100927-appb-100004
    Figure PCTCN2017100927-appb-100005
    带入到分值计算公式中,得到训练集中每个样本总体的分值,所述分值计算公式为:
    Figure PCTCN2017100927-appb-100006
    c’)将分值与预设的阈值C进行比较,若大于C,则判定为阳性,否则判定为阴性;
    优选地,所述判断每个样本为阴性或阳性还具体包括:
    d’)将训练参数
    Figure PCTCN2017100927-appb-100007
    Figure PCTCN2017100927-appb-100008
    带入到分值计算公式中,得到测试集中每个样本总体的分值,将该分值与C进行比较,若大于C则判定为阳性,否则判定为阴性分值公式:
    Figure PCTCN2017100927-appb-100009
    e’)对测试集中的阳性样本进行分型:依次对每个HPV型别考虑,若在训练集中曾对该HPV型别建立LDA模型,则将该模型应用于测试集中的阳性样本上;若在训练集中不曾对该HPV型别建立LDA模型,则将所有HPV型别上曾建立的LDA模型应用于该HPV型别,取平均结果;
    f’)输出每个测试样本的判定结果。
  7. 根据权利要求5所述的分析方法,其特征在于,所述逐个HPV型别判断阴性或阳性包括:
    若该HPV型别的阳性样本数量≥9,则对该HPV型别建立LDA模型,以总比对上的reads数、HBB reads数和该HPV型别reads数为特征值,以该HPV型别的阴性样本总数量和阳性样本总数量为相应变量,判断每个reads序列样本为阴性或阳性;
    若该HPV型别的阳性样本数量小于9,则将其余所有具有≥9的阳性样本数量的HPV型别建立的LDA模型用于该HPV型别,取平均结果后,判断每个reads序列样本为阴性或阳性;
    优选地,所述总比对上的reads数、HBB reads数和该HPV型别reads数均使用相对值。
  8. 根据权利要求1-7任一项所述的分析方法,其特征在于,所述分析方法还包括预处理的步骤;
    优选地,所述预处理的步骤具体包括:对高通量测序技术得到的测序片段进行过滤,除去不合格的序列;
    优选地,所述过滤具体包括:
    a)预设不合格碱基的测序质量阈值和比例阈值;
    b)当reads序列中碱基的测序质量低于所述测序质量阈值,且低于测序质量阈值的碱基个数占整条序列碱基个数的比例超过所述比例阈值时,将该样本的reads序列判定为不合格序列并加以过滤;否则,进入步骤c);
    c)当reads序列的测序结果中不确定的碱基个数超过整条序列碱基个数的10%时,将该reads序列判定为不合格序列并加以过滤;否则,进入步骤d);
    d)当reads序列的测序结果与接头序列库进行比对时,如果reads序列中存在测序接头序列,则将该reads序列判定为不合格序列并加以过滤;否则,判定为合格的reads序列,进行步骤2)。
  9. 一种采用权利要求1-8中任一项所述的分析方法进行HPV精确分型的生物信息学分析系统,其特征在于,所述系统包括:
    接收模块:用于接收高通量测序技术得到的测序片段,得到每个样本的reads序列;
    聚类模块:与所述接收模块相连,用于将reads序列根据标签序列和引物序列进行分组聚类,得到聚类后的reads序列;
    比对统计模块:与所述聚类模块相连,用于将聚类后的reads序列与HPV参考序列集进行比对和筛选,确定筛选后的reads序列的HPV型别或阴性,并进行统计;
    HPV分型模块:与所述比对统计模块相连,用于对确定HPV型别的reads序列采用LDA模型进行HPV分型,最终确认每个reads序列归属的HPV型别或确定为阴性。
  10. 根据权利要求9所述的分析系统,其特征在于,所述系统还包括预处理模块,所述预处理模块用于每个reads序列的过滤,除去不合格的序列。
PCT/CN2017/100927 2017-09-07 2017-09-07 一种hpv精确分型的生物信息学分析方法及系统 WO2019047109A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/100927 WO2019047109A1 (zh) 2017-09-07 2017-09-07 一种hpv精确分型的生物信息学分析方法及系统
CN201780093704.XA CN111032885B (zh) 2017-09-07 2017-09-07 一种hpv精确分型的生物信息学分析方法及系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/100927 WO2019047109A1 (zh) 2017-09-07 2017-09-07 一种hpv精确分型的生物信息学分析方法及系统

Publications (1)

Publication Number Publication Date
WO2019047109A1 true WO2019047109A1 (zh) 2019-03-14

Family

ID=65633303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100927 WO2019047109A1 (zh) 2017-09-07 2017-09-07 一种hpv精确分型的生物信息学分析方法及系统

Country Status (2)

Country Link
CN (1) CN111032885B (zh)
WO (1) WO2019047109A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023164470A1 (en) * 2022-02-23 2023-08-31 The University Of North Carolina At Chapel Hill Methods of treatment for hpv malignancies

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102367487A (zh) * 2011-09-09 2012-03-07 中国医学科学院病原生物学研究所 一种高精确度检测人类乳头状瘤病毒基因型的方法
CN102884203A (zh) * 2010-02-26 2013-01-16 崇实大学校产学协力团 用于对查询序列的基因型与亚型进行分类的方法
CN103261442A (zh) * 2010-12-02 2013-08-21 深圳华大基因健康科技有限公司 Hpv 精确分型的生物信息学分析的方法及系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NZ603996A (en) * 2009-02-02 2014-03-28 Chromocell Corp Cells or cell lines that stably express a bitter taste receptor and methods of making them
CN101921874B (zh) * 2010-06-30 2013-09-11 深圳华大基因科技有限公司 基于Solexa测序法的检测人类乳头瘤病毒的方法
CN105378104A (zh) * 2013-03-15 2016-03-02 威拉赛特公司 用于样品分类的方法和组合物

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102884203A (zh) * 2010-02-26 2013-01-16 崇实大学校产学协力团 用于对查询序列的基因型与亚型进行分类的方法
CN103261442A (zh) * 2010-12-02 2013-08-21 深圳华大基因健康科技有限公司 Hpv 精确分型的生物信息学分析的方法及系统
CN102367487A (zh) * 2011-09-09 2012-03-07 中国医学科学院病原生物学研究所 一种高精确度检测人类乳头状瘤病毒基因型的方法

Also Published As

Publication number Publication date
CN111032885B (zh) 2024-05-17
CN111032885A (zh) 2020-04-17

Similar Documents

Publication Publication Date Title
Burger et al. HPV mRNA tests for the detection of cervical intraepithelial neoplasia: a systematic review
CN104603291B (zh) 黑素细胞病变中的分子恶性肿瘤
AU2018305609B2 (en) Enhancement of cancer screening using cell-free viral nucleic acids
Ouh et al. Prevalence of human papillomavirus genotypes and precancerous cervical lesions in a screening population in the Republic of Korea, 2014–2016
US20130122499A1 (en) System and method of detecting local copy number variation in dna samples
Knapp et al. Normalizing for individual cell population context in the analysis of high-content cellular screens
Karakitsos et al. A preliminary study of the potential of tree classifiers in triage of high-grade squamous intraepithelial lesions
CN113555121B (zh) 一种胃癌预后标志物的筛选和分类方法、胃癌预后标志物和检测胃癌预后的试剂及应用
Oner et al. Obtaining spatially resolved tumor purity maps using deep multiple instance learning in a pan-cancer study
CN111613324A (zh) 一种机器学习模型高通量分析乙型肝炎病毒基因组rt/s区序列特征预测肝癌风险的方法
WO2019047109A1 (zh) 一种hpv精确分型的生物信息学分析方法及系统
CN112575123B (zh) 引物组合、探针组合以及人乳头瘤病毒核酸检测试剂盒
WO2012071685A1 (zh) Hpv精确分型的生物信息学分析的方法及系统
CN116096920A (zh) 用于诊断呼吸道病原体和预测covid-19相关结果的方法
CN113151460A (zh) 一种识别肺腺癌肿瘤细胞的基因标志物及其应用
US7687232B2 (en) Method for estimating the risk of carcinoma development
WO2016176846A1 (zh) 检测染色体非整倍性的试剂盒、装置和方法
JPWO2018223066A5 (zh)
CN111621565B (zh) 弥漫性大b细胞淋巴瘤分子分型试剂盒及分型装置
CN113710818A (zh) 病毒相关联的癌症风险分层
CN111593140A (zh) 一种高危型人乳头瘤病毒的检测和分型试剂盒
CN109468381A (zh) 用于宫颈癌筛查的甲基化位点及其检测引物
CN107460234B (zh) 血清48-lncRNA作为肝脏慢性疾病诊断标记物的应用
WO2024062867A1 (ja) 対象のがん罹患の可能性を分析する方法
CN108841956B (zh) 长链非编码RNAs的应用

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17924663

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17924663

Country of ref document: EP

Kind code of ref document: A1