WO2023087527A1 - Sequencing adapter and sequencing analysis system thereof - Google Patents

Sequencing adapter and sequencing analysis system thereof

Info

Publication number
WO2023087527A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
sequence
internal index
sequences
internal
Prior art date
Application number
PCT/CN2022/071549
Other languages
English (en)
French (fr)
Inventor
欧阳川
王珺
周逸文
王江浩
刘紫丹
Original Assignee
杭州杰毅生物技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州杰毅生物技术有限公司
Publication of WO2023087527A1 (zh)

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/11 DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Definitions

  • The invention relates to the technical field of molecular biology, in particular nucleic acid sequencing, and specifically to a sequencing adapter and a sequencing analysis system thereof.
  • mNGS: metagenomic next-generation sequencing
  • a conventional TruSeq sequencing adapter is shown in Figure 1a.
  • The last base of the Read1 sequencing primer is T, so sequencing reads directly into the insert fragment and the T base is not detected.
  • After Read 1 sequencing is complete, the sequencing primer is switched to the index sequencing primer to obtain the index sequence.
  • Read 1 sequencing takes about 500 minutes and Index 1 sequencing about 50 minutes. The whole run therefore takes about 550 minutes (~9 hours) before the sequencer has all the sequences and can tell which sample each one belongs to.
  • The purpose of the present invention is to provide a sequencing adapter that allows multiple samples to be analyzed while sequencing is still in progress, with high sequencing quality and throughput maintained, which greatly shortens the turnaround time (TAT) and improves the timeliness of detection.
  • TAT: turnaround time
  • 1. Sequencing time is long, accounting for 50% of the overall TAT.
  • 2. Analysis takes one hour and can start only after all sequencing has finished, that is, at least 14 hours in, because the data can be split and aligned only once the index sequence (Index) of each sample has been obtained.
  • A sequencing adapter in the form of a partially complementary Y-shaped structure: one strand comprises, from 5' to 3', the internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe.
  • The internal Index sequence regions of the two strands are fully complementary to each other, and the sequencing primer binding region sequences are partially complementary.
  • An Index2 sequence can also be added between the P5 sequence and the sequence of the binding region of the Read1 sequencing primer.
  • Because the new adapter adds the internal Index sequence downstream of the Read1 sequencing primer binding region, sequencing reaches a fixed position (the T-A junction) at which every read gives a T base.
  • As Figure 3 shows, when only adapters with an 8 bp internal Index are used, the per-cycle base composition shows a very high proportion of T in the ninth cycle. The fluorescence signal of that single base becomes too strong while the other bases give no signal, the balance among A/T/C/G is broken, and the sequencer has more difficulty calling the base. The analysis software judges the sequencing quality at this position to be poor, so a large proportion of reads fail quality control, which greatly reduces effective data output.
  • For a next-generation sequencer, the first few bases are particularly important because they are used to locate the clusters. If the whole chip shows the same base in any one of the first ten cycles, sequencing quality and yield drop sharply.
  • To solve this, the invention further optimizes the internal Index sequences used to distinguish samples: internal Index sequences of two to four or more lengths are designed. Adjacent lengths may differ by one, two or more bases, but to save sequencing cost and reduce the time spent sequencing the internal Index, a difference of one base is preferred.
  • After combination, all internal Index sequences in use should give a roughly balanced base composition at every Index position (cycle), which improves the sequencing quality of the first 10 bases as far as possible.
  • The optimal choice is a combination of three internal Index lengths, 6 bp, 7 bp and 8 bp, with adapters of each length making up about one third of the total; or a combination of four lengths, 6 bp, 7 bp, 8 bp and 9 bp, with adapters of each length making up about one quarter of the total.
  • For example, with two lengths, one internal Index is 6 bases long and the other 7 bases long, and the two kinds of samples are mixed 50% each. At the seventh base, 50% of reads give T (the T-A junction after the 6-base Index) and the other 50% give the seventh base of the 7-base Index (which must not be designed as T). At the eighth base, 50% of the signal is again T (the T-A junction after the 7-base Index). From the ninth base onward, all reads are in the insert. With three or four Index lengths, the base composition is spread more evenly across cycles.
  • A combination of three lengths: internal Index sequences of 6, 7 and 8 bases, mixed at 1/3 each.
  • A combination of four lengths: internal Index sequences of 6, 7, 8 and 9 bases, mixed at 1/4 each.
  • Adjacent internal Index lengths may differ by one, two or more bases, preferably one base, as in the combination of 6, 7 and 8 bases.
  • all internal Index sequences used are combined to achieve a basic balance of the base ratios of the internal Index sequences in each round of sequencing cycles.
  • When four or more libraries (Indexes) are used in one run, keeping the proportion of each of the four bases A, T, C and G among the internal Index sequences in each sequencing cycle between 8% and 50% is appropriate, and between 12.5% and 37.5% is optimal.
  • All internal Index sequences in use should also satisfy: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG". In general, the longer the Index, the more distinct Indexes the four bases A, T, C and G can form. To design enough Indexes for multi-sample sequencing while keeping the minimum Hamming distance between any two at 3 or more, an internal Index length of 6 bases or more is suitable.
  • The internal Index can be read after only a few cycles to distinguish each sample, so analysis of a specific sample's sequences need not wait until all sequencing (9-10 hours) is complete.
  • As the number of cycles grows the reads get longer, so the invention can analyze in real time and obtain alignment and analysis results for sequences of different lengths as sequencing progresses.
  • Another object of the present invention is to provide a new sequencing analysis system (see Figure 2b) for the new adapter structure described above, which analyzes while sequencing and obtains sequence alignment and analysis results in real time.
  • the system has the advantages of real-time cycle analysis, short analysis time and high accuracy.
  • The sequencing analysis system of the present invention comprises:
  • Sequencing monitoring module: monitors sequencing progress in real time and triggers analysis tasks.
  • The sequencing monitoring module periodically scans the sequencing directory to track progress. When sequencing reaches a sufficient length (minimum 22 bp), the monitoring program sends a signal that triggers the downstream analysis steps. The lengthening reads continue to be analyzed in real time as sequencing progresses, and the next round of analysis can start as soon as the previous round finishes.
  • Data generation module: converts the BCL files generated by sequencing into fastq files and filters low-quality sequences.
  • It also uses a dedicated analysis program for the specially designed adapters to split the sequence data into the corresponding samples.
  • The data generation module converts the BCL files generated by sequencing into fastq files and performs quality control on the sequencing data, removing low-quality data and adapter-containing sequences so that the data entering downstream analysis is of reliable quality.
  • The specially designed adapters used during sequencing both distinguish samples and suit the ultra-fast analysis process; a dedicated analysis program splits the sequence data into the corresponding samples.
  • Data filtering module: removes human sequences from the sequences that passed quality control.
  • The data filtering module aligns the sequences that passed quality control against a human genome database with fast alignment software and removes the reads that align. The unaligned reads are output as non-human data.
  • Data analysis module: aligns non-human sequences to a pathogenic microorganism genome database.
  • The data analysis module aligns the non-human data against the pathogenic microorganism genome database to obtain microbial alignment results. For a read with multiple alignments, the system selects the alignments whose scores fall within the score interval [L, U] and computes the lowest common ancestor (LCA) of the taxa to which those reference sequences belong, which becomes the final alignment result for the read.
  • Report generation module: summarizes the alignment results and outputs the analysis report.
  • The report generation module counts the reads detected for each taxon from the alignment results. For a taxon that has child nodes, it counts both the reads assigned to the taxon itself and the reads assigned to the taxon together with all its child nodes, and it counts the uniquely aligned and completely aligned reads for each taxon.
  • the present invention has the following advantages:
  • The internal Index lies between the sequencing primer and the insert fragment. In ultra-fast analysis the Index is read first, so sequences from different samples can be separated early in the run without waiting for sequencing to finish.
  • The Index uses at least two different lengths (preferably three: 6/7/8 bp). Indexes of different lengths avoid the situation, seen with a single length, in which every read gives T in the same cycle and sequencing quality drops.
  • The analysis software begins to analyze pathogen information once the sequencer has produced reads 22 bases long and follows up with analysis every cycle, achieving real-time NGS analysis.
  • Figure 5 is the detailed sequence structure of the sequencing adapter used in Example 1;
  • Figure 6 compares the base composition at each cycle when sequencing with the sequencing adapter of the present invention and with the traditional Illumina TruSeq adapter in Example 1;
  • The sequencing adapter of this example has a partially complementary Y-shaped structure.
  • One strand comprises, from 5' to 3': the internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe. The other strand comprises, from 5' to 3': the P5 sequence that binds the chip probe, the Read1 sequencing primer binding region sequence, the internal Index sequence, and a T-base overhang.
  • The internal Index sequence regions of the two strands are fully complementary, and the sequencing primer binding region sequences are partially complementary. The structure is shown in Figure 5; internal Index sequences of three lengths, 6 bp, 7 bp and 8 bp, are used.
  • A total of 48 internal Index sequences were designed. They are divided into 16 groups, each containing one 6 bp, one 7 bp and one 8 bp internal Index sequence.
  • The internal Index sequences meet the following requirements: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG"; (4) the seventh base of a 7 bp Index and the seventh base of an 8 bp Index are not T, and the eighth base of an 8 bp Index is not T; (5) the base composition at each sequencing position of the Indexes in a combination is adjusted manually to reach a relative balance.
  • Using the optimized internal Index adapters maintains a high percentage of clusters passing filter and a high Q30 score, and these quality metrics show no significant difference from the TruSeq adapter data.
  • The analysis results show that in the first report of the ultra-fast analysis, at a read length of 22 bp, the system already detects the positive pathogenic bacteria sensitively. As sequencing progresses, the number of pathogen reads detected rises slowly and stabilizes after several cycles. For samples positive for pathogen infection, the system can therefore detect the pathogen at a very early stage and give reliable analysis results.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Zoology (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Wood Science & Technology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Microbiology (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Immunology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Plant Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

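A sequencing adapter having a partially complementary Y-shaped structure. One strand comprises, from 5' to 3': an internal Index sequence, an Index1 sequencing primer binding region sequence, an Index1 sequence, and a P7 sequence that binds the chip probe. The other strand comprises, from 5' to 3': a P5 sequence that binds the chip probe, a Read1 sequencing primer binding region sequence, the internal Index sequence, and a T-base overhang. The adapter uses internal Index sequences to distinguish samples, so multiple samples can be analyzed while sequencing is in progress with high sequencing quality and throughput maintained, which shortens the turnaround time and improves the timeliness of detection.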

Description

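Sequencing adapter and sequencing analysis system thereof
Technical Field
The invention relates to the technical field of molecular biology, in particular nucleic acid sequencing, and specifically to a sequencing adapter and a sequencing analysis system thereof.
Background
Since the New England Journal of Medicine published, in 2014, the first clinical case in which metagenomic next-generation sequencing (mNGS) confirmed a diagnosis of leptospirosis, mNGS has made considerable progress in identifying emerging pathogens and diagnosing rare but important pathogens, and its use in acute and critical infections has gained clinical acceptance. In pathogen mNGS, nucleic acid is extracted from a sample taken at the suspected site of infection, and the nucleic acid fragments are ligated to DNA adapters that can hybridize to the sequencing chip. The adapters carry a tag sequence (Index) that distinguishes samples. After sequencing on a high-throughput sequencer, the reads are compared against a database of pathogens, which quickly pins down the pathogen. Distinguishing the Index tags also allows multiple samples to be sequenced in parallel in a single run, making full use of sequencing throughput and lowering cost.
A conventional TruSeq sequencing adapter is shown in Figure 1a. The end of the adapter has a T-base overhang that pairs with the A-base overhang added to the ends of the target fragments from the sample, allowing T-A ligation. The last base of the Read1 sequencing primer is T, so sequencing reads directly into the insert and the T base is not sequenced. After Read 1 is complete, the sequencing primer is switched to the Index sequencing primer to read the Index. In typical pathogen sequencing, Read 1 takes about 500 minutes and Index 1 about 50 minutes. The whole run therefore takes about 550 minutes (~9 hours) before the sequencer has all the sequences and can tell which sample each one belongs to.
In summary, adding library preparation (4 hours) to sequencing (9-10 hours), about 14 hours pass from the start of sample preparation until analysis of each sample can begin. On a sequencer with throughput similar to the Illumina NextSeq, each run yields about 20 G of data, which takes about one hour to analyze. It therefore takes at least 15 hours to go from the original sample to a result; the approximate time taken by each step is shown in Figure 1b. This timeliness is poor and urgently needs improvement.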
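Summary of the Invention
The purpose of the present invention is to provide a sequencing adapter that allows multiple samples to be analyzed while sequencing is still in progress, with high sequencing quality and throughput maintained, which greatly shortens the turnaround time (TAT) and improves the timeliness of detection.
Analysis of the detection workflow with existing sequencing adapters shows that two time bottlenecks must be addressed to improve timeliness: 1. sequencing time is long, accounting for 50% of the overall TAT; 2. analysis takes one hour and can start only after all sequencing has finished, that is, at least 14 hours in, because the data can be split and aligned only once the Index of each sample has been obtained.
To achieve the above purpose, the present invention adopts the following technical solution:
A sequencing adapter (shown in Figure 2a) having a partially complementary Y-shaped structure. One strand comprises, from 5' to 3': an internal Index sequence, an Index1 sequencing primer binding region sequence, an Index1 sequence, and a P7 sequence that binds the chip probe. The other strand comprises, from 5' to 3': a P5 sequence that binds the chip probe, a Read1 sequencing primer binding region sequence, the internal Index sequence, and a T-base overhang. The internal Index sequence regions of the two strands are fully complementary, and the sequencing primer binding region sequences are partially complementary. An Index2 sequence may also be added between the P5 sequence and the Read1 sequencing primer binding region sequence.
Because the new adapter adds the internal Index sequence downstream of the Read1 sequencing primer binding region, sequencing reaches a fixed position (the T-A junction) at which every read gives a T base. As Figure 3 shows, when only adapters with an 8 bp internal Index are used, the per-cycle base composition shows a very high proportion of T in the ninth cycle. The fluorescence signal of that single base becomes too strong while the other bases give no signal, the balance among the four bases A/T/C/G is broken, and the sequencer has more difficulty calling the base. The analysis software judges the sequencing quality at this position to be poor, so a large proportion of reads fail quality control, which greatly reduces effective data output. For a next-generation sequencer, the first few bases are particularly important because they are used to locate the clusters, so if the whole chip shows the same base in any one of the first ten cycles, sequencing quality and yield drop sharply.
To solve this problem, the invention further optimizes the internal Index sequences used to distinguish samples: internal Index sequences of two to four or more lengths are designed. Adjacent lengths may differ by one, two or more bases, but to save sequencing cost and reduce the time spent sequencing the internal Index, a difference of one base is preferred. In use, adapters with internal Index sequences of different lengths must be combined, so that the T bases of the T-A junction do not all fall in the same sequencing cycle. After combination, all internal Index sequences in use should give a roughly balanced base composition at every position (cycle), which improves the sequencing quality of the first 10 bases as far as possible.
Figure 4 shows the base composition at each cycle when adapters with internal Index sequences of three lengths, 6 bp, 7 bp and 8 bp, are used. Mixing adapters with different Index lengths staggers the cycles in which the T base appears. Figure 4 shows three cycles with a slightly elevated proportion of T rather than a single cycle in which all the T bases are concentrated, and this optimization yields high-quality sequencing results.
A combination of at least two, up to four or more, internal Index lengths is recommended for labelling and sequencing multiple samples, and adapters of the different internal Index lengths should be used in balanced proportions. To save sequencing cost and reduce the time spent sequencing the internal Index, the optimal choice is a combination of three internal Index lengths, 6 bp, 7 bp and 8 bp, with adapters of each length making up about one third of the total; or a combination of four lengths, 6 bp, 7 bp, 8 bp and 9 bp, with adapters of each length making up about one quarter of the total. For example, with two lengths, one internal Index is 6 bases long and the other 7 bases long, and the two kinds of samples are mixed 50% each. At the seventh base, 50% of reads give T (the T-A junction after the 6-base Index) and the other 50% give the seventh base of the 7-base Index (which must not be designed as T). At the eighth base, 50% of the signal is again T (the T-A junction after the 7-base Index). From the ninth base onward, all reads are in the insert. With three or four internal Index lengths, the base composition is spread more evenly across cycles: for example, three lengths of 6, 7 and 8 bases mixed at 1/3 each, or four lengths of 6, 7, 8 and 9 bases mixed at 1/4 each.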
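The junction arithmetic above can be reproduced with a minimal sketch. It assumes only what the text states: a read starts with the internal Index, so the fixed junction T lands in cycle (Index length + 1). The function name `junction_t_fraction` and the integer "shares" are illustrative and do not come from the patent.

```python
from fractions import Fraction


def junction_t_fraction(length_mix, n_cycles=10):
    """Per cycle, the fraction of clusters that read the fixed junction T.

    length_mix maps internal Index length (bp) -> integer share of adapters.
    The Index occupies cycles 1..length, so the T is read at cycle length + 1.
    """
    total = sum(length_mix.values())
    fractions = {cycle: Fraction(0) for cycle in range(1, n_cycles + 1)}
    for length, share in length_mix.items():
        cycle = length + 1
        if cycle <= n_cycles:
            fractions[cycle] += Fraction(share, total)
    return fractions


single = junction_t_fraction({8: 1})            # only 8 bp Indexes
two = junction_t_fraction({6: 1, 7: 1})         # 6 bp and 7 bp, 50% each
three = junction_t_fraction({6: 1, 7: 1, 8: 1})  # 6/7/8 bp, 1/3 each

for name, result in [("8 bp only", single), ("6+7 bp", two), ("6+7+8 bp", three)]:
    spikes = {c: str(f) for c, f in result.items() if f > 0}
    print(name, spikes)
```

With a single 8 bp length, every cluster reads the junction T in cycle 9. With the 6/7/8 bp mix, the forced T is split into thirds across cycles 7, 8 and 9. The other bases in those cycles come from the remaining Indexes or from inserts, which this sketch does not model.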
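Adjacent internal Index lengths may differ by one, two or more bases, preferably one base, as in the combination of 6, 7 and 8 bases.
As a further preference of the invention, all internal Index sequences in use, once combined, give a roughly balanced base composition of the internal Index sequences in each sequencing cycle. In general, when the number of libraries (or Indexes used) in one run is 4 or more, keeping the proportion of each of the four bases A, T, C and G among the internal Index sequences in each sequencing cycle between 8% and 50% is appropriate, and between 12.5% and 37.5% is optimal.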
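The balance criterion can be checked mechanically. The sketch below is one possible reading, not the patent's procedure: at each Index position it counts bases only among the Indexes long enough to have a base there, and it tests every base against the 8%-50% window (or any other window passed in). The Index sequences are made-up examples, not the 48 sequences of Example 1.

```python
from collections import Counter


def is_balanced(indexes, low=0.08, high=0.50):
    """True if A, C, G and T each fall within [low, high] at every Index position."""
    for pos in range(max(len(idx) for idx in indexes)):
        # Only Indexes that actually have a base at this position are counted.
        column = [idx[pos] for idx in indexes if len(idx) > pos]
        counts = Counter(column)
        for base in "ACGT":
            fraction = counts[base] / len(column)
            if not (low <= fraction <= high):
                return False
    return True


# Hypothetical pools: each position of `balanced` holds A, C, G and T once.
balanced = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
skewed = ["AAGTAC", "ACGTAC", "AGTACG", "ATACGT"]  # position 1 is always A

print(is_balanced(balanced))  # every base is 25% at every position
print(is_balanced(skewed))    # A is 100% at position 1
print(is_balanced(balanced, low=0.125, high=0.375))  # the stricter "optimal" window
```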
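In addition to the above, all internal Index sequences in use should satisfy: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG". In general, the longer the Index, the more distinct Indexes the four bases A, T, C and G can form. To design enough Indexes for multi-sample sequencing while keeping the minimum Hamming distance between any two Index sequences at 3 or more, an internal Index length of 6 bases or more is suitable.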
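A small validator makes the three rules concrete. Two points are assumptions of this sketch rather than statements from the patent: Indexes of unequal length are compared over their shared prefix only (the text does not say how Hamming distance is taken across lengths), and a run of exactly three identical bases is treated as excluded. All sequences and names are hypothetical.

```python
from itertools import combinations


def prefix_hamming(a, b):
    """Mismatches over the overlapping prefix (assumed handling of unequal lengths)."""
    return sum(x != y for x, y in zip(a, b))


def has_run(seq, run=3):
    """True if seq contains `run` or more identical consecutive bases."""
    return any(seq[i:i + run] == seq[i] * run for i in range(len(seq) - run + 1))


def check_pool(indexes, min_dist=3, run=3):
    """Return a list of (offender, reason) tuples; empty means the pool passes."""
    problems = []
    for idx in indexes:
        if idx.startswith("GG"):
            problems.append((idx, "starts with GG"))
        if has_run(idx, run):
            problems.append((idx, "homopolymer run"))
    for a, b in combinations(indexes, 2):
        if prefix_hamming(a, b) < min_dist:
            problems.append(((a, b), "Hamming distance below minimum"))
    return problems


good_pool = ["ACGTAC", "CGTACGA", "GTACGTCA"]  # hypothetical 6/7/8 bp Indexes
bad_pool = ["GGATCA", "ACTTTG", "ACTTAG"]      # one violation of each rule

print(check_pool(good_pool))
for offender, reason in check_pool(bad_pool):
    print(offender, "->", reason)
```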
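Because the way sequencing reads are generated has changed, the internal Index can be read within a few cycles of the start of sequencing to distinguish each sample, so analysis of a specific sample's sequences need not wait until all sequencing (9-10 hours) is complete. In addition, the more cycles are run, the longer the reads become, so the invention can analyze in real time and obtain alignment and analysis results for sequences of different lengths as sequencing progresses.
Another object of the present invention is to provide a new sequencing analysis system (see Figure 2b) for the new adapter structure described above, which analyzes while sequencing and obtains sequence alignment and analysis results in real time. The system offers real-time cyclic analysis, short analysis time and high accuracy.
The sequencing analysis system of the present invention comprises:
1. Sequencing monitoring module: monitors sequencing progress in real time and triggers analysis tasks.
The sequencing monitoring module periodically scans the sequencing directory to track progress. When sequencing reaches a sufficient length (minimum 22 bp), the monitoring program sends a signal that triggers the downstream analysis steps. The lengthening reads continue to be analyzed in real time as sequencing progresses, and the next round of analysis can start as soon as the previous round finishes.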
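The monitoring logic might look roughly like the sketch below. Everything instrument-specific here is an assumption: the patent does not describe the directory layout, so the sketch supposes one sub-directory per cycle named `C<cycle>.1` (a layout used by some Illumina instruments, not all), and it treats the 22 bp minimum as a cycle threshold without deciding whether the Index cycles count toward it. The function names are hypothetical. Note also that the presence of a cycle directory does not by itself prove that cycle's files are fully written.

```python
import re
import tempfile
from pathlib import Path

CYCLE_DIR = re.compile(r"^C(\d+)\.1$")  # assumed naming, e.g. "C22.1"


def completed_cycles(lane_dir):
    """Highest cycle number that has a cycle directory under lane_dir (0 if none)."""
    cycles = []
    for entry in Path(lane_dir).iterdir():
        match = CYCLE_DIR.match(entry.name)
        if entry.is_dir() and match:
            cycles.append(int(match.group(1)))
    return max(cycles, default=0)


def next_trigger(cycles_done, last_analyzed, min_cycles=22):
    """Cycle count to analyze now, or None if there is nothing new to do."""
    if cycles_done >= min_cycles and cycles_done > last_analyzed:
        return cycles_done
    return None


# Demo on a throwaway directory standing in for a lane folder.
with tempfile.TemporaryDirectory() as run_dir:
    empty_count = completed_cycles(run_dir)
    for cycle in range(1, 26):
        (Path(run_dir) / f"C{cycle}.1").mkdir()
    seen = completed_cycles(run_dir)

print(empty_count, seen, next_trigger(seen, last_analyzed=0))
```

In a real deployment, `completed_cycles` would be polled on a timer and `last_analyzed` updated after each round, so that a new round starts as soon as the previous one finishes and more cycles are available.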
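2. Data generation module: converts the BCL files generated by sequencing into fastq files and filters low-quality sequences; it also uses a dedicated analysis program for the specially designed adapters to split the sequence data into the corresponding samples.
The data generation module converts the BCL files generated by sequencing into fastq files and performs quality control on the sequencing data, removing low-quality data and adapter-containing sequences so that the data entering downstream analysis is of reliable quality. The specially designed adapters used during sequencing both distinguish samples and suit the ultra-fast analysis process, and a dedicated analysis program splits the sequence data into the corresponding samples.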
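The patent does not disclose its splitting program, so the following is only a sketch of how variable-length internal Indexes could be resolved. It assumes exact matching (no mismatch tolerance) and relies on the read layout described earlier: internal Index, then the junction T, then the insert. Index sequences and sample names are hypothetical.

```python
def assign_read(read, index_to_sample, junction="T"):
    """Return (sample, insert) for a read, or (None, None) if no Index fits.

    A read is accepted for an Index only if it starts with that Index and the
    next base is the junction T; the Index and the T are then trimmed off.
    """
    for index, sample in index_to_sample.items():
        n = len(index)
        if read.startswith(index) and read[n:n + 1] == junction:
            return sample, read[n + 1:]
    return None, None


# Hypothetical 6/7/8 bp Indexes, one per sample.
index_to_sample = {"ACGTAC": "S1", "CGTACGA": "S2", "GTACGTCA": "S3"}

print(assign_read("ACGTACTGGCATTAGC", index_to_sample))   # 6 bp Index + T + insert
print(assign_read("CGTACGATAAC", index_to_sample))        # 7 bp Index + T + insert
print(assign_read("GTACGTCATCCGG", index_to_sample))      # 8 bp Index + T + insert
print(assign_read("TTTTTTTTTT", index_to_sample))         # no Index matches
```

Requiring the junction T after the Index is also what makes design rule (4) of Example 1 useful: if the 7th base of a longer Index is never T, a 6 bp Index followed by its junction T cannot be mistaken for the start of a longer Index.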
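3. Data filtering module: removes human sequences from the sequences that passed quality control.
The data filtering module aligns the sequences that passed quality control against a human genome database with fast alignment software and removes the reads that align. The unaligned reads are output as non-human data with the human sequences removed.
4. Data analysis module: aligns non-human sequences to a pathogenic microorganism genome database.
The data analysis module aligns the non-human data against the pathogenic microorganism genome database to obtain microbial alignment results. For a read with multiple alignments, the system selects the alignments whose scores fall within the score interval [L, U] and computes the lowest common ancestor (LCA) of the taxa to which those reference sequences belong, which becomes the final alignment result for the read. The score interval is determined as U = min(S_max, s_max) and L = max(U - R, S_min), where S_max is the theoretical maximum alignment score, S_min is the theoretical minimum alignment score, s_max is the highest score among the read's alignments, and R is the interval width parameter, with a default value of 20. While analyzing the alignment results, the system also records for each read whether the species it aligned to is unique and whether the alignment is complete.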
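The interval rule and the LCA step can be illustrated on a toy taxonomy. This is a sketch under stated assumptions: the taxonomy is a simple child-to-parent dictionary with intermediate ranks omitted, the scores and the theoretical bounds (60 and 0) are invented for the example, and the function names are hypothetical. Only the formulas U = min(S_max, s_max) and L = max(U - R, S_min) and the default R = 20 come from the text.

```python
def score_interval(s_max_read, s_theory_max, s_theory_min, r=20):
    """Return (L, U) with U = min(S_max, s_max) and L = max(U - R, S_min)."""
    upper = min(s_theory_max, s_max_read)
    lower = max(upper - r, s_theory_min)
    return lower, upper


def lineage(taxon, parent):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    return next(node for node in paths[0] if node in common)


def classify(hits, parent, s_theory_max, s_theory_min, r=20):
    """hits is a list of (taxon, score); returns the taxon assigned to the read."""
    best = max(score for _, score in hits)
    lower, upper = score_interval(best, s_theory_max, s_theory_min, r)
    kept = [taxon for taxon, score in hits if lower <= score <= upper]
    return lowest_common_ancestor(kept, parent)


# Toy taxonomy (child -> parent).
parent = {
    "Legionella pneumophila": "Legionella",
    "Legionella longbeachae": "Legionella",
    "Legionella": "Legionellaceae",
    "Legionellaceae": "Bacteria",
    "Citrobacter koseri": "Citrobacter",
    "Citrobacter": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
}
hits = [("Legionella pneumophila", 60), ("Legionella longbeachae", 50),
        ("Citrobacter koseri", 30)]

print(score_interval(60, 60, 0))           # interval with the default R = 20
print(classify(hits, parent, 60, 0))       # only the two Legionella hits are kept
print(classify(hits, parent, 60, 0, r=40))  # a wider interval keeps all three hits
```

A narrow interval keeps only near-best alignments and so yields a specific assignment; widening R pulls in weaker alignments and pushes the assignment toward the root.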
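5. Report generation module: summarizes the alignment results and outputs the analysis report.
The report generation module counts the reads detected for each taxon from the alignment results. For a taxon that has child nodes, it counts both the reads assigned to the taxon itself and the reads assigned to the taxon together with all its child nodes, and it counts the uniquely aligned and completely aligned reads for each taxon.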
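The two kinds of counts, reads assigned directly to a taxon and reads assigned to the taxon plus its descendants, can be sketched as follows. The child-to-parent dictionary, the read assignments and the function name are illustrative assumptions; the unique and complete alignment tallies mentioned in the text are left out for brevity.

```python
from collections import Counter


def taxon_counts(assignments, parent):
    """Return (direct, cumulative) read counts per taxon.

    direct[t]     = reads assigned exactly to t
    cumulative[t] = reads assigned to t or to any node below t
    """
    direct = Counter(assignments)
    cumulative = Counter()
    for taxon, n in direct.items():
        node = taxon
        cumulative[node] += n
        while node in parent:  # propagate the count to every ancestor
            node = parent[node]
            cumulative[node] += n
    return direct, cumulative


# Toy taxonomy (child -> parent) and per-read final assignments.
parent = {
    "Legionella pneumophila": "Legionella",
    "Legionella": "Bacteria",
    "Citrobacter koseri": "Bacteria",
}
assignments = (["Legionella pneumophila"] * 3
               + ["Legionella"] * 2
               + ["Citrobacter koseri"])

direct, cumulative = taxon_counts(assignments, parent)
for taxon in ["Legionella pneumophila", "Legionella", "Citrobacter koseri", "Bacteria"]:
    print(taxon, direct[taxon], cumulative[taxon])
```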
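By implementing the above technical solution, the present invention has the following advantages over prior-art nucleic acid sequencing:
1. The internal Index lies between the sequencing primer and the insert. In ultra-fast analysis the Index is read first, so sequences from different samples can be separated early in the run without waiting for sequencing to finish.
2. The Index uses at least two different lengths (preferably three: 6/7/8 bp). Indexes of different lengths avoid the situation, seen with a single length, in which every read gives T in the same cycle and sequencing quality drops.
3. The bases at each position of the Index are required to be evenly distributed.
4. The analysis software begins to analyze pathogen information once the sequencer has produced reads 22 bases long and follows up with analysis every cycle, achieving real-time NGS analysis.
5. Combining Index adapters of different lengths with real-time analysis shortens the wait: instead of at least 11 hours after loading the sequencer before results are known, the basic microbial content of the sample is known about 5 hours after sequencing starts, achieving ultra-fast NGS analysis.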
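Brief Description of the Drawings
Figure 1a is a schematic of a conventional prior-art sequencing adapter structure and sequencing workflow;
Figure 1b shows the time taken by each step with a conventional prior-art sequencing adapter system;
Figure 2a is a schematic of the sequencing adapter structure and sequencing workflow of the present invention;
Figure 2b is a flow diagram of the sequencing analysis system that uses the sequencing adapter of the present invention;
Figure 3 shows the base composition at each cycle when sequencing with only 8 bp internal Index adapters;
Figure 4 shows the base composition at each cycle when sequencing with internal Index adapters of three lengths, 6 bp, 7 bp and 8 bp;
Figure 5 is the detailed sequence structure of the sequencing adapter used in Example 1;
Figure 6 compares the base composition at each cycle when sequencing with the sequencing adapter of the present invention and with the traditional Illumina TruSeq adapter in Example 1;
Figure 7 compares sequencing quality and final library data yield when sequencing with the sequencing adapter of the present invention and with the traditional Illumina TruSeq adapter in Example 1;
Figure 8 shows the number of Legionella pneumophila reads detected at each analysis cycle in Example 2;
Figure 9 shows the number of Citrobacter koseri reads detected at each analysis cycle in Example 2.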
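Detailed Description
It should be noted that the following examples only illustrate the technical solution of the present invention and do not limit it. Although the invention is described in detail with reference to these examples, a person of ordinary skill in the art will understand that the technical solutions they describe may still be modified, or some of their technical features replaced with equivalents, without such modifications or replacements taking the essence of the corresponding technical solutions outside the scope of the technical solutions of the examples.
Example 1
The sequencing adapter of this example has a partially complementary Y-shaped structure. One strand comprises, from 5' to 3': the internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe. The other strand comprises, from 5' to 3': the P5 sequence that binds the chip probe, the Read1 sequencing primer binding region sequence, the internal Index sequence, and a T-base overhang. The internal Index sequence regions of the two strands are fully complementary, and the sequencing primer binding region sequences are partially complementary. The structure is shown in Figure 5; internal Index sequences of three lengths, 6 bp, 7 bp and 8 bp, are used.
A total of 48 internal Index sequences were designed in this example. They are divided into 16 groups, each containing one 6 bp, one 7 bp and one 8 bp internal Index sequence.
The internal Index sequences meet the following requirements: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG"; (4) the 7th base of a 7 bp Index and the 7th base of an 8 bp Index are not T, and the 8th base of an 8 bp Index is not T; (5) the base composition at each sequencing position of the Indexes in a combination is adjusted manually to reach a relative balance.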
The specific sequences and design are as follows:
Figure PCTCN2022071549-appb-000001
Figure PCTCN2022071549-appb-000002
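153 libraries were built with the internal Index adapters described above and 153 with the traditional Illumina TruSeq adapters, and each set was then sequenced in batches. The internal Index adapter libraries were sequenced in 8 runs of about 18-20 libraries each, with adapters of each internal Index length making up about one third of the adapters used in a run. The TruSeq adapter libraries were sequenced in 5 runs of about 30-31 libraries each. The sequencing quality of the two adapter types was compared; the results are shown in Figure 6 (base composition at each cycle for the two adapter types) and Figure 7 (sequencing quality and final library data yield for the two adapter types).
As Figure 6 shows, the optimized internal Index adapters give a fairly balanced base composition. Only the proportion of T detected in the 9th cycle is slightly higher (relative to the TruSeq adapters), and this has no effect on sequencing quality.
As Figure 7 shows, the optimized internal Index adapters maintain a high percentage of clusters passing filter and a high Q30 score, and these quality metrics show no significant difference from the TruSeq adapter data. Data from the optimized internal Index adapters can be split either with the internal Index alone or with the internal Index plus Index1 as a dual Index, and the final library data yield also shows no significant difference from TruSeq.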
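Example 2
To evaluate the analytical performance of the system, two clinically positive samples were analyzed with the sequencing analysis system of the present invention. The clinical result for sample 1 was Legionella pneumophila infection, and the clinical result for sample 2 was Citrobacter koseri infection. The analysis times and detection results for the two samples are shown in Table 1 below. The number of Legionella pneumophila reads detected at each analysis cycle is shown in Figure 8, and the number of Citrobacter koseri reads detected at each analysis cycle is shown in Figure 9.
Table 1. Analysis time for the clinical samples
Sample      Extraction and library preparation time      Time to first report      Total time
Sample 1    4 h 05 min                                   5 h 09 min                9 h 14 min
Sample 2    3 h 46 min                                   5 h                       8 h 46 min
The analysis results show that in the first report of the ultra-fast analysis, at a read length of 22 bp, the system already detects the positive pathogenic bacteria sensitively. As sequencing progresses, the number of pathogen reads detected rises slowly and stabilizes after several cycles. For samples positive for pathogen infection, the system can therefore detect the pathogen at a very early stage and give reliable analysis results.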

Claims (10)

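  1. A sequencing adapter, characterized in that it has a partially complementary Y-shaped structure, wherein one strand comprises, from 5' to 3': an internal Index sequence, an Index1 sequencing primer binding region sequence, an Index1 sequence, and a P7 sequence that binds the chip probe; and the other strand comprises, from 5' to 3': a P5 sequence that binds the chip probe, a Read1 sequencing primer binding region sequence, the internal Index sequence, and a T-base overhang.
  2. The sequencing adapter according to claim 1, characterized in that, in use, adapters with internal Index sequences of different lengths are combined to label and sequence multiple samples.
  3. The sequencing adapter according to claim 2, characterized in that adjacent internal Index lengths differ by one base.
  4. The sequencing adapter according to claim 2, characterized in that, in use, adapters with internal Index sequences of two to four lengths are combined to label and sequence multiple samples.
  5. The sequencing adapter according to claim 1, characterized in that all internal Index sequences in use, once combined, give a roughly balanced base composition of the internal Index sequences in each sequencing cycle.
  6. The sequencing adapter according to claim 5, characterized in that, when 4 or more internal Indexes are used in one sequencing run, the proportion of each of the four bases A, T, C and G among the internal Index sequences in each sequencing cycle is suitably kept between 8% and 50%.
  7. The sequencing adapter according to claim 6, characterized in that, when 4 or more internal Indexes are used in one sequencing run, the proportion of each of the four bases A, T, C and G among the internal Index sequences in each sequencing cycle is optimally kept between 12.5% and 37.5%.
  8. The sequencing adapter according to claim 1, characterized in that all internal Index sequences in use satisfy: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index sequence are not "GG".
  9. The sequencing adapter according to claim 1, characterized in that an Index2 sequence may be added between the P5 sequence that binds the chip probe and the Read1 sequencing primer binding region sequence.
  10. A sequencing analysis system based on the above sequencing adapter, characterized in that it comprises:
    a sequencing monitoring module, for monitoring sequencing progress in real time and triggering analysis tasks;
    a data generation module, for converting the BCL files generated by sequencing into fastq files and filtering low-quality sequences, and for using a dedicated analysis program for the specially designed adapters to split the sequence data into the corresponding samples;
    a data filtering module, for removing human sequences from the sequences that passed quality control;
    a data analysis module, for aligning non-human sequences to a pathogenic microorganism genome database;
    a report generation module, for summarizing the alignment results and outputting the analysis report.
PCT/CN2022/071549 2021-11-19 2022-01-12 Sequencing adapter and sequencing analysis system thereof WO2023087527A1 (zh)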

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111374708.XA CN114107290A (zh) 2021-11-19 2021-11-19 Sequencing adapter and sequencing analysis system thereof
CN202111374708.X 2021-11-19

Publications (1)

Publication Number Publication Date
WO2023087527A1 true WO2023087527A1 (zh) 2023-05-25

Family

ID=80396782

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071549 WO2023087527A1 (zh) 2021-11-19 2022-01-12 Sequencing adapter and sequencing analysis system thereof

Country Status (2)

Country Link
CN (1) CN114107290A (zh)
WO (1) WO2023087527A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130196331A1 (en) * 2010-03-31 2013-08-01 Etsuko Miyamoto Constitution of tool for analyzing biomolecular interaction and analysis method using same
CN108949941A (zh) * 2018-06-25 2018-12-07 北京莲和医学检验所有限公司 低频突变检测方法、试剂盒和装置
CN109439729A (zh) * 2018-12-27 2019-03-08 上海鲸舟基因科技有限公司 检测低频变异用的接头、接头混合物及相应方法
CN109680054A (zh) * 2019-01-15 2019-04-26 北京中源维康基因科技有限公司 一种低频dna突变的检测方法
US20210024993A1 (en) * 2018-03-22 2021-01-28 Inivata Ltd. Methods of sequencing nucleic acids and error correction of sequence reads
CN112626189A (zh) * 2020-04-24 2021-04-09 北京吉因加医学检验实验室有限公司 基因测序仪的短接头、双index接头引物和双index建库体系
CN112795990A (zh) * 2019-11-14 2021-05-14 广州华大基因医学检验所有限公司 一种灵活多变的降低污染及pcr偏倚的多标签二代测序文库接头

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012055929A1 (en) * 2010-10-26 2012-05-03 Illumina, Inc. Sequencing methods
US11447818B2 (en) * 2017-09-15 2022-09-20 Illumina, Inc. Universal short adapters with variable length non-random unique molecular identifiers
CN108893466B (zh) * 2018-06-04 2021-04-13 上海奥根诊断技术有限公司 测序接头、测序接头组和超低频突变的检测方法


Also Published As

Publication number Publication date
CN114107290A (zh) 2022-03-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894076

Country of ref document: EP

Kind code of ref document: A1