CN115620809B

CN115620809B - Nanopore sequencing data analysis method and device, storage medium and application

Info

Publication number: CN115620809B
Application number: CN202211621058.9A
Authority: CN
Inventors: 郎继东
Original assignee: Qitan Technology Ltd Beijing
Current assignee: Qitan Technology Ltd Beijing
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-04-07
Anticipated expiration: 2042-12-16
Also published as: CN115620809A

Abstract

The invention discloses a nanopore sequencing data analysis method, a nanopore sequencing data analysis device, a storage medium, and applications thereof. The invention obtains multi-dimensional biological information from a single detection and, by exploiting the long read length of nanopore sequencing and its ability to read modification sites directly, can effectively distinguish biological samples, thereby providing a basis for interpreting subsequent clinical reports and guiding medication. Compared with second-generation sequencing solutions for methylation detection and tissue tracing, the method greatly simplifies the experimental procedure and reduces experimental complexity, sequencing data volume, and sequencing cost. It thereby brings greater economic benefit, lowers the difficulty of data analysis, shortens the detection cycle, and is better suited to actual clinical testing requirements.

Description

Nanopore sequencing data analysis method and device, storage medium and application

Technical Field

The invention relates to the field of bioinformatics, and in particular to a nanopore sequencing data analysis method, a nanopore sequencing data analysis device, a storage medium, and a method for acquiring genetic information from a biological sample and classifying the biological sample.

Background

DNA methylation is a form of chemical modification of DNA in which a methyl group is covalently bonded to the carbon-5 position of the cytosine in a genomic CpG dinucleotide by DNA methyltransferase. Numerous studies have shown that DNA methylation can change chromatin structure, DNA conformation, DNA stability, and the way DNA interacts with proteins, thereby controlling gene expression. DNA methylation is therefore becoming increasingly important.

In recent years, liquid biopsy based on next-generation sequencing (NGS) of cell-free DNA (cfDNA) and circulating tumor DNA (ctDNA) in blood has developed rapidly. It offers an opportunity to learn about organs and tissues from a blood sample, and to trace the tissue or organ of origin of cfDNA using information such as the nucleosome degradation pattern of cfDNA, organ-specific methylation sites, and the correlation between adjacent methylation sites. This research shows the potential of cfDNA/ctDNA methylation detection for accurately detecting disease in different parts of the body, and lays a foundation for clinical application. Research on methylation in tumor diagnosis and treatment is currently developing rapidly.

Nanopore sequencing technology (NST), also known as fourth-generation sequencing or single-molecule real-time DNA sequencing, sequences each DNA molecule individually without PCR amplification. Whereas NGS detects only short reads of a few hundred bases, nanopore read lengths can reach thousands to tens of thousands of bases, and even ultra-long reads of several megabases, which favors analysis of the characteristics and length distribution of the original cfDNA/ctDNA fragments. In addition, modification information can be read directly from the sequenced molecule: the system records the change in ionic current as single-stranded DNA passes through the nanopore, and because methylated and unmethylated DNA produce different currents, the methylation level of the DNA at different sites can be measured.

Currently, a commonly used technique for detecting DNA methylation in liquid biopsy samples (e.g., cfDNA) is to obtain the methylation level of each site across the whole genome by second-generation high-throughput sequencing. The DNA is processed experimentally by one of three main methods: bisulfite conversion; enrichment with methylation-specific antibodies or by MBD (Methyl-CpG-Binding Domain) affinity; or restriction enzyme digestion combined with bisulfite treatment (RRBS). Although some panel designs that target methylation sites reduce data volume and sequencing cost, the experimental workflow is not optimized and remains complicated, and treatments such as bisulfite degrade DNA to varying degrees, causing partial loss of methylation information and obscuring DNA fragmentation characteristics. Analysis of DNA fragment characteristics and length distribution runs directly into the read-length limitation of NGS: a complete analysis is not possible, and even a partial one requires large amounts of data from high-depth whole-genome sequencing together with complicated analysis. In particular, because their experimental principles differ, DNA methylation detection, analysis of DNA fragment characteristics and length distribution, and detection of cancer-targeted hotspots in liquid biopsy samples are performed as separate services. This multi-dimensional information therefore cannot be obtained in a single detection, which increases the amount of input DNA required and the difficulty and complexity of the experiment, and greatly increases the cost of sequencing and data analysis.

The information in this background is only for the purpose of illustrating the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art that is known to a person skilled in the art.

Disclosure of Invention

To solve at least some of the technical problems of the prior art, the present invention provides a nanopore sequencing data analysis method, apparatus, storage medium, and applications. The invention utilizes the nanopore sequencing technology to better solve at least part of problems in the prior art in both experiments and data analysis. Specifically, the present invention includes the following.

In a first aspect of the invention, there is provided a nanopore sequencing data analysis method, comprising:

acquiring current signal data of nanopore sequencing, wherein the current signal data comprises at least a time-series current signal Ion-A containing information in at least two dimensions: a transverse time dimension and a longitudinal signal intensity dimension;

performing base recognition analysis (base calling) on Ion-A to obtain sequencing data Data-A, and analyzing fragment characteristics based on Data-A;

carrying out methylation detection on a target site based on Ion-A to obtain methylation information of the target site; and

classifying the biological sample according to the fragment characteristics and the methylation information.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the fragment characteristics comprise at least one of a length distribution characteristic, a motif characteristic, and a tissue characteristic.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the length distribution characteristic comprises screening the sequences in the sequencing data to retain reads that align uniquely to the human reference genome and are not soft-clipped, counting the lengths of the retained reads, and plotting a length distribution to obtain the length distribution characteristic.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the motif characteristic comprises screening the sequences in the sequencing data to retain reads that align uniquely to the human reference genome and are not soft-clipped, and counting the frequency or relative abundance of the k-mer motif at the start of each read, where 4 ≤ k ≤ 10, to obtain the motif characteristic.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the tissue characteristic comprises screening the sequences in the sequencing data to retain reads that align uniquely to the human reference genome and are not soft-clipped, selecting sequence fragments within a specified length range, comparing them with expression profile data of cell line and primary tissue reference samples and calculating the correlation, and performing tissue tracing analysis to obtain the tissue characteristic.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the methylation detection comprises cutting Ion-A with a sliding window along the time direction at a specified step size to obtain a set DST of different current signal fragments, and performing similarity alignment analysis between each current signal fragment in the set DST and a reference signal fragment set DSR, wherein the reference signal fragment set DSR comprises a methylated signal fragment subset and an unmethylated signal fragment subset.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the methylation detection further comprises methylation discrimination based on the similarity of the alignment. The target site is interpreted as methylated if the ratio of the number of current signal fragments in the set DST whose alignment is consistent with the methylated signal fragment subset to the number whose alignment is consistent with the unmethylated signal fragment subset is greater than 1, and as unmethylated if that ratio is less than 1.

In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, construction of the reference signal fragment set DSR comprises synthesizing a first sequence fragment containing the methylated target site and a corresponding second sequence fragment containing the unmethylated target site, obtaining by nanopore sequencing a first reference signal fragment corresponding to the first sequence fragment and a second reference signal fragment corresponding to the second sequence fragment, composing the methylated signal fragment subset from a plurality of the first reference signal fragments, and composing the unmethylated signal fragment subset from a plurality of the second reference signal fragments.

In a second aspect of the present invention, there is provided a nanopore sequencing data analysis device, comprising:

a. a data acquisition module configured to acquire current signal data for nanopore sequencing, the current signal data comprising a time-series current signal Ion-A;

b. a data processing module configured to perform fragment feature analysis and targeted site methylation detection analysis based on the current signal Ion-A; preferably, the fragment feature analysis includes performing base recognition analysis on Ion-A to obtain sequencing data Data-A; preferably, the targeted site methylation detection analysis comprises cutting Ion-A with a sliding window along the time direction at a specified step size to obtain a set DST of different current signal fragments, and performing similarity alignment analysis between each current signal fragment in the set DST and a reference signal fragment set DSR, wherein the reference signal fragment set DSR comprises a methylated signal fragment subset and an unmethylated signal fragment subset; preferably, methylation discrimination is further performed according to the similarity of the alignment, the target site being interpreted as methylated if the ratio of the number of current signal fragments in the set DST whose alignment is consistent with the methylated signal fragment subset to the number whose alignment is consistent with the unmethylated signal fragment subset is greater than 1, and as unmethylated if that ratio is less than 1;

c. a data storage module for storing at least the reference signal fragment set DSR;

preferably, further comprising:

d. a display module for displaying the interpretation result obtained from the analysis by the data processing module.

In a third aspect of the present invention, there is provided a computer storage medium having a computer program stored therein, the computer program, when executed by a computer, implementing the method of the first aspect.

In a fourth aspect of the present invention, there is provided a method of obtaining genetic information in a biological sample, comprising the steps of sequencing DNA in the biological sample using nanopore technology, and analysing the sequencing data using a method according to the first aspect;

preferably, the biological sample is selected from at least one of blood, saliva and urine;

preferably, the DNA is cell-free DNA.

The method of the invention for identifying biological samples and detecting targeted hotspot mutations based on nanopore sequencing technology obtains multi-dimensional biological information from a single detection and, by exploiting the long read length of nanopore sequencing and its ability to read modification sites directly, can effectively distinguish biological samples, thereby providing a basis for interpreting subsequent clinical reports and guiding medication. After further analysis, the results obtained by the method can be applied to the detection and postoperative monitoring of cancer samples from liquid biopsy sources, such as monitoring of minimal residual disease (MRD), and, to a certain extent, to early screening and traceability analysis of cancers.

In conclusion, compared with NGS-based (second-generation sequencing) solutions for methylation detection and tissue tracing, the method greatly simplifies the experimental procedure and reduces experimental complexity, sequencing data volume, and sequencing cost. It thereby brings greater economic benefit, lowers the difficulty of data analysis, shortens the detection cycle, and is better suited to actual clinical testing requirements.

Drawings

FIG. 1 is an exemplary biological sample analysis flow diagram.

FIG. 2 is another exemplary biological sample analysis flow diagram.

FIG. 3 is a schematic diagram of the identification of methylation sites in the current signal.

FIG. 4 shows the fragment length distribution of samples with mutant and wild-type T790M of the EGFR gene.

FIG. 5 is a schematic diagram of an exemplary nanopore sequencing data analysis device.

Description of reference numerals:

100: nanopore sequencing data analysis device; 110: data acquisition module; 120: data storage module; 130: data processing module; 140: data display module; 210: internet or cloud terminal; 220: nanopore sequencer.

Detailed Description

Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but rather as a more detailed description of certain aspects, features, and embodiments of the invention.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference herein for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control. Unless otherwise indicated, "%" is percent by weight.

Data analysis method

(1) Acquiring current signal data for nanopore sequencing, wherein the current signal data comprises at least a time-series current signal Ion-A containing information in at least two dimensions: a transverse time dimension and a longitudinal signal intensity dimension;

(2) Performing base recognition analysis (base calling) on Ion-A to obtain sequencing data Data-A, and analyzing the fragment characteristics based on Data-A;

(3) Carrying out methylation detection on a target site based on Ion-A to obtain methylation information of the target site;

(4) Classifying the biological sample according to the fragment characteristics and the methylation information.

It is understood by those skilled in the art that the numbers (1), (2), etc. are only for the purpose of distinguishing different steps, and do not indicate the order of the steps. The order of the above steps is not particularly limited as long as the object of the present invention can be achieved. In addition, two or more of the above steps may be combined and performed simultaneously. It will be appreciated by those skilled in the art that additional steps or operations may be included before or after steps (1) - (4) above, or between any of these steps, for example to further optimize and/or improve the methods of the present invention.

Step (1)

In the present invention, step (1) is a data acquisition step. The current signal data can be obtained directly from a sequencer that generates it from a biological sample, or retrieved from a memory storing such data, such as a hard disk, a computer, the internet, or a cloud. There may be one or more such memories or clouds. The nanopore sequencing current signal data of the invention are the current signals generated directly as bases pass through the nanopore during sequencing, and include the time-series current signal Ion-A, among others. The current signal Ion-A comprises data in at least two dimensions, namely a transverse time dimension and a longitudinal signal intensity dimension.

Step (2)

In the invention, the step (2) is a fragment characteristic analysis step, which comprises the steps of carrying out base recognition analysis on Ion-A to obtain sequencing base Data-A, and analyzing fragment characteristics based on the Data-A.

In the invention, base recognition analysis (base calling) refers to recognition based on a deep learning model or other algorithm, and can be performed using a known model or algorithm. For example, the open-source basecaller Nanocall uses a traditional machine learning approach, a hidden Markov model (HMM), to describe the relationship between nucleotide sequence and electrical signal. In this algorithm, the observed value is the electrical signal of each event, and the hidden state is the DNA sequence of k nucleotides corresponding to the current event, where k is a hyperparameter giving the number of nucleotides within the pore that the model assumes jointly determine the electrical signal of the current event. DeepNano applies deep learning to the basecalling problem, using a recurrent neural network (RNN), a model suited to sequence problems, to translate the electrical signal. The open-source basecaller Chiron improves translation accuracy with a more complex deep learning framework combining a CNN, an RNN, and a CTC decoder. Although all three basecallers translate sequences from the event segmentation results provided by ONT, in practice those segmentation results are not necessarily accurate, because real DNA sequences carry a large amount of modification information.

In the present invention, fragment feature analysis covers at least one of a length distribution feature, a motif feature, and a tissue feature. Preferably, the purpose of fragment feature analysis is to determine whether the DNA in the liquid sample has the characteristics of cfDNA or of ctDNA.

In the present invention, the sequencing data are preferably screened before fragment feature analysis, for example by screening the sequencing reads to retain only those that align uniquely to the human reference genome and are not soft-clipped.
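The screening step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `AlignedRead` record, the `min_mapq` cut-off of 30, and the use of mapping quality plus SAM flag bits as a proxy for "unique alignment" are all assumptions made for the example. Only the SAM flag values and CIGAR operation letters follow the SAM format.

```python
import re
from dataclasses import dataclass

# SAM flag bits (from the SAM format): unmapped, secondary, supplementary.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


@dataclass
class AlignedRead:
    """Hypothetical minimal alignment record used only for this sketch."""
    name: str
    sequence: str
    mapq: int
    cigar: str
    flag: int = 0


def is_soft_clipped(cigar: str) -> bool:
    """Return True if any CIGAR operation is a soft clip ('S')."""
    return any(op == "S" for _, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar))


def keep_read(read: AlignedRead, min_mapq: int = 30) -> bool:
    """Keep reads that look uniquely aligned and carry no soft clipping.

    Assumption: "unique alignment" is approximated by a primary alignment
    (no unmapped/secondary/supplementary flag) with MAPQ >= min_mapq.
    """
    if read.flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if read.mapq < min_mapq:
        return False
    return not is_soft_clipped(read.cigar)


def screen_reads(reads, min_mapq: int = 30):
    """Return the subset of reads that pass the screening criteria."""
    return [r for r in reads if keep_read(r, min_mapq)]
```

In practice the records would come from an alignment file rather than hand-built objects; the threshold that best approximates uniqueness depends on the aligner and would need to be checked against its documentation.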

In an exemplary embodiment, the fragment feature analysis of the present invention includes a length distribution feature analysis, and the specific analysis method is not particularly limited, and illustratively includes counting the lengths of the screened read-length sequences and plotting the length distribution to obtain the length distribution feature.
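As a rough sketch of the length statistics described above, the following function bins the lengths of already-screened reads and returns the fraction of reads in each bin. The function name and the `bin_width` parameter are illustrative assumptions; the text does not prescribe a bin size or a plotting tool, so the output is left as a plain dictionary that any plotting library could consume.

```python
from collections import Counter


def length_distribution(sequences, bin_width: int = 10):
    """Return {bin_start: fraction_of_reads} for the given read sequences.

    Each read is assigned to the bin [bin_start, bin_start + bin_width).
    bin_width is an illustrative choice, not a value from the source text.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter((len(seq) // bin_width) * bin_width for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {start: counts[start] / total for start in sorted(counts)}
```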

In an exemplary embodiment, the fragment feature analysis of the present invention includes tissue feature analysis, and the specific analysis method is not particularly limited. Illustratively, the method comprises further selecting sequence fragments within a certain length range from the screened reads, comparing them with expression profile data of cell line and primary tissue reference samples and calculating the correlation, thereby performing tissue tracing analysis and obtaining the tissue features. The length range is preferably 120-180 bp, for example 120-160 bp, 130-180 bp, or 140-150 bp.
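The correlation step might be sketched as below, under strong assumptions. The text does not say how a sample-side feature vector is derived from the selected fragments, nor which correlation measure is used. Here `sample_profile` is assumed to be a per-gene numeric vector computed elsewhere, `reference_profiles` is a hypothetical mapping from reference sample name to an expression vector over the same genes, and Pearson correlation is a stand-in choice. Ranking by absolute correlation is also an assumption, since the meaningful sign depends on how the sample feature is defined.

```python
import math


def select_by_length(sequences, min_len: int = 120, max_len: int = 180):
    """Keep fragments whose length falls within [min_len, max_len] bp."""
    return [s for s in sequences if min_len <= len(s) <= max_len]


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def rank_references(sample_profile, reference_profiles):
    """Rank reference samples by |correlation| with the sample profile.

    reference_profiles: {reference_name: expression_vector} (hypothetical).
    Returns a list of (reference_name, correlation), strongest first.
    """
    scores = {
        name: pearson(sample_profile, profile)
        for name, profile in reference_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```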

In an exemplary embodiment, the fragment feature analysis of the present invention includes motif feature analysis, and the specific analysis method is not particularly limited. Illustratively, it includes counting, for the screened reads, the frequency or relative abundance of the k-mer motif at the start of each read, where k is a natural number of 4 or more, such as 5, 7, or 9. More preferably, k is 10 or less, for example 8 or less.
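A minimal sketch of the motif count, assuming the motif is the first k bases of each screened read and that reads shorter than k, or whose first k bases contain characters other than A, C, G, and T, are skipped (that skipping rule is an assumption of this example). The bounds 4 ≤ k ≤ 10 follow the range stated for the first aspect.

```python
from collections import Counter


def end_motif_frequencies(sequences, k: int = 4):
    """Return {motif: relative_abundance} for the first k bases of each read."""
    if not 4 <= k <= 10:
        raise ValueError("k is expected to satisfy 4 <= k <= 10")
    counts = Counter()
    for seq in sequences:
        motif = seq[:k].upper()
        # Assumption: ignore reads that are too short or contain ambiguous bases.
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}
```

The resulting frequencies could then be compared between a test sample and normal reference samples, which is how the text describes the motif feature being used.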

In certain embodiments, the fragment feature analysis of the invention proceeds as follows. First, the reads are screened to retain those that align uniquely to the human reference genome and are not soft-clipped. Second, the lengths of the retained reads are counted and a length distribution is plotted to obtain the length distribution features. Third, the frequency or relative abundance of the k-mer motif at the start of each read is counted to obtain the motif features. Finally, sequence fragments within a certain length range, preferably 120-180 bp, are selected and compared with expression profile data of cell line and primary tissue reference samples, and the correlation is calculated, so as to perform tissue tracing analysis and obtain the tissue features. The features of cfDNA include, but are not limited to, fragment length enrichment at about 167 bp (length distribution feature), strong correlation with lymphoid cell lines, myeloid cell lines, or bone marrow tissue (tissue feature), and a relative abundance of cancer-associated motifs at the level of normal samples (motif feature). The features of ctDNA include, but are not limited to, fragment length enrichment at about 100-160 bp (length distribution feature), strong correlation with cancer cell lines (tissue feature), and a relative abundance of cancer-associated motifs lower than that of normal samples (motif feature).

Step (3)

Step (3) of the present invention is methylation detection at the target site based on Ion-A. It generally comprises cutting Ion-A with a sliding window along the time direction at a specified step size to obtain a set DST of different current signal fragments, and performing similarity alignment analysis between each current signal fragment in the set DST and a reference signal fragment set DSR, wherein the reference signal fragment set DSR comprises a methylated signal fragment subset and an unmethylated signal fragment subset.
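The sliding cut that produces the set DST can be sketched as a plain sliding window over the sampled current values. The parameter names `window` and `step` are illustrative, and both are expressed in samples; how a fragment length in bases maps to a number of samples depends on the sequencing platform and is not specified here.

```python
def sliding_fragments(signal, window: int, step: int):
    """Cut a time-ordered current signal into overlapping fragments (set DST).

    signal: sequence of current values ordered by time (Ion-A).
    window: fragment length in samples (assumed unit).
    step:   offset between the starts of consecutive fragments, in samples.
    Trailing samples that do not fill a whole window are dropped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        list(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, step)
    ]
```

With a step smaller than the window, consecutive fragments overlap, so a target site is covered by several fragments at different offsets.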

In the present invention, a current signal fragment is a continuous portion of the current signal Ion-A recorded over the full length of a DNA sequence. Its length is not particularly limited and can be chosen freely by a person skilled in the art according to the length of the methylation target sequence. In general, the corresponding sequence length is, for example, 10 to 150 bp, preferably 20 to 90 bp, more preferably 25 to 80 bp, for example 20, 25, 30, 35, 40, 45, 50, 60, or 70 bp. The sequence corresponding to a current signal fragment is generally not shorter than 10 bp.

In the present invention, the reference signal fragment set DSR is a set comprising a methylated signal fragment subset and an unmethylated signal fragment subset. The methylated subset generally consists of at least one methylated signal fragment, which is not particularly limited as long as it includes the current signal corresponding to the methylated target site; the subset may hold a single signal fragment for a given sequence or several different signal fragments for the same sequence. The position of the methylated site within the corresponding DNA sequence fragment is not particularly limited, so different positions of the methylation site within the DNA sequence fragment correspond to different methylated signal fragments. Similarly, the unmethylated subset generally consists of at least one unmethylated signal fragment, which is not particularly limited as long as it includes the current signal corresponding to the unmethylated target site; it too may hold a single signal fragment for a given sequence or several different signal fragments for the same sequence.

In the present invention, the reference signal fragment set DSR is typically a pre-constructed set of standard reference signal fragments. In an exemplary embodiment, construction of the reference signal fragment set DSR includes synthesizing a first base sequence fragment containing the methylated target site and a second base sequence fragment containing the unmethylated target site, and obtaining by nanopore sequencing a methylated signal fragment corresponding to the first base sequence fragment and an unmethylated signal fragment corresponding to the second base sequence fragment. This yields a methylated signal fragment subset consisting of a plurality of methylated signal fragments and an unmethylated signal fragment subset consisting of a plurality of unmethylated signal fragments, which together constitute the reference signal fragment set DSR. The number of reference signal fragments in the set DSR is not limited, and may be 1 or more, 5 or more, 10 or more, 20 or more, 50 or more, 100 or more, and so on.

In an exemplary embodiment, the construction of the reference signal fragment set DSR of the present invention comprises: synthesizing the target fragment of the target gene in its unmethylated form, sequencing it on a nanopore sequencing platform to obtain the corresponding current signal values, and repeating this step at least 5 times to obtain the subset of current signal values for the unmethylated target fragment; then synthesizing the target fragment of the target gene in its methylated form, sequencing it through the same nanopore to obtain the corresponding current signal values, and repeating this step at least 5 times to obtain the subset of current signal values for the methylated target fragment. The reference signal fragment set DSR is composed of these two subsets.
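One way to hold such a reference set in memory is sketched below. The class name, field names, and `add_run` method are hypothetical; only the two-subset structure and the minimum of 5 repeated sequencing runs per subset come from the text.

```python
from dataclasses import dataclass, field
from typing import List

# The text calls for at least 5 repeated sequencing runs per subset.
MIN_REPLICATES = 5


@dataclass
class ReferenceSignalSet:
    """Hypothetical container for the reference signal fragment set DSR."""
    target_site: str
    methylated: List[List[float]] = field(default_factory=list)
    unmethylated: List[List[float]] = field(default_factory=list)

    def add_run(self, signal, is_methylated: bool) -> None:
        """Store the current signal from one sequencing run of a synthetic fragment."""
        if not signal:
            raise ValueError("signal must not be empty")
        subset = self.methylated if is_methylated else self.unmethylated
        subset.append(list(signal))

    def is_complete(self) -> bool:
        """True once both subsets hold at least MIN_REPLICATES signals."""
        return (
            len(self.methylated) >= MIN_REPLICATES
            and len(self.unmethylated) >= MIN_REPLICATES
        )
```

Keeping every replicate, rather than averaging them, preserves run-to-run variation for the later similarity comparison; whether the original method averages or keeps replicates is not stated.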

Recent research has found that, contrary to the traditional understanding, cfDNA includes long fragments of more than 600 bp and even up to 23 kb (Yu SCY, Jiang P, Peng W, et al. Single-molecule sequencing reveals a large population of long cell-free DNA molecules in maternal plasma. Proc Natl Acad Sci U S A. 2021;118(50):e2114937118. doi:10.1073/pnas.2114937118), and the methylation information in long cfDNA fragments is all the more important. Conventional methylation detection of cell-free DNA is carried out with NGS. Whereas NGS detects only short reads of a few hundred bases, nanopore sequencing read lengths can reach thousands to tens of thousands of bases, and even ultra-long reads of several megabases. This favors analysis of the characteristics and length distribution of the original cfDNA/ctDNA fragments, particularly long cell-free DNA fragments, and makes it possible to obtain DNA methylation information at sites different from those covered by conventional NGS methods.

In certain embodiments, the methylation detection of the invention further comprises methylation discrimination based on the similarity of the alignments. The target site is interpreted as methylated if the ratio of the number of current signal fragments in the set DST whose alignment is consistent with the methylated signal fragment subset to the number whose alignment is consistent with the unmethylated signal fragment subset is greater than 1, and as unmethylated if that ratio is less than 1.
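The discrimination rule can be sketched as a vote over the fragments in DST. The text does not name a similarity measure, so the z-score normalisation, the mean squared distance, and the `max_distance` cut-off below are all assumptions of this example. The text also does not say what happens when the ratio equals 1 or when no fragment matches either subset; the sketch returns "undetermined" in those cases.

```python
import math


def zscore(fragment):
    """Normalise a fragment to zero mean and unit variance (assumed preprocessing)."""
    mean = sum(fragment) / len(fragment)
    var = sum((x - mean) ** 2 for x in fragment) / len(fragment)
    std = math.sqrt(var) or 1.0  # avoid division by zero for flat fragments
    return [(x - mean) / std for x in fragment]


def distance(a, b) -> float:
    """Mean squared difference between two normalised, equal-length fragments."""
    za, zb = zscore(a), zscore(b)
    return sum((x - y) ** 2 for x, y in zip(za, zb)) / len(za)


def best_distance(fragment, references) -> float:
    """Smallest distance from the fragment to any same-length reference."""
    return min(
        (distance(fragment, ref) for ref in references if len(ref) == len(fragment)),
        default=math.inf,
    )


def call_methylation(dst, methylated_refs, unmethylated_refs, max_distance=0.5):
    """Return 'methylated', 'unmethylated', or 'undetermined' for a target site."""
    meth_hits = unmeth_hits = 0
    for fragment in dst:
        d_meth = best_distance(fragment, methylated_refs)
        d_unmeth = best_distance(fragment, unmethylated_refs)
        if min(d_meth, d_unmeth) > max_distance:
            continue  # fragment resembles neither reference subset
        if d_meth < d_unmeth:
            meth_hits += 1
        elif d_unmeth < d_meth:
            unmeth_hits += 1
    # meth_hits / unmeth_hits > 1 is equivalent to meth_hits > unmeth_hits,
    # and the comparison form avoids dividing by zero.
    if meth_hits > unmeth_hits:
        return "methylated"
    if meth_hits < unmeth_hits:
        return "unmethylated"
    return "undetermined"
```

A measure that tolerates variation in translocation speed, such as dynamic time warping, may suit real nanopore signals better than this point-wise distance; the simple distance is used here only to keep the example short.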

Step (4)

Step (4) of the present invention is to classify the biological sample based on the fragment characteristics and the methylation information. According to the invention, both the targeted site methylation detection analysis and the fragment characteristic analysis can be carried out from a single sequencing run of a sample, and the sample is classified based on the fragment characteristics and the methylation information. For example, the sample is considered normal or healthy when methylation detection finds the target site unmethylated and the fragment characteristic analysis shows the characteristics of cfDNA. If methylation detection finds the target site methylated, or the fragment characteristic analysis shows the characteristics of ctDNA, the sample is considered a potential cancer sample.
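The example rule above can be written as a small decision function. The function name, the string labels, and the "undetermined" fallback for inputs the rule does not cover (for instance, an unmethylated site with an inconclusive fragment profile) are assumptions made for this sketch.

```python
def classify_sample(site_methylated: bool, fragment_profile: str) -> str:
    """Classify a sample from the methylation call and the fragment profile.

    fragment_profile: 'cfDNA' or 'ctDNA' (hypothetical labels); any other
    value is treated as an inconclusive fragment analysis.
    """
    # Methylated target site OR ctDNA-like fragments: potential cancer sample.
    if site_methylated or fragment_profile == "ctDNA":
        return "potential cancer"
    # Unmethylated target site AND cfDNA-like fragments: normal sample.
    if fragment_profile == "cfDNA":
        return "normal"
    # Case not covered by the example rule in the text.
    return "undetermined"
```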

Analysis device

In a second aspect of the invention, a nanopore sequencing data analysis device is provided. The analysis device of the present invention may be, for example, an electronic device, such as a computer, a processor, etc., which includes at least one data acquisition module, a data processing module, and a data storage module, optionally further includes other modules, such as a display module, or further includes a bus connecting different modules, components, or assemblies (including a storage unit and a processing unit). The "module" and "unit" have the same meaning in the present invention.

The analysis device of the invention comprises, by way of example, the following a-c and optionally further modules d:

a. a data acquisition module configured to acquire at least current signal data of nanopore sequencing, the current signal data including a time-series current signal Ion-A, and optionally configured to further acquire sequencing data Data-B of mutation hot spots;

b. a data processing module configured to perform fragment feature analysis and targeted site methylation detection analysis based on the current signal Ion-A, and optionally further configured to perform targeted mutation site detection analysis on Data-B.

In the present invention, the data storage module stores at least the reference signal segment set DSR and program code executable by the data processing module to cause the data processing module to perform the method of the present invention. Optionally, the data storage module also stores the data acquired by the data acquisition module. The data storage module may further include programs/utilities having a set of (at least one) program modules including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may comprise an implementation of a network environment.

The bus may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface. The electronic device may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via a network adapter. The network adapter communicates with the other modules of the electronic device over the bus.

The analysis of the fragment characteristics and the detection and analysis of the methylation of the target site in the data processing module of the analysis device based on the current signal Ion-A are the same as those described in the first aspect of the invention, and are not repeated here. The following is a further description of the analysis of Data-B only.

The data processing module of the present invention is optionally further configured to perform targeted mutation site detection analysis on Data-B, where Data-B is the sequencing data obtained from a nanopore sequencing library Lib-B directed at mutation hot spots. Lib-B may be a second nanopore sequencing library, capturing the targeted mutation hot spots, that is established after the biological sample has been judged to be a potential cancer sample; the sequencing base data are obtained after Lib-B is sequenced on a nanopore sequencer and subjected to base recognition (basecalling) analysis. Further, the data processing module performs filtering and medical report interpretation on the obtained analysis results, where the medical report interpretation includes, but is not limited to, interpretation of the drug resistance that a hot-spot mutation confers against tumor drugs. Optionally, Data-B is further subjected to a length distribution analysis of the sequencing reads carrying hot-spot mutations, and the result is used to verify the fragment characteristics obtained by the fragment characteristic analysis of the present application, thereby increasing the credibility of the cancer discrimination.

Computer storage medium

In a third aspect of the invention, a computer storage medium is provided, storing at least a computer program which, when executed by a computer, implements the method of the first aspect of the invention. The storage medium of the present invention may be a readable medium in the form of a magnetic disk, an optical disk, a volatile memory unit, such as a random access memory unit (RAM) and/or a cache memory unit, and may further include a read only memory unit (ROM).

Method for obtaining gene information in biological sample

In a fourth aspect of the invention, there is provided a method of obtaining genetic information from a biological sample, comprising the steps of sequencing DNA in the biological sample using nanopore sequencing technology, and analysing the sequencing data using the method of the first aspect of the invention.

In the invention, the nanopore sequencing can be carried out on a currently known platform, including the MinION nanopore sequencer of ONT, the QNOME-3841 sequencer of Qitan Technology Ltd., Beijing, and the like. Nanopore sequencing can sequence long fragments, such as fragments of 700 bp, 1 kbp, 2 kbp, 5 kbp, 6 kbp, 8 kbp, 1 Mbp, 2 Mbp or more.

In an exemplary embodiment, sequencing using the nanopore sequencing technology of the present invention further comprises the steps of extracting DNA from the biological sample and preparing a sequencing library. The biological sample is not limited, and examples thereof include, but are not limited to, blood or components thereof (e.g., serum, plasma), saliva, and urine. The DNA of the present invention is not limited, and examples thereof include cell-free DNA such as cfDNA or ctDNA. The sequencing library preparation method of the present invention is not limited and can be performed using a known kit.

In certain embodiments, the sequencing library is a 1D library, in which the plus and minus strands are separated and sequenced independently. In an exemplary embodiment, the 1D library preparation comprises filling in both ends of the DNA (end repair), adding an A to the ends, ligating the adapter, and then adding the Tether protein to adsorb the DNA strands onto the membrane of the sequencing chip. In another exemplary embodiment, the preparation of the 1D library comprises mixing adapter-loaded transposase with long DNA, so that the enzyme cleaves the long DNA and adds the adapter at the breakpoint, and then adding the motor protein and the Tether protein for sequencing.

In certain embodiments, the sequencing library is a 1D² library, prepared by ligating 1D² adapters to both ends of the DNA and then attaching the sequencing adapter, the motor protein and the Tether protein. The 1D² adapter allows the minus strand to be sequenced immediately after the plus strand; because the two strands are complementary, the two sequences can correct each other, which improves the accuracy of base calling.

In certain embodiments, library construction further comprises a library enrichment step, such as probe capture and the like.

In certain embodiments, the invention includes the steps of constructing a first nanopore sequencing library from the DNA extracted from the biological sample, and creating a second nanopore sequencing library that captures targeted mutation hot spots.

Example 1

1. Blood samples of 1 non-small cell lung cancer sample, 1 liver cancer sample, 1 small cell lung cancer sample and 1 normal sample were collected in EDTA vacuum tubes. The following description takes the non-small cell lung cancer sample as an example; steps 1 to 7 are repeated in the same way for the remaining samples. The sample was centrifuged at low temperature and low speed. Plasma was separated and collected by pipette without disturbing the pelleted blood cells. cfDNA was extracted from 10 ml of plasma using the QIAamp Circulating Nucleic Acid Kit. Quantification was performed using Qubit, and quality control of the DNA fragments was performed on an Agilent 2100 Bioanalyzer. The extracted cfDNA was stored at -80 °C.

2. For the cfDNA extracted in step 1, a first nanopore sequencing library, designated Lib-A, was prepared using a commercial library construction kit, QLK-V1.1.1 (Qitan Technology Ltd., Beijing) or SQK-LSK109 (Oxford Nanopore Technologies), following the kit instructions.

3. Lib-A was sequenced using a QNOME-3841 sequencer (Qitan Technology Ltd., Beijing) or an Oxford Nanopore Technologies (ONT) sequencer such as the MinION to obtain the data file storing the sequencing current signal, which contains the current signal together with metadata such as sequencer chip information and channel information; the current signal is denoted Ion-A. Base recognition (basecalling) analysis was performed on the sequencing current signal Ion-A using the QNOME-3841 high-precision basecalling model and algorithm or the HAC (high-accuracy) model and algorithm of ONT, with at least 2 million sequencing reads required per library, to obtain the corresponding sequencing base data, denoted Data-A.

4. The Ion-A in step 3 was subjected to the targeted methylation gene detection of HOXA7, HOXA9, SHOX2 and RASSF1A (as in Table 1), and the Data-A was subjected to fragment feature analysis.

TABLE 1 methylation site information to differentiate lung cancer from normal/healthy samples

The detection of the targeted methylated genes from Ion-A comprises the following steps. First, Ion-A is cut in a sliding manner with a step size of 1 sampling point, the cut length being the length of the target sequence of the targeted methylated gene; the target sequence lengths of the four targeted methylated genes HOXA7, HOXA9, SHOX2 and RASSF1A are 114 bp, 89 bp, 108 bp and 74 bp, respectively. This yields the set of cut Ion-A segments. Second, using a dynamic time warping algorithm, the set of cut Ion-A segments is compared for signal similarity against the current signal set of the methylated gene target sequence (denoted Ion-methyl) and against the current signal set of the unmethylated gene target sequence (denoted Ion-unmethyl), giving the mean distance for each comparison. Finally, methylation is judged: a ratio of the mean distance for the Ion-methyl set to the mean distance for the Ion-unmethyl set greater than 1 is called positive (+). The principle of methylation identification based on current signals is shown in FIG. 3.
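The sliding cut and the dynamic time warping comparison can be illustrated with a short, dependency-free sketch. It is not the implementation used in the example. The function names, the absolute-difference cost, the averaging over every window/reference pair, and the expression of the window width in sampling points (the text gives target lengths in bp and does not state the conversion) are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], width: int, step: int = 1) -> List[Sequence[float]]:
    """Cut `signal` into overlapping windows of `width` sampling points."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row 0 of the cost matrix
    for x in a:
        curr = [inf] * (len(b) + 1)        # column 0 stays infinite
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def mean_dtw(windows, references) -> float:
    """Mean DTW distance over every (window, reference) pair (assumed averaging)."""
    return mean(dtw_distance(w, r) for w in windows for r in references)


def distance_ratio(ion_a, ion_methyl, ion_unmethyl, width: int, step: int = 1) -> float:
    """Ratio of the mean distance to Ion-methyl over the mean distance to Ion-unmethyl."""
    windows = sliding_windows(ion_a, width, step)
    # Note: raises ZeroDivisionError if the unmethylated mean distance is exactly 0.
    return mean_dtw(windows, ion_methyl) / mean_dtw(windows, ion_unmethyl)
```

The function returns the ratio itself and leaves the call to the caller, so that the threshold described in the text can be applied separately. A production version would normally use an optimized DTW library and normalized signals; the pure-Python loop here is only meant to make the recurrence visible.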

The fragment feature analysis of Data-A comprises the following steps. First, Data-A is aligned to the human reference genome hg19 (or hg38) using minimap2 and sorted using sambamba to produce a BAM file, and only sequencing reads that are uniquely aligned and not soft-clipped are retained. Second, the read lengths are counted and their distribution is plotted (as shown in FIG. 4). Overall, the fragment lengths are enriched at 165 bp (the first peak of the read length distribution); the second peak is at 144 bp, the third at 146 bp and the fourth at 158 bp, all shorter than the 165 bp at which cfDNA is enriched. Then, the relative abundance of the 4-mer motif at the start of each read is counted; the relative abundance of the CCCA motif is 1.56%, lower than the 2.00% average relative abundance of the CCCA motif in normal samples. Finally, sequence fragments 120-180 bp in length are selected and processed with a fast Fourier transform algorithm, then compared against expression profile data of cell line and primary tissue reference samples to calculate correlations; rank difference analysis is performed and the references are sorted by rank difference from high to low. The most strongly correlated cell line is A549 (a lung cancer cell line), with a rank difference of 23.
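The length statistics and the 4-mer motif relative abundance can be sketched as below. This is an illustration only: it assumes the retained reads are already available as plain strings (the alignment and filtering with minimap2 and sambamba are not reproduced), and the function names are hypothetical. Treating the "peaks" as the most frequent read lengths is also an assumption about how the peaks in FIG. 4 were ranked.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def length_distribution(reads: Iterable[str]) -> Counter:
    """Count how many reads have each length."""
    return Counter(len(read) for read in reads)


def top_peaks(distribution: Counter, n: int = 4) -> List[Tuple[int, int]]:
    """Return the n most frequent (length, count) pairs, most frequent first."""
    return distribution.most_common(n)


def start_motif_abundance(reads: Iterable[str], k: int = 4) -> Dict[str, float]:
    """Relative abundance of the k-mer at the start of each read.

    Reads shorter than k are skipped; abundances sum to 1 over the rest.
    """
    motifs = Counter(read[:k].upper() for read in reads if len(read) >= k)
    total = sum(motifs.values())
    return {motif: count / total for motif, count in motifs.items()} if total else {}


if __name__ == "__main__":
    toy_reads = ["CCCAGT", "CCCATT", "ACGTAC", "ggga"]  # toy data, not from the example
    print(top_peaks(length_distribution(toy_reads), n=2))
    print(start_motif_abundance(toy_reads).get("CCCA"))
```

Multiplying the returned fraction by 100 gives a percentage that can be compared with a reference value for normal samples, such as the 2.00% cited for the CCCA motif.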

5. Judging from the results of step 4, all 4 targeted methylated genes of the sample are methylated, the relative abundance of the CCCA motif is lower than that of normal samples, the sample is strongly correlated with the lung cancer cell line A549, and the fragment length distribution conforms to that of ctDNA. This suggests that the sample has the fragment characteristics of ctDNA, and it is considered a potential lung cancer sample.

6. Furthermore, step 2 may further include establishing a second nanopore sequencing library, denoted Lib-B, that captures the targeted mutation hot spots. In step 3, Lib-B is sequenced on a nanopore sequencer such as the QNOME-3841 or MinION, and base recognition (basecalling) analysis is performed using the QNOME-3841 high-precision basecalling model and algorithm or the HAC model and algorithm of ONT, to obtain the sequencing base data (see Table 2), denoted Data-B.

TABLE 2 sequencing data information for samples

The specific steps for establishing the second nanopore sequencing library that captures the targeted mutation hot spots are as follows. A pre-library is constructed from the cfDNA using the VAHTS® Universal DNA Library Prep Kit for Illumina V3 (Vazyme), following the kit instructions. The pre-library is then subjected to panel hybrid capture, elution and PCR enrichment using the capture kit of the lung cancer 11-gene targeted hot-spot panel (panel size 18,452 bp; see Table 3) to prepare a capture library. Briefly: 500-1000 ng of the pre-library and 7.5 μl of Cot-1 DNA are concentrated with 1.8x VAHTS DNA Clean Beads and eluted with 17 μl of mix (2x Hybridization Buffer 9.5 μl, Universal Blockers-ILMN-TS (Du) 2 μl, Hybridization Enhancer 3 μl, panel 4.5 μl) at room temperature for 5 min; the mixture is placed in a PCR instrument at 95 °C for 30 s and then hybridized overnight at a 65 °C hold. 50 μl of MSB is rinsed twice with 150 μl of 1x Beads Wash Buffer, the supernatant is discarded, and the beads are resuspended in a new tube in 17 μl of 1x hybridization buffer (2x Hybridization Buffer 8.5 μl, Hybridization Enhancer 2.7 μl, NFW 5.8 μl). The beads are preheated on a PCR instrument at 65 °C, added to the tube containing the 17 μl hybridization mixture, mixed thoroughly by pipetting, and incubated at 65 °C for 45 min for capture. The captured product is washed sequentially with 100 μl of 1x Wash Buffer I once, 150 μl of 1x Wash Buffer S twice (65 °C), 150 μl of 1x Wash Buffer I once at room temperature, 150 μl of 1x Wash Buffer II once at room temperature, and 150 μl of 1x Wash Buffer III once at room temperature. After the wash with Wash Buffer III, the supernatant is discarded and the magnetic beads are resuspended in 20 μl of NFW.
PCR amplification is then performed with the DNA polymerase in the kit, using the resuspended magnetic beads as the template. The reaction system is: resuspended beads 20 μl, 2x PCR ReadyMix 25 μl, 10x PCR PrimerMix 5 μl. The reaction conditions are: 98 °C for 1 min; 23 cycles of 98 °C for 15 s, 60 °C for 30 s and 72 °C for 30 s; 72 °C for 1 min; hold at 4 °C. The PCR products are purified and recovered using 1.8x VAHTS DNA Clean Beads, followed by library quality control and quantification. A nanopore sequencing library is prepared from 300 fmol of Lib-B using a commercial library construction kit, QLK-V1.1.1 (Qitan Technology Ltd., Beijing) or SQK-LSK109 (Oxford Nanopore Technologies).

TABLE 3 Gene List and information about cancer targeting hotspot detection panel

7. For the sample judged in step 5 to be a potential lung cancer sample, targeted mutation site detection analysis was performed on Data-B, and the T790M mutation of the EGFR gene was found. The EGFR T790M mutation was medically interpreted according to the NCCN guidelines, databases of FDA-approved drug information and other clinical trial results; for example, the anti-tumor drugs suggested for reference for this sample include osimertinib (Tagrisso). Optionally, step 7 may further perform a length distribution analysis on the sequencing reads in Data-B carrying the hot-spot mutation; the length of the sequencing reads carrying the T790M mutation was 158 bp, which coincides with the fourth peak of the length distribution obtained in step 4.

8. The remaining samples were repeated through steps 1-7 and the results are shown in Table 4.

TABLE 4 statistical information of the analysis results of the samples of the examples

Example 2

This example is an exemplary nanopore sequencing data analysis device. As shown in FIG. 5, the nanopore sequencing data analysis device 100 includes a data acquisition module 110, a data storage module 120, a data processing module 130, and a data display module 140. The data acquisition module 110 acquires the sequencing data through communication with the internet or cloud 210 or with the nanopore sequencer 220, and the acquired data are stored in the data storage module 120, which also stores the reference signal segment set DSR. The data processing module 130 retrieves the data acquired by the data acquisition module 110 and performs fragment feature analysis and targeted site methylation detection based on the current signal Ion-A. The fragment feature analysis comprises base recognition analysis of Ion-A to obtain the sequencing base data Data-A. The targeted site methylation detection comprises cutting Ion-A in a sliding manner along the time direction with a specified step size to obtain a set DST composed of different current signal segments, performing similarity comparison analysis between each current signal segment in the set DST and the reference signal segment set DSR in the data storage module 120, and making a judgment according to the compared similarity. The judgment result is passed to the data display module 140, which displays the interpretation result obtained from the analysis by the data processing module.
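The data flow between modules 110, 120, 130 and 140 can be pictured with a toy sketch. All class names, method names and the `discriminate` callback are hypothetical and stand in for the real acquisition, analysis and display logic; the sketch shows only how the modules hand data to each other.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Signal = Sequence[float]


@dataclass
class DataStorageModule:                 # corresponds to module 120
    dsr: Dict[str, List[Signal]]         # reference set DSR, keyed by subset name
    acquired: List[Signal] = field(default_factory=list)


@dataclass
class DataAcquisitionModule:             # corresponds to module 110
    source: Callable[[], Signal]         # stands in for cloud 210 or sequencer 220

    def acquire(self, storage: DataStorageModule) -> Signal:
        signal = self.source()
        storage.acquired.append(signal)  # acquired data are kept in storage
        return signal


@dataclass
class DataProcessingModule:              # corresponds to module 130
    discriminate: Callable[[Signal, Dict[str, List[Signal]]], str]

    def run(self, storage: DataStorageModule) -> List[str]:
        # Placeholder for the sliding cut, comparison against DSR and judgment.
        return [self.discriminate(s, storage.dsr) for s in storage.acquired]


@dataclass
class DataDisplayModule:                 # corresponds to module 140
    shown: List[str] = field(default_factory=list)

    def show(self, results: List[str]) -> None:
        self.shown.extend(results)
        for result in results:
            print(result)
```

Passing the analysis in as a callable keeps the wiring separate from the signal processing, so the same skeleton could host any discrimination routine.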

While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Many modifications and variations may be made to the exemplary embodiments of the present description without departing from the scope or spirit of the present invention. The scope of the claims is to be accorded the broadest interpretation so as to encompass all modifications and equivalent structures and functions.

Claims

1. A method of nanopore sequencing data analysis, comprising:

acquiring current signal data of a biological sample obtained by nanopore sequencing, wherein the current signal data at least comprises a time sequence current signal Ion-A which comprises information of at least two dimensions of a transverse time dimension and a longitudinal signal intensity dimension;

carrying out base recognition analysis on the Ion-A to obtain sequencing Data-A, and analyzing the fragment characteristics based on the Data-A;

carrying out methylation detection on the target locus based on the Ion-A to obtain methylation information of the target locus; and

classifying the biological sample according to the fragment characteristics and the methylation information,

the methylation detection comprises the steps of cutting Ion-A in a sliding mode in a time direction in a specified step size to obtain a set DST composed of different current signal fragments, and carrying out similarity comparison analysis on each current signal fragment in the set DST and a reference signal fragment set DSR respectively, wherein the reference signal fragment set DSR comprises a methylated signal fragment subset and an unmethylated signal fragment subset;

the methylation detection further comprises methylation discrimination according to the compared similarity.

2. The method of nanopore sequencing data analysis according to claim 1, wherein said fragment characteristics comprise at least one of length distribution characteristics, motif characteristics, and tissue characteristics.

3. The method of claim 2, wherein analyzing the length distribution features comprises screening sequences in the sequencing data to retain sequencing reads that are uniquely aligned and not soft-clipped in a human reference genome, and performing length statistics on the retained reads and plotting their length distribution to obtain the length distribution features.

4. The method of nanopore sequencing data analysis according to claim 2, wherein said analysis of motif features comprises screening sequences in said sequencing data to retain sequencing reads that are uniquely aligned and not soft-clipped in a human reference genome, and counting the frequency or relative abundance of the motif formed by the first k-mer of each read, wherein 4 ≤ k ≤ 10, to obtain the motif features.

5. The method according to claim 2, wherein the analyzing of the tissue characteristics comprises screening sequences in the sequencing data to retain sequencing reads that are uniquely aligned and not soft-clipped in a human reference genome, selecting sequence fragments within a specified length range, performing comparative analysis and correlation calculation against expression profile data of reference samples of cell lines and primary tissues, and performing tissue tracing analysis to obtain the tissue characteristics.

6. The method of claim 1, wherein the performing methylation discrimination based on the aligned similarities comprises: if the number of results of comparison of each current signal segment in the set DST with the subset of methylated signal segments/the number of results of comparison with the subset of unmethylated signal segments is greater than 1, the targeted site is interpreted as methylated, and if the number of results of comparison of each current signal segment in the set DST with the subset of methylated signal segments/the number of results of comparison with the subset of unmethylated signal segments is less than 1, the targeted site is interpreted as unmethylated.

7. The method of claim 6, wherein the constructing the set of reference signal fragments DSR comprises synthesizing a first sequence fragment comprising a methylated targeting site and a second sequence fragment comprising an unmethylated targeting site, performing nanopore sequencing to obtain a first reference signal fragment corresponding to the first sequence fragment and a second reference signal fragment corresponding to the second sequence fragment, wherein a plurality of the first reference signal fragments form a subset of methylated signal fragments, and a plurality of the second reference signal fragments form a subset of unmethylated signal fragments.

8. A nanopore sequencing data analysis device, comprising:

b. a data processing module, configured to perform base recognition analysis on the Ion-A to obtain sequencing data Data-A, perform fragment feature analysis based on the Data-A, and perform targeted site methylation detection based on the Ion-A to obtain methylation information of the targeted site, so as to classify biological samples according to the fragment features and the methylation information, wherein the targeted site methylation detection includes cutting the Ion-A in a sliding manner in a time direction with a specified step size to obtain a set DST composed of different current signal fragments, and performing similarity comparison analysis on each current signal fragment in the set DST and a reference signal fragment set DSR, wherein the reference signal fragment set DSR includes a subset of methylated signal fragments and a subset of unmethylated signal fragments;

the targeted site methylation detection further comprises methylation discrimination based on the compared similarity.

9. The nanopore sequencing data analysis device of claim 8,

the nanopore sequencing data analysis device further comprises:

c. a data storage module to store at least the set of reference signal segments DSR.

10. The nanopore sequencing data analysis device of claim 9, wherein said methylation discrimination based on the compared similarities comprises: if the number of results of comparison between each current signal segment in the set DST and the subset of methylated signal segments/the number of results of comparison between each current signal segment in the set DST and the subset of unmethylated signal segments is greater than 1, the target site is interpreted as methylated, and if the number of results of comparison between each current signal segment in the set DST and the subset of methylated signal segments/the number of results of comparison between each current signal segment in the set DST and the subset of unmethylated signal segments is less than 1, the target site is interpreted as unmethylated.

11. The nanopore sequencing Data analysis device of claim 8, wherein the Data acquisition module is further configured to acquire sequencing Data-B of a mutation hotspot, and the Data processing module is further configured to perform targeted mutation site detection analysis on Data-B.

12. The apparatus according to claim 8, wherein the targeted mutation site detection analysis on Data-B comprises a length distribution analysis of the sequencing reads in Data-B carrying hot-spot mutations, and verification against the length distribution characteristics obtained by the fragment characteristic analysis.

13. A computer storage medium, characterized in that a computer program is stored therein, which computer program, when being executed by a computer, carries out the method of any one of claims 1-7.

14. A method of obtaining genetic information from a biological sample, comprising the steps of sequencing DNA in the biological sample using nanopore technology, and analysing the sequencing data using a method according to any one of claims 1 to 7.

15. The method of claim 14, wherein the biological sample is at least one selected from the group consisting of blood, saliva, and urine.