CN115713971A

CN115713971A - Method, system and terminal for selecting design strategy of target sequence capture probe of next generation sequencing

Info

Publication number: CN115713971A
Application number: CN202211193123.2A
Authority: CN
Inventors: 杨峰; 周艺华; 张博; 石涵; 洪跟东
Original assignee: Shanghai Ruijing Biotechnology Co ltd
Current assignee: Shanghai Ruijing Biotechnology Co ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-02-24
Anticipated expiration: 2042-09-28
Also published as: CN115713971B

Abstract

The invention provides a method, system and terminal for selecting a design strategy for target sequence capture probes in next generation sequencing. Based on a constructed probe capture region classification model, a probe capture region classification result is obtained for an input target sequence, and the corresponding probe design strategy is selected according to that result. Because the model predicts capture behavior from the sequence features of the target region, different regions can be grouped and given different, targeted probe design and laying strategies, which shortens the subsequent experimental optimization process and saves time. In practical application the method can also improve the overall performance of the capture detection panel, save research and development cost, and help ensure accurate and stable detection of clinical samples.

Description

Method, system and terminal for selecting design strategy of target sequence capture probe of next generation sequencing

Technical Field

The invention relates to the field of molecular detection in the biological industry, in particular to a method, a system and a terminal for selecting a design strategy of a targeting sequence capture probe for next generation sequencing.

Background

Commonly used molecular detection technologies currently include the biochip method, fluorescence quantitative PCR, ddPCR, first-generation sequencing, and NGS (high-throughput sequencing). Compared with other molecular detection technologies, NGS offers: high throughput (parallel sequencing of millions to billions of DNA molecules in a single run); broad coverage of variant types (biomarkers such as SNV, indel, CNV, fusion, MSI and TMB can be detected simultaneously); high sample utilization (a sample does not need to be divided into multiple aliquots, which matters especially for samples with low content); high detection sensitivity and specificity (slightly lower than ddPCR for some variant types, such as SNV and indel); the ability to detect unknown variants (which helps discover new variant sites); and relatively good cost performance (when averaged over a single sample, a single site or a single variant type). Compared with whole genome sequencing or whole exome sequencing, targeted sequencing based on probe capture technology has the advantages of better cost performance and a shorter delivery cycle.

The principle of probe capture sequencing is to design and synthesize specific probes, hybridize them with genomic DNA, capture and enrich the target region sequences, construct a library, and then perform high-throughput sequencing on a sequencer followed by result analysis. The most critical link in probe capture sequencing is the probe design and laying strategy. The complex diversity of genome sequence features, including uneven GC content, repetitive sequences and palindromic structures, can affect probe hybridization and therefore the capture efficiency of the probes and the coverage depth of genomic regions. Current probe design and laying approaches fall mainly into three types: 1) tiled probes; 2) overlapping (imbricated) probes; 3) double-stranded probes designed on both the positive and negative strands. Existing strategies generally apply a uniform, fixed probe coverage of 1 to 3 layers, so that certain regions with special features are not captured well; the number of probe layers in those regions is then adjusted according to the library construction and sequencing results until good product performance is reached. Although this strategy can eventually give relatively good performance, it is driven by experimental results, so researchers spend a great deal of time running optimization experiments to guide the redesign of the probe layout, and efficiency is low. If the target regions likely to show low probe capture efficiency could be predicted in advance, researchers could adjust the probe design and laying strategy early according to the prediction, shortening the experimental optimization time and greatly improving panel optimization efficiency.

Although studies have shown that GC content is an important factor in probe hybridization capture efficiency, other unknown factors may still influence the process because of the complex diversity of gene sequences. How to comprehensively evaluate the influence of target sequence diversity on probe capture efficiency, so as to better guide upstream probe design and laying, requires a more scientific and reasonable evaluation approach, and is a technical problem to be solved in the field.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, the present invention provides a method, system and terminal for selecting a design strategy of a target sequence capture probe for next generation sequencing, which are used to solve the above technical problems in the prior art.

To achieve the above and other related objects, the present invention provides a method for selecting a design strategy of a target sequence capture probe for next generation sequencing, the method comprising: acquiring a target sequence to be subjected to probe design strategy selection; based on the constructed probe capture region classification model, obtaining a corresponding probe capture region classification result according to an input target sequence; wherein the types of the probe capture region classification results include: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region; and selecting a corresponding probe design strategy according to the classification result of the probe capture area.

In an embodiment of the present invention, the method for constructing the probe capture region classification model includes: selecting a plurality of target sample regions covered by probes at a uniform, fixed number of layers; based on a region capture efficiency judgment rule, dividing the probe-covered target sample regions into a high-efficiency sample region group corresponding to the probe capture high-efficiency region and a low-efficiency sample region group corresponding to the probe capture low-efficiency region, wherein the high-efficiency sample region group comprises a plurality of high-efficiency sample regions and the low-efficiency sample region group comprises a plurality of low-efficiency sample regions; extracting the target region sequence features of each sample region in the high-efficiency sample region group and the low-efficiency sample region group to obtain a feature training matrix; and training with the feature training matrix to obtain the probe capture region classification model.

In an embodiment of the present invention, extracting the target region sequence features of each high-efficiency sample region and each low-efficiency sample region to obtain the feature training matrix includes: performing unordered k-mer traversal on each high-efficiency sample region and each low-efficiency sample region to obtain, for each sample region, feature data corresponding to each of a plurality of k values; and screening the feature data according to the verification results of models trained on the feature data of each k value, and taking the feature data of the selected k value for each sample region as the target region sequence features of that sample region, to obtain the feature training matrix.

In an embodiment of the invention, the verification result includes: recall, precision, accuracy, and F1 score.

In an embodiment of the invention, the probe capture region classification model is obtained by optimizing through a ten-fold cross validation method.

In an embodiment of the invention, the selecting the corresponding probe design strategy according to the classification result of the probe capture region includes: if the probe capture area classification result is the probe capture high-efficiency result, selecting a conventional probe laying strategy; and if the probe capture area classification result is the probe capture low-efficiency result, selecting a multiple-difference probe laying strategy.

In an embodiment of the invention, the conventional probe laying strategy is a 3-fold probe coverage strategy; the multiple difference probe laying strategy is a 5-fold probe covering strategy.

In order to achieve the above objects and other related objects, the present invention provides a second generation sequencing targeted sequence capture probe design strategy selection system, which comprises a target sequence acquisition module for acquiring a target sequence to be subjected to probe design strategy selection; the probe capture region classification module is connected with the target sequence acquisition module and is used for acquiring a corresponding probe capture region classification result according to an input target sequence based on the constructed probe capture region classification model; wherein the types of the probe capture region classification result comprise: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region; and the strategy selection module is connected with the probe capture region classification module and is used for selecting the corresponding probe design strategy according to the probe capture region classification result.

To achieve the above and other related objects, the present invention provides a terminal for selecting a strategy for designing a second generation sequencing targeting sequence capture probe, comprising: one or more memories and one or more processors; the one or more memories for storing a computer program; the one or more processors, coupled to the memory, are configured to execute the computer program to perform a targeted sequence capture probe design strategy selection method for the next generation sequencing.

As described above, the method, system and terminal of the present invention for selecting a design strategy of a target sequence capture probe for next generation sequencing have the following advantages: based on the constructed probe capture region classification model, the invention obtains the probe capture region classification result corresponding to the input target sequence, and selects the corresponding probe design strategy according to that result. Because the model predicts capture behavior from the sequence features of the target region, different regions can be grouped and given different, targeted probe design and laying strategies, which shortens the subsequent experimental optimization process and saves time; the method can also effectively improve the overall performance of the capture detection panel in practical application, save research and development cost, and help ensure accurate and stable detection of clinical samples.

Drawings

FIG. 1 is a schematic flow chart of a strategy selection method for designing a targeting sequence capture probe for next generation sequencing according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating a method for selecting a probe design strategy according to an embodiment of the invention.

FIG. 3 is a diagram illustrating the statistical difference between the probe capture efficiency of two sets of probe design solutions for 10 samples according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of the structure of a strategy selection system for designing a second generation sequencing-targeted capture probe according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of the structure of the selection terminal of the design strategy of the second generation sequencing target sequence capture probe in one embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It is noted that in the following description, reference is made to the accompanying drawings which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "over," and the like, may be used herein to facilitate describing one element or feature's relationship to another element or feature as illustrated in the figures.

Throughout the specification, when a certain portion is referred to as being "connected" to another portion, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a certain part is referred to as "including" a certain component, unless otherwise stated, other components are not excluded, but it means that other components may be included.

The terms first, second, third, etc. are used herein to describe various elements, components, regions, layers and/or sections, but are not limited thereto. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the scope of the present invention.

Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition will occur only when a combination of elements, functions or operations are inherently mutually exclusive in some way.

The invention provides a method, a system and a terminal for selecting a design strategy of a target sequence capture probe of next generation sequencing. According to the invention, through the prediction of the sequence characteristics of the target region by the model, different regions can be grouped, and different probe design and laying strategies are adopted in a targeted manner, so that the optimization process of a subsequent experiment can be shortened, and the time cost is saved; the method can also effectively improve the overall performance of the capture detection panel in practical application, save research and development cost and ensure accurate and stable detection of clinical samples.

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that those skilled in the art can easily practice the invention. The present invention may be embodied in many different forms and is not limited to the embodiments described herein.

FIG. 1 is a schematic flow chart of a strategy selection method for designing a second-generation sequencing target sequence capture probe according to an embodiment of the present invention.

The method comprises the following steps:

Step S11: acquiring a target sequence for which a probe design strategy is to be selected.

Step S12: based on the constructed probe capture region classification model, obtaining the corresponding probe capture region classification result according to the input target sequence.

In detail, the types of the probe capture region classification result include: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region.

Specifically, the target sequence is input into the classification model of the probe capture region to obtain a high efficiency result of probe capture corresponding to the high efficiency region of probe capture or a low efficiency result of probe capture corresponding to the low efficiency region of probe capture.

In one embodiment, the probe capture region classification model is constructed as follows: selecting a plurality of target sample regions covered by probes at a uniform, fixed number of layers; based on a region capture efficiency judgment rule, dividing the probe-covered target sample regions into a high-efficiency sample region group corresponding to the probe capture high-efficiency region and a low-efficiency sample region group corresponding to the probe capture low-efficiency region, wherein the high-efficiency sample region group comprises a plurality of high-efficiency sample regions and the low-efficiency sample region group comprises a plurality of low-efficiency sample regions; extracting the target region sequence features of each sample region in the high-efficiency sample region group and the low-efficiency sample region group to obtain a feature training matrix; and training with the feature training matrix to obtain the probe capture region classification model.

It should be noted that the target region sequence features of each high-efficiency sample region are labeled with a probe capture high-efficiency region label, and the target region sequence features of each low-efficiency sample region are labeled with a probe capture low-efficiency region label.

In summary, the scheme performs machine learning on the sequence features of target regions with different probe coverage depths and constructs a classifier using the Support Vector Machine (SVM) method. Given the two groups of labeled sample vectors, the SVM finds an optimal separating hyperplane that places the two groups on opposite sides while making the distance from the hyperplane to the nearest vectors in each group (the so-called support vectors) as large as possible. The classification model established in this way classifies a target sequence as either a probe capture high-efficiency region or a probe capture low-efficiency region, and different probe design and laying strategies are applied to the two groups, so that performance optimization of the panel can be assisted more efficiently.
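For reference, the maximum-margin idea described above can be written in the textbook soft-margin form shown below. This is the standard linear SVM formulation, not a formula stated in the source; the symbols are notation introduced here (x_i is the k-mer feature vector of sample region i, y_i is +1 or -1 for the two efficiency groups, w and b define the hyperplane, xi_i are slack variables, and C is a penalty parameter). The source does not specify the kernel or the value of C.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m

\hat{y}(x) = \operatorname{sign}\!\left(w^{\top}x + b\right),
\qquad \text{margin width} = \frac{2}{\lVert w\rVert}
```

Minimizing the norm of w maximizes the margin width, which corresponds to making the distance from the hyperplane to the support vectors as large as possible.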

In an embodiment, the area capture efficiency determination rule includes: when the sequencing depth of the probe covering the target sample region is not less than the difference value between the average sequencing depth and the standard deviation, taking the probe covering the target sample region as a high-efficiency sample region; and when the sequencing depth of the probe covering the target sample region is less than the difference between the average sequencing depth and the standard deviation, taking the probe covering the target sample region as the low-efficiency sample region.

For example, 558 target regions covered by probes at a uniform, fixed number of layers are selected. Regions with a sequencing depth greater than or equal to mean - 1 SD (mean: average sequencing depth; SD: standard deviation) are judged to be regions of high probe capture efficiency, giving 472 such regions in total; regions with a sequencing depth less than mean - 1 SD are judged to be regions of low probe capture efficiency, giving 87 such regions in total. The 558 regions are thus divided into two groups, high probe capture efficiency and low probe capture efficiency. 70% of the 558 regions can be used as the training set of the model, and the remaining 30% as the test set.
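The judgment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the source's implementation: the function name `split_regions_by_depth`, the region names and the depth values are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
from statistics import mean, pstdev


def split_regions_by_depth(depths):
    """Label each region 'high' or 'low' using the mean - 1 SD rule."""
    values = list(depths.values())
    # Assumption: population standard deviation; the source does not specify.
    threshold = mean(values) - pstdev(values)
    labels = {}
    for region, depth in depths.items():
        # Depth >= mean - 1 SD -> high capture efficiency, otherwise low.
        labels[region] = "high" if depth >= threshold else "low"
    return threshold, labels


# Hypothetical per-region sequencing depths (illustrative values only).
example_depths = {
    "exon_1": 500.0,
    "exon_2": 520.0,
    "exon_3": 480.0,
    "exon_4": 510.0,
    "exon_5": 150.0,
}
threshold, labels = split_regions_by_depth(example_depths)
```

With these illustrative values only `exon_5` falls below the threshold and is labeled as a low-efficiency region.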

In a specific embodiment, extracting the target region sequence features of each high-efficiency sample region and each low-efficiency sample region to obtain the feature training matrix comprises: performing unordered k-mer traversal on each high-efficiency sample region and each low-efficiency sample region to obtain, for each sample region, feature data corresponding to each of a plurality of k values; and screening the feature data according to the verification results of models trained on the feature data of each k value, and taking the feature data of the selected k value for each sample region as the target region sequence features of that sample region, to form the feature training matrix.

Specifically, unordered k-mer traversal is performed on each high-efficiency sample region and each low-efficiency sample region. Each k-mer (each base combination of length k) is taken as a feature, the number of k-mer types is recorded as n, and the content of the k-mer in the region is taken as the value of that feature. Each region is thus converted into a 1 × n array, and the final input data is a matrix of size (number of sample regions) × n. The feature data are then screened according to the verification results of models trained on the feature data of each k value to select one k value, and the feature data of the selected k value for each sample region are taken as the target region sequence features of that sample region, giving the feature training matrix.
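A minimal sketch of this feature extraction follows. Several points are assumptions rather than details from the source: "content" is interpreted as the relative frequency of each k-mer within the region, the feature set is taken to be all 4^k combinations of A, C, G and T, and k-mers containing other characters (such as N) are ignored. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_vocabulary(k):
    """All 4**k base combinations of length k, in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_content(sequence, k, vocabulary):
    """Convert one region sequence into a 1 x n list of k-mer frequencies."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers in the vocabulary are counted (assumption: skip N and others).
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def feature_matrix(sequences, k):
    """Stack per-region vectors into a (number of regions) x n matrix."""
    vocabulary = kmer_vocabulary(k)
    return [kmer_content(seq, k, vocabulary) for seq in sequences]


# Two short hypothetical region sequences, with k = 3 so that n = 64.
matrix = feature_matrix(["ACGTACGT", "GGGGCCCC"], k=3)
```

For the 7-mer features selected in the source, the same code would give n = 4^7 = 16384 columns per region.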

In a specific embodiment, the verification result includes: recall, precision, accuracy, and F1 score. That is, the k value with the best verification result is selected.
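The four metrics can be computed from true and predicted labels as in the sketch below. Treating the low-efficiency class as the positive class is an assumption made for illustration; the source does not state which class is positive, and the function name and example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="low"):
    """Return recall, precision, accuracy and F1 for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Hypothetical labels for six regions (illustrative only).
true_labels = ["low", "low", "high", "high", "high", "low"]
pred_labels = ["low", "high", "high", "high", "low", "low"]
metrics = classification_metrics(true_labels, pred_labels)
```

Computing the same dictionary for the model trained at each k value gives a like-for-like basis for choosing k.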

For example, the k value is optimized by traversing different k values (3-mer, 5-mer, 7-mer, 9-mer, 10-mer) and comparing the verification results of the models trained on the feature data of each k value, as follows:

Table 1: Verification results of the models trained on the feature data of each k-mer

According to the verification results, the support vector machine model based on 7-mer features is finally selected.

In one embodiment, the probe capture region classification model is optimized using ten-fold cross validation (10-fold cross-validation), a commonly used method for testing the accuracy of an algorithm. The data set is divided into ten parts; in turn, 9 parts are used as training data and 1 part as test data. Each trial yields a corresponding accuracy (or error rate), and the average of the 10 accuracies (or error rates) is used as the estimate of the accuracy of the algorithm. Generally, 10-fold cross validation is repeated multiple times (for example, 10 rounds of 10-fold cross validation), and the average over the rounds is taken as the estimate of the accuracy of the algorithm.
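The procedure can be sketched with the standard library alone. In this illustration the learner is passed in as a pair of `fit` and `predict` callables; the majority-label learner used in the example is only a placeholder and is not the SVM of the source. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    if n_samples < n_folds:
        raise ValueError("need at least one sample per fold")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(samples, labels, fit, predict, n_folds=10, seed=0):
    """Return the mean held-out accuracy over n_folds train/test rounds."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Placeholder learner (NOT the source's SVM): always predicts the majority label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


toy_samples = list(range(20))
toy_labels = ["high"] * 16 + ["low"] * 4
mean_accuracy = cross_validate(toy_samples, toy_labels,
                               fit_majority, predict_majority)
```

Repeating the call with different `seed` values and averaging the results corresponds to the repeated rounds of 10-fold cross validation mentioned above.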

Step S13: selecting the corresponding probe design strategy according to the probe capture region classification result.

In one embodiment, selecting the corresponding probe design strategy according to the probe capture region classification result comprises: if the probe capture region classification result is the probe capture high-efficiency result, selecting a conventional probe laying strategy; and if the probe capture region classification result is the probe capture low-efficiency result, selecting a multiple-difference probe laying strategy.

The overall idea of the scheme is shown in FIG. 2. A target region is input into the probe capture region classification model for prediction, and the target regions are divided into two groups according to the prediction results: regions predicted to have high probe capture efficiency use the conventional probe laying strategy, and regions predicted to have low probe capture efficiency use the multiple-difference probe laying strategy.

Preferably, the conventional probe laying strategy is a 3-fold probe coverage strategy; the multiple difference probe laying strategy is a 5-fold probe covering strategy.
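The selection step reduces to a lookup from classification result to coverage fold, as in the sketch below. The 3-fold and 5-fold values come from the preferred embodiment above; the function names, the "high"/"low" label strings and the region names are hypothetical.

```python
CONVENTIONAL_FOLD = 3   # conventional probe laying strategy (from the text)
DIFFERENTIAL_FOLD = 5   # multiple-difference probe laying strategy (from the text)


def select_probe_strategy(classification):
    """Map a classification result ('high' or 'low') to a laying strategy."""
    if classification == "high":
        return {"strategy": "conventional", "coverage_fold": CONVENTIONAL_FOLD}
    if classification == "low":
        return {"strategy": "multiple-difference",
                "coverage_fold": DIFFERENTIAL_FOLD}
    raise ValueError(f"unknown classification result: {classification!r}")


def plan_panel(region_labels):
    """Apply the selection to every region of a panel."""
    return {region: select_probe_strategy(label)
            for region, label in region_labels.items()}


# Hypothetical classification results for three regions.
plan = plan_panel({"exon_1": "high", "exon_2": "low", "exon_3": "high"})
```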

To better illustrate the strategy selection method for the design of the target sequence capture probe in the second generation sequencing, the present invention provides the following specific examples.

Example 1: a probe design and placement strategy.

Using the probe design scheme described above, all exon regions of 40 tumor-related genes are selected as target regions and input into the prediction model. The target regions are divided into two groups according to the prediction results: regions predicted to have high probe capture efficiency use the conventional 3-fold probe coverage strategy, while regions predicted to have low probe capture efficiency use a 5-fold probe coverage strategy, in which the 2 additional layers use probes designed on the complementary strand and the overlap with probes of adjacent regions is increased. As a control group, all regions uniformly use a fixed 3-fold probe coverage. All probes are 120 bp single-stranded RNA probes, and the same blood sample is used in the experiments to compare the performance of the two design strategies.

Example 2: Detection of DNA variation in a blood tissue sample.

The detection technical scheme mainly comprises the following implementation steps:

1. Extracting genomic DNA (gDNA); the sample type is human peripheral blood, and the blood sample volume should be no less than 200 μL. Nucleic acid extraction is performed with reference to the kit instructions (DP 304-Tiangen blood/cell/tissue genome DNA extraction kit). The concentration of the extracted gDNA is measured with the Qubit™ DNA HS Assay Kit and its matching instrument, and the total amount of extracted DNA should be no less than 50 ng.

2. Constructing a pre-library; this process converts gDNA into a library specific to the Illumina high-throughput sequencing platform. The main process follows the kit instructions (ND 627-Universal Plus DNA Library Prep Kit for Illumina V2). The kit combines DNA fragmentation, end repair and dA-tailing into one step; the product does not need to be purified, and adapter ligation, fragment size selection and library enrichment are performed directly. The final library structure contains the UDI (Unique Dual Index) necessary for data demultiplexing, which reduces the index hopping (adaptor contamination) problems introduced during sequencing at the data analysis stage. The yield of the pre-library should be no less than 500 ng.

3. Performing hybridization capture;

1) 500 ng of each pre-library sample is taken, and 1-6 pre-libraries are pooled in equal amounts in a new centrifuge tube. The tube is placed in a vacuum concentrator and dried for 5-30 min; after drying, 5 μL of Nuclease-free Water is added, and the tube is shaken thoroughly to mix, centrifuged, and left to stand for later use.

2) A new centrifuge tube is placed on a room-temperature tube rack to prepare the hybridization mixture. After the reagents are added in sequence, the tube is inverted 2 times, vortexed for 2 seconds to mix, and briefly centrifuged for 5 seconds to bring the liquid on the tube wall to the bottom. One or more hybridization mixtures are prepared and kept at room temperature for later use.

3) 31 μL of the prepared hybridization mixture per reaction is added to the 5 μL of pre-library from 1) and mixed by gentle pipetting. The pre-library hybridization mixture is placed on a PCR instrument for the hybridization reaction.

4) Streptavidin magnetic beads equilibrated at room temperature for 30 min are shaken thoroughly to mix, and 50 μL is dispensed into each centrifuge tube. 150 μL of wash buffer I is then added to each tube and mixed, the tube is placed on a magnetic rack, and the supernatant is removed. The wash with wash buffer I is repeated for 3 washes in total, leaving only the magnetic beads at the bottom of the tube. Finally, 150 μL of wash buffer I is added to each tube to resuspend the beads, and the tube is labeled with the sample information.

5) After the pre-library hybridization mixture has been incubated at 65 ℃ for 16 hours, the PCR tube is opened on the PCR instrument, 150 μL of the magnetic bead suspension from 4) is added to the hybridization mixture with a pipette, and the mixture is gently pipetted 8 times. The mixture of magnetic beads and hybridization system is then transferred back to the labeled magnetic bead centrifuge tube from 4). The tube is fixed on a vertical rotator and incubated with rotation at room temperature for 30 minutes.

6) After 30 minutes, the tube is taken off the rotator or oscillating metal bath, briefly centrifuged for 5 seconds, and placed on the magnetic rack for 2-5 minutes until the liquid is clear, and the supernatant is aspirated and discarded. The tube is removed from the magnetic rack, 150 μL of wash buffer II is added to each tube to resuspend the beads, and the beads are gently pipetted 10 times and incubated at room temperature for 15 minutes, with the tube quickly inverted 15 times every 5 minutes. After the room-temperature incubation in wash buffer II, the tube is briefly centrifuged for 5 seconds and placed on the magnetic rack until the liquid is clear, and the supernatant is aspirated and discarded. The tube is then placed on the PCR instrument at 65 ℃, 150 μL of wash buffer III preheated to 65 ℃ is added immediately with a pipette, the beads are mixed thoroughly by pipetting 10 times on the PCR instrument, and the tube is capped and incubated at 65 ℃ for 10 minutes. After the 10-minute incubation at 65 ℃, the tube is moved from the PCR instrument to the magnetic rack, and the supernatant is aspirated once the liquid is clear. The wash with wash buffer III is repeated for 3 washes in total.

7) After the 3 washes, the tube is briefly centrifuged for 10 seconds and placed on the magnetic rack again, and as much residual liquid as possible is removed with a 10 μL pipette. Finally, 20 μL of Nuclease-free Water is added to resuspend the beads, the suspension is gently pipetted 8 times, and the entire bead suspension is transferred into the prepared Post-PCR reaction system. The mixture is gently pipetted 8 more times so that the beads and the PCR reaction system are mixed evenly, and is then placed on a PCR instrument for the PCR amplification reaction.

8) After the PCR is completed, the PCR product is purified with 0.8× (40 μL) purification magnetic beads and finally eluted with 20 μL of Low TE. After concentration measurement, the total amount of the final library should be no less than 20 ng, and the main peak of the library observed in Agilent Bioanalyzer 2100 or Labchip quality inspection should be at about 250-400 bp.

4. On-machine sequencing and data analysis; 1) Sequencing data preprocessing and quality control: the final library is subjected to PE150 sequencing on an Illumina sequencing platform, with no less than 500 Mb of sequencing data per sample. After the sequencing output files (BCL format) are obtained, they are converted into sequence files (FASTQ format) using bcl2fastq v2.17.1.14. Trimmomatic (v0.36) is then used to remove the adapter sequences and low-quality bases introduced during library construction and to filter out reads shorter than 50 bp. The sequence alignment module uses bwa (v0.7.10) to align the sequences in the filtered FASTQ files to the hg19 human reference genome and the fused reference genome, generates the corresponding BAM files, and sorts them by genomic coordinate. The average sequencing depth within each target region (i.e., each exon region) is then calculated. 2) Evaluating the capture efficiency and overall performance of the two probe design schemes: 10 samples are selected, and the difference in probe capture efficiency between the two probe design schemes is calculated for each, with the raw data volume normalized to 200 Mb. With the new probe design scheme, the sequencing depth across the regions of each sample is more uniform, and the sequencing coverage depths of the two groups of regions (predicted high and predicted low probe capture efficiency) are close to each other. The coverage depth and other performance indicators achieved by the probe design strategy provided by the invention are therefore clearly superior to those of the conventional non-differential probe laying strategy, which facilitates stable and accurate detection of clinical variants.
In addition, the probe design strategy provided by the invention can greatly shorten the experimental optimization process during panel performance development and can significantly improve product research and development efficiency.
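The depth statistics described in step 4 can be illustrated with a minimal Python sketch. It is not the pipeline used in the embodiment: the function names, the toy per-base depth track, the linear scaling used for the 200 Mb normalization, and the use of the coefficient of variation as a uniformity measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def mean_region_depth(per_base_depth, regions):
    """Return {region_name: mean depth} for half-open [start, end) intervals."""
    return {name: mean(per_base_depth[start:end]) for name, start, end in regions}

def normalize_depth(depth, sample_yield_mb, target_yield_mb=200.0):
    """Scale a depth as if the sample had yielded target_yield_mb of data.

    Assumption: depth scales linearly with data volume.
    """
    return depth * target_yield_mb / sample_yield_mb

def coefficient_of_variation(values):
    """Population SD divided by the mean; lower values mean more uniform depth."""
    values = list(values)
    return pstdev(values) / mean(values)

# Toy per-base depth track covering three hypothetical exon regions.
per_base = [100] * 10 + [40] * 10 + [80] * 10
regions = [("exon1", 0, 10), ("exon2", 10, 20), ("exon3", 20, 30)]

depths = mean_region_depth(per_base, regions)
normalized = {name: normalize_depth(d, sample_yield_mb=400.0)
              for name, d in depths.items()}
cv = coefficient_of_variation(normalized.values())
print(normalized, round(cv, 3))
```

Comparing the coefficient of variation between two probe design schemes on the same samples is one simple way to express "more uniform sequencing depth" as a number.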

Based on principles similar to those of the above embodiments, the present invention provides a next-generation sequencing target sequence capture probe design strategy selection system.

Specific embodiments are provided below in conjunction with the attached figures:

FIG. 4 is a schematic structural diagram of the next-generation sequencing target sequence capture probe design strategy selection system in an embodiment of the present invention.

The system comprises:

a target sequence obtaining module 41, configured to obtain a target sequence to be subjected to probe design strategy selection;

the probe capture region classification module 42 is connected to the target sequence acquisition module 41, and is configured to obtain a corresponding probe capture region classification result according to an input target sequence based on the constructed probe capture region classification model; wherein the types of the probe capture region classification results include: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region;

and the strategy selection module 43 is connected with the probe capture region classification module 42 and is used for selecting the corresponding probe design strategy according to the probe capture region classification result.

It should be noted that the division of the modules in the system embodiment of FIG. 4 is only a division by logical function; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separate. These modules may all be implemented as software invoked by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the others in hardware.
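As an illustration of the software form of this logical division, the three modules of FIG. 4 can be sketched as small Python classes. This is only a sketch under stated assumptions: the class and function names are hypothetical, the trained classification model is represented by an arbitrary callable, and `always_low` is a placeholder stand-in, not the model of the invention.

```python
from typing import Callable

HIGH = "probe capture high efficiency"
LOW = "probe capture low efficiency"

class TargetSequenceAcquisitionModule:
    """Module 41: obtains the target sequence (here: strips whitespace, uppercases)."""
    def acquire(self, raw: str) -> str:
        return "".join(raw.split()).upper()

class CaptureRegionClassificationModule:
    """Module 42: wraps a constructed classification model (any callable)."""
    def __init__(self, model: Callable[[str], str]):
        self.model = model

    def classify(self, sequence: str) -> str:
        result = self.model(sequence)
        if result not in (HIGH, LOW):
            raise ValueError(f"unexpected classification result: {result!r}")
        return result

class StrategySelectionModule:
    """Module 43: maps the classification result to a probe design strategy."""
    STRATEGIES = {
        HIGH: "conventional probe laying strategy",
        LOW: "multiple differential probe laying strategy",
    }

    def select(self, result: str) -> str:
        return self.STRATEGIES[result]

def run_system(raw: str, model: Callable[[str], str]) -> str:
    """Chain modules 41 -> 42 -> 43, mirroring the connections in FIG. 4."""
    sequence = TargetSequenceAcquisitionModule().acquire(raw)
    result = CaptureRegionClassificationModule(model).classify(sequence)
    return StrategySelectionModule().select(result)

# Placeholder model (assumption): labels every sequence as low efficiency.
always_low = lambda seq: LOW
print(run_system("acgt acgt\nggcc", always_low))
```

Passing the model in as a callable keeps the classification module independent of how the model was trained, which matches the note that the module boundaries are logical rather than physical.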

Since the implementation principle of the strategy selection system for the target sequence capture probe design in the next generation sequencing is described in the foregoing embodiments, the repeated description is omitted here.

Optionally, the construction method of the probe capture region classification model includes: selecting a plurality of target sample regions covered by probes laid at a fixed number of layers without differentiation; based on the region capture efficiency judgment rule, dividing the probe-covered target sample regions into a high-efficiency sample region group corresponding to the probe capture high-efficiency region and a low-efficiency sample region group corresponding to the probe capture low-efficiency region; wherein the high-efficiency sample region group comprises a plurality of high-efficiency sample regions, and the low-efficiency sample region group comprises a plurality of low-efficiency sample regions; extracting target region sequence features of each sample region in the high-efficiency sample region group and the low-efficiency sample region group to obtain a feature training matrix; and training with the feature training matrix to obtain the probe capture region classification model.

Optionally, the region capture efficiency judgment rule includes: when the sequencing depth of a probe-covered target sample region is not less than the average sequencing depth minus the standard deviation, taking that region as a high-efficiency sample region; and when the sequencing depth of a probe-covered target sample region is less than the average sequencing depth minus the standard deviation, taking that region as a low-efficiency sample region.
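The judgment rule above amounts to a single threshold at the mean depth minus one standard deviation. A minimal Python sketch follows; the function name, the toy depths, and the choice of the population standard deviation are assumptions, since the text does not specify which standard deviation estimator is used.

```python
from statistics import mean, pstdev

def label_regions(region_depths):
    """Label each region 'high' or 'low' using the threshold mean - SD.

    Assumption: the population standard deviation is used.
    """
    depths = list(region_depths.values())
    threshold = mean(depths) - pstdev(depths)
    return {
        name: "high" if depth >= threshold else "low"
        for name, depth in region_depths.items()
    }

# Toy depths: mean = 80, population SD is about 35.4, threshold is about 44.6.
toy_depths = {"A": 100, "B": 110, "C": 90, "D": 20}
labels = label_regions(toy_depths)
print(labels)
```

With these toy values only region D falls below the threshold and is placed in the low-efficiency sample region group.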

Optionally, the extracting of the target region sequence features of each high-efficiency sample region and each low-efficiency sample region to obtain a feature training matrix includes: performing unordered k-mer traversal on each high-efficiency sample region and each low-efficiency sample region to obtain, for each sample region, feature data corresponding to each of a plurality of k-mers; and screening the feature data according to the verification results of the models trained on the feature data of each k-mer, and taking the feature data of the selected k-mer for each sample region as the target region sequence features of that sample region to obtain the feature training matrix.
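One possible reading of the k-mer feature extraction is a position-independent count of every k-mer in each sample region, computed for several values of k. The Python sketch below follows that reading; the function names, the A/C/G/T alphabet restriction, and the toy sequences are assumptions, and the screening step (training a model per k and comparing verification results) is not shown.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_vocabulary(k):
    """All 4**k possible k-mers in a fixed lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]

def kmer_counts(sequence, k):
    """Count k-mers in a sequence regardless of where they occur."""
    vocab = kmer_vocabulary(k)
    counts = dict.fromkeys(vocab, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    return [counts[v] for v in vocab]

def feature_matrix(sequences, k):
    """One row of k-mer counts per sample region."""
    return [kmer_counts(s, k) for s in sequences]

# Build one candidate matrix per k; a later screening step would pick one k.
toy_regions = ["ACGTACGT", "GGGGCCCC"]
candidates = {k: feature_matrix(toy_regions, k) for k in (1, 2, 3)}
print(candidates[1])
```

Each candidate matrix has 4**k columns, so larger k gives sparser features; this is one reason a screening step over k is useful.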

Optionally, the verification result includes: recall, precision, accuracy, and F1 score.
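For reference, the four verification metrics can be computed from the confusion counts as in the Python sketch below. The function name and the toy labels are illustrative, and treating the low-efficiency class as the positive class is an assumption; the text does not state which class is positive.

```python
def classification_metrics(y_true, y_pred, positive="low"):
    """Return recall, precision, accuracy, and F1 for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}

y_true = ["low", "low", "high", "high", "high"]
y_pred = ["low", "high", "high", "high", "low"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```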

Optionally, the probe capture region classification model is obtained by optimizing by using a ten-fold cross validation method.
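Ten-fold cross validation partitions the samples into ten folds and uses each fold once as the held-out set while training on the other nine. The Python sketch below shows only the index splitting; the function name, the shuffling seed, and the sample count are assumptions, and model training is left out.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(25))
for train, test in splits:
    # A model would be fitted on `train` and scored on `test` here.
    assert not set(train) & set(test)
print(len(splits), [len(test) for _, test in splits])
```

Averaging the verification metrics over the ten held-out folds gives a less optimistic estimate than scoring on the training data.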

Optionally, the selecting of the corresponding probe design strategy according to the probe capture region classification result includes: if the probe capture region classification result is the probe capture high-efficiency result, selecting a conventional probe laying strategy; and if the probe capture region classification result is the probe capture low-efficiency result, selecting a multiple differential probe laying strategy.

Optionally, the conventional probe laying strategy is a 3-fold probe coverage strategy; the multiple differential probe laying strategy is a 5-fold probe coverage strategy.
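To make the difference between the two coverage strategies concrete, the Python sketch below tiles overlapping probes across a region so that interior bases are covered by roughly the chosen number of probes. It is an illustration only: the 120 nt probe length, the evenly spaced tiling scheme, the coordinate convention, and all names are assumptions not stated in the text.

```python
# Fold coverage per classification result, as stated in the text.
FOLD_BY_RESULT = {"high": 3, "low": 5}

def tile_probes(region_start, region_end, fold, probe_length=120):
    """Return half-open (start, end) probe intervals tiled at `fold` coverage.

    Assumption: probes are offset by probe_length // fold, so each interior
    base is overlapped by about `fold` probes.
    """
    step = probe_length // fold
    last_start = max(region_start, region_end - probe_length)
    starts = list(range(region_start, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the region end is covered
    return [(s, s + probe_length) for s in starts]

def design_probes(region_start, region_end, classification_result):
    """Pick the fold from the classification result, then tile the region."""
    fold = FOLD_BY_RESULT[classification_result]
    return tile_probes(region_start, region_end, fold)

conventional = design_probes(0, 360, "high")   # 3-fold coverage
differential = design_probes(0, 360, "low")    # 5-fold coverage
print(len(conventional), len(differential))
```

Under these assumptions the same 360 bp region receives 7 probes at 3-fold coverage and 11 probes at 5-fold coverage, which shows how the low-efficiency regions get denser probe laying.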

FIG. 5 shows a schematic structural diagram of the next-generation sequencing target sequence capture probe design strategy selection terminal 50 in an embodiment of the present invention.

The second generation sequencing target sequence capture probe design strategy selection terminal 50 comprises: a memory 51 and a processor 52, the memory 51 for storing computer programs; the processor 52 runs a computer program to implement the second generation sequencing targeted sequence capture probe design strategy selection method as described in figure 1.

Optionally, the number of the memories 51 may be one or more, and the number of the processors 52 may be one or more; FIG. 5 shows one of each as an example.

Optionally, the processor 52 in the second-generation sequencing targeted sequence capture probe design strategy selection terminal 50 may load one or more instructions corresponding to the process of the application program into the memory 51 according to the steps described in FIG. 1, and the processor 52 runs the application program stored in the memory 51, so as to implement the various functions of the second-generation sequencing targeted sequence capture probe design strategy selection method described in FIG. 1.

Optionally, the memory 51 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.

Optionally, the processor 52 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The present invention also provides a computer-readable storage medium storing a computer program which, when run, implements the method for selecting a design strategy for a targeting sequence capture probe for next generation sequencing as shown in FIG. 1. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media or machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a product that has not yet been connected to a computer device, or a component that is already connected to and used by a computer device.

In summary, the second generation sequencing targeting sequence capture probe design strategy selection system of the present invention obtains a corresponding probe capture region classification result according to an input target sequence based on a constructed probe capture region classification model, and selects a corresponding probe design strategy according to the probe capture region classification result. According to the invention, through the prediction of the sequence characteristics of the target region by the model, different regions can be grouped, and different probe design and laying strategies are adopted in a targeted manner, so that the optimization process of a subsequent experiment can be shortened, and the time cost is saved; the method can also effectively improve the overall performance of the capture detection panel in practical application, save research and development cost and ensure accurate and stable detection of clinical samples. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Those skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for selecting a design strategy of a target sequence capture probe for next generation sequencing, which is characterized by comprising the following steps:

acquiring a target sequence to be subjected to probe design strategy selection;

based on the constructed probe capture region classification model, obtaining a corresponding probe capture region classification result according to an input target sequence; wherein the types of the probe capture region classification results include: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region;

and selecting a corresponding probe design strategy according to the classification result of the probe capture area.

2. The method for selecting the design strategy of the next generation sequencing targeting sequence capture probe according to claim 1, wherein the probe capture region classification model is constructed in a manner that comprises:

selecting a plurality of target sample regions covered by probes laid at a fixed number of layers without differentiation;

based on the region capture efficiency judgment rule, dividing each probe coverage target sample region into a high-efficiency sample region group corresponding to the probe capture high-efficiency region and a low-efficiency sample region group corresponding to the probe capture low-efficiency region; wherein the set of high efficiency sample regions comprises: a plurality of high efficiency sample regions; the set of low-efficiency sample regions comprises: a plurality of low-efficiency sample regions;

extracting target region sequence features of each sample region in the high-efficiency sample region group and the low-efficiency sample region group to obtain a feature training matrix;

and training by using the characteristic training matrix to obtain the probe capture region classification model.

3. The method for selecting the design strategy of the next-generation sequencing targeting sequence capture probe according to claim 2, wherein the region capture efficiency judgment rule comprises:

when the sequencing depth of the probe covering the target sample region is not less than the difference value between the average sequencing depth and the standard deviation, taking the probe covering the target sample region as a high-efficiency sample region;

and when the sequencing depth of the probe covering the target sample region is less than the difference between the average sequencing depth and the standard deviation, the probe covering the target sample region is taken as a low-efficiency sample region.

4. The method of selecting a targeting sequence capture probe design strategy for next generation sequencing according to claim 2, wherein said extracting target region sequence features for each high efficiency sample region and each low efficiency sample region to obtain a feature training matrix comprises:

performing unordered k-mer traversal on each high-efficiency sample region and each low-efficiency sample region to obtain, for each sample region, feature data corresponding to each of a plurality of k-mers;

and screening the feature data according to the verification results of the models trained on the feature data of each k-mer, and taking the feature data of the selected k-mer for each sample region as the target region sequence features of that sample region to obtain the feature training matrix.

5. The method for selecting the strategy for designing the next-generation sequenced target sequence capture probe according to claim 4, wherein the verification result comprises: recall, precision, accuracy, and F1 score.

6. The method for selecting the design strategy of the next-generation sequencing targeting sequence capture probe according to claim 2, wherein the probe capture region classification model is obtained by optimizing by adopting a ten-fold cross validation method.

7. The method for selecting the design strategy of the target sequence capture probe for the next generation sequencing according to claim 1, wherein the selecting the corresponding probe design strategy according to the classification result of the probe capture region comprises:

if the probe capture region classification result is the probe capture high-efficiency result, selecting a conventional probe laying strategy;

and if the probe capture region classification result is the probe capture low-efficiency result, selecting a multiple differential probe laying strategy.

8. The method for selecting the design strategy of the target sequence capture probe for next generation sequencing according to claim 7, wherein the conventional probe laying strategy is a 3-fold probe coverage strategy; the multiple differential probe laying strategy is a 5-fold probe coverage strategy.

9. A second generation sequenced targeted sequence capture probe design strategy selection system, comprising:

the target sequence acquisition module is used for acquiring a target sequence to be subjected to probe design strategy selection;

the probe capture region classification module is connected with the target sequence acquisition module and is used for acquiring a corresponding probe capture region classification result according to the input target sequence based on the constructed probe capture region classification model; wherein the types of the probe capture region classification results include: a probe capture high efficiency result corresponding to the probe capture high efficiency region and a probe capture low efficiency result corresponding to the probe capture low efficiency region;

and the strategy selection module is connected with the probe capture region classification module and is used for selecting the corresponding probe design strategy according to the probe capture region classification result.

10. A terminal for selecting a strategy for designing a targeting sequence capture probe for next generation sequencing is characterized by comprising the following components: one or more memories and one or more processors;

the one or more memories for storing a computer program;

the one or more processors, coupled to the memory, to execute the computer program to perform the method of any of claims 1-8.