CN116312780A - Method, terminal and medium for detecting somatic mutation of targeted gene second-generation sequencing data - Google Patents


Publication number
CN116312780A
Authority
CN
China
Prior art keywords
mutation
result
data
detection
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310520121.8A
Other languages
Chinese (zh)
Other versions
CN116312780B (en
Inventor
李尔汉
杨冬成
邓泱泱
李梦真
资意
蔡兴盛
陈敬臣
李金辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Mygene Medical Technology Co ltd
Original Assignee
Guangzhou Mygene Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Mygene Medical Technology Co ltd filed Critical Guangzhou Mygene Medical Technology Co ltd
Priority to CN202310520121.8A priority Critical patent/CN116312780B/en
Publication of CN116312780A publication Critical patent/CN116312780A/en
Application granted granted Critical
Publication of CN116312780B publication Critical patent/CN116312780B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of gene detection, and in particular to a method, a terminal and a medium for detecting somatic mutation of targeted gene second-generation sequencing data. The method comprises the following steps: acquiring data to be sequenced, preprocessing the data to be sequenced, and carrying out unique molecular marking on the preprocessing result to obtain standard sequencing data; obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result; performing de-duplication on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistency sequence; performing mutation detection on the consistency sequence to obtain a corresponding detection result; and generating a mutation detection analysis report according to the result. The method supports both the tumor single-sample and the tumor/control paired-sample detection and analysis modes.

Description

Method, terminal and medium for detecting somatic mutation of targeted gene second-generation sequencing data
Technical Field
The invention relates to the technical field of gene detection, in particular to a method, a terminal and a medium for detecting somatic mutation of targeted gene second-generation sequencing data.
Background
With the rapid development of second-generation sequencing technology, targeted gene detection is increasingly used for clinical auxiliary diagnosis, medication guidance and prognosis evaluation. As data volume and complexity increase, it becomes important to develop a comprehensive, accurate, stable and efficient set of data analysis methods and processes. Existing data analysis methods can be summarized as follows: they generally comprise analysis steps such as sequencing data quality control, comparison with a reference genome, comparison result quality control, mutation detection and result reporting; the sample type is a tumor single sample or a tumor and control paired sample; the detected mutation types include one or more of single nucleotide mutation, insertion/deletion mutation, copy number mutation, gene fusion, tumor mutation load, microsatellite instability and the like; and a number of common open-source analysis software packages or tools are used.
The prior art solutions described above have the following drawbacks:
Most existing analysis methods support only the tumor/control paired-sample mode. In many cases, however, only a tumor sample is available and a control sample cannot be obtained for various reasons, so an analysis method that also supports a tumor single sample needs to be developed.
Disclosure of Invention
In order to provide a detection analysis mode for simultaneously supporting a tumor single sample and a tumor/control paired sample, the application provides a targeted gene second-generation sequencing data somatic mutation detection method, terminal and medium.
The first object of the present invention is achieved by the following technical solutions:
a targeted gene second-generation sequencing data somatic mutation detection method, comprising:
acquiring data to be sequenced, preprocessing the data to be sequenced, and carrying out unique molecular marking from a preprocessing result to obtain standard sequencing data;
obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result;
performing de-duplication treatment on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistency sequence;
performing mutation detection according to the consistent sequence to obtain a corresponding detection result, wherein the mutation detection comprises single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection and microsatellite instability detection;
And generating a mutation detection analysis report according to the result.
By adopting this technical scheme, data containing the unique molecular marker can be error-corrected in a manner adapted to the type of the unique molecular marker, and the standard sequencing data are de-duplicated so that repeated fragment sequences produced by PCR during sequencing are removed; this brings the resulting consistency sequence closer to the actual sequence of the sampled individual and improves detection accuracy. In addition, single nucleotide variation and short insertion/deletion variation detection can be performed on either a tumor sample with a control sample or a tumor single sample, so that various mutations can be detected rapidly and accurately for tumor samples lacking a normal paired sample. Combined with the detection of copy number variation, gene fusion, tumor mutation load, microsatellite instability and other mutations and markers, more accurate tumor treatment target information can be mined, providing more help for patients in selecting potentially beneficial targeted drugs.
The present application may be further configured in a preferred example to: the step of obtaining the data to be sequenced, the step of preprocessing the data to be sequenced, and the step of performing unique molecular marking from the preprocessing result to obtain standard sequencing data specifically comprises the following steps:
splitting the data to be sequenced according to a preset instruction to obtain pieces of data to be marked of the same size;
identifying each piece of data to be marked; if the unique molecular marker is identified, adding sequence annotation information to the data to be marked to obtain annotated data, and otherwise taking the data to be marked as non-annotated data;
and cleaning the non-annotated data and the annotated data to remove redundant sequence data, and merging the cleaned annotated data and non-annotated data to obtain the standard sequencing data.
By adopting this technical scheme, the data to be sequenced are split into pieces of data to be marked of the same size, so that multithreaded computation can be used to improve the efficiency of comparing the sequencing data; adapter sequences, low-quality sequences, sequences with an excessively high content of unknown bases and the like can also be removed from the sequencing data, reducing the generation of unnecessary files. In addition, each piece of data is annotated according to whether it contains the unique molecular marker, yielding non-annotated data and annotated data, so that the data corresponding to the tumor DNA sample can be analyzed in either case, which improves the applicability of the test.
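The splitting, marker identification and cleaning steps above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the read representation (name, sequence), the assumption that a unique molecular marker appears as the last colon-separated field of the read name, the function names and the `max_n_fraction` threshold are all hypothetical choices made for illustration.

```python
import re

# Assumption (not from the source): a UMI, when present, is the last
# ':'-separated field of the read name and consists only of A/C/G/T,
# optionally as two halves joined by '+' or '-'.
UMI_PATTERN = re.compile(r":([ACGT]+(?:[+-][ACGT]+)?)$")


def split_chunks(reads, chunk_size):
    """Split reads into consecutive chunks of chunk_size (the last may be smaller)."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def annotate(chunk):
    """Separate a chunk into UMI-annotated reads and non-annotated reads."""
    annotated, non_annotated = [], []
    for name, seq in chunk:
        match = UMI_PATTERN.search(name)
        if match:
            annotated.append((name, seq, match.group(1)))
        else:
            non_annotated.append((name, seq))
    return annotated, non_annotated


def is_clean(seq, max_n_fraction=0.1):
    """Keep a read only if its fraction of unknown bases (N) is acceptable."""
    return bool(seq) and seq.count("N") / len(seq) <= max_n_fraction
```

Each chunk could then be handed to a separate worker thread or process, which is the point of splitting the input into pieces of equal size.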
The present application may be further configured in a preferred example to: the step of obtaining a reference genome, which is to compare the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result, specifically comprises the following steps:
comparing the standard sequencing data with the reference genome to obtain a first comparison result, and attaching the sequence annotation information of the unique molecular marker to the first comparison result to obtain a second comparison result;
and acquiring coordinate information of the reference genome, and sorting the second comparison result according to the coordinate information to obtain the comparison result.
By adopting this technical scheme, the standard sequencing data are compared with the reference genome according to the unique molecular marker, and the sequence annotation information is attached to the first comparison result, so that the standard sequencing data can be conveniently analyzed against the reference genome; the second comparison result is then sorted according to the coordinate information, so that the resulting comparison result is ordered along the reference genome sequence, which facilitates subsequent detection of gene mutations.
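As a rough illustration of attaching the annotation information and ordering the result by reference coordinates, the following Python sketch uses plain dictionaries as comparison records. The field names (`read`, `contig`, `start`, `umi`) and the function names are hypothetical; a real pipeline would operate on SAM/BAM records with dedicated tools.

```python
def attach_umi(alignments, umi_by_read):
    """Return copies of the alignment records with a 'umi' field attached.

    Reads without a known UMI get umi=None (non-annotated data).
    """
    return [dict(a, umi=umi_by_read.get(a["read"])) for a in alignments]


def sort_by_coordinate(alignments, contig_order):
    """Sort records by contig (in reference order) and then by start position."""
    rank = {contig: i for i, contig in enumerate(contig_order)}
    return sorted(alignments, key=lambda a: (rank[a["contig"]], a["start"]))
```

Passing the contig order explicitly avoids the pitfall of sorting contig names as strings, where "chr10" would sort before "chr2".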
The present application may be further configured in a preferred example to: and performing de-duplication processing on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistent sequence, wherein the method specifically comprises the following steps of:
obtaining a first start and end position of the non-annotated data on the reference genome, and marking a first repeated sequence according to the first start and end position;
acquiring a second start and end position of the annotated data on the reference genome, acquiring comparison sequence annotation information on the reference genome according to the second start and end position, and comparing the comparison sequence annotation information with the corresponding sequence annotation information;
if the comparison is inconsistent, marking the annotation data as a second repeated sequence, and removing the first repeated sequence and the second repeated sequence;
and if the comparison is consistent, taking the sequence corresponding to the annotation data and the reference genome as a data set to be processed, judging the unique molecular marker type of each data set to be processed, and carrying out consistency processing according to the unique molecular marker type to obtain the consistency sequence.
By adopting this technical scheme: when the gene sequence of a sample is determined with PCR, corresponding primers are required to extract the DNA, and the measured result may contain recognition errors caused by closely similar chromosomal regions. Marking the first repeated sequence and the second repeated sequence from the non-annotated data and the annotated data and removing them therefore removes the erroneously recognized sequences; in addition, performing consistency processing on the annotated data according to the unique molecular marker type removes other erroneous sequences such as primer sequences, so that the resulting consistency sequence agrees better with the actual sequence of the sample.
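The grouping idea behind this de-duplication can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of the application: reads are grouped by start and end position plus the unique molecular marker (if any), and each group is collapsed by a column-wise majority vote. The handling of different unique molecular marker types is not modeled, and the record fields are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Column-wise majority vote over sequences of equal length."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def deduplicate(reads):
    """Collapse reads sharing contig, start, end and UMI into one sequence.

    reads: dicts with 'contig', 'start', 'end', 'seq' and an optional 'umi'.
    Reads without a UMI are grouped by position alone (umi=None).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["contig"], read["start"], read["end"], read.get("umi"))
        groups[key].append(read["seq"])
    return {key: consensus(seqs) for key, seqs in groups.items()}
```

With three reads at the same position and marker, a sequencing error present in only one of them is outvoted by the other two, which is the error-correcting effect attributed to the marker-based processing.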
The present application may be further configured in a preferred example to: performing mutation detection according to the consistent sequence to obtain a corresponding detection result, wherein the single nucleotide mutation and short insertion deletion mutation detection specifically comprises:
performing single-sample mutation detection on the consistent sequence to obtain a mutation detection original result, and performing standardization processing on the mutation detection original result to obtain a first mutation standardization result;
performing basic filtration on the first variation standardized result to remove false positive variation and germ line variation in the variation standardized result, so as to obtain a first basic variation result;
screening somatic mutation data from the first basic variation result, and performing mutation site strand bias filtering, filtering based on genomic repeat regions and problematic regions, low allele frequency mutation filtering and filtering based on a self-built background noise database on the somatic mutation data to obtain a first advanced mutation result;
and performing functional filtration on the first advanced mutation result to obtain first somatic single nucleotide and short insertion deletion mutation data.
By adopting this technical scheme: in tumor/control paired samples, germ line mutations can be filtered out based on the control sample during detection, so that somatic mutations are obtained. For a tumor single sample, all mutations, including both somatic and germ line mutations, are detected first; each mutation is then predicted to be either a somatic mutation or a germ line mutation, and the germ line mutations are filtered out, so that the somatic mutations are obtained by screening. The corresponding mutation filtering is then performed on the somatic mutations to obtain the first somatic single nucleotide and short insertion deletion mutation data.
The present application may be further configured in a preferred example to: performing mutation detection according to the consistent sequence to obtain a corresponding detection result, wherein the single nucleotide mutation and short insertion deletion mutation detection specifically comprises:
obtaining a control sample sequence corresponding to the consistency sequence, carrying out pairing mutation detection on the consistency sequence and the control sample sequence to obtain a mutation detection original result, and carrying out standardization processing on the mutation detection original result to obtain a second mutation standardization result;
performing basic filtration on the second variation standardized result to remove false positive variation and germ line variation in the variation standardized result and obtain a second basic variation result;
performing advanced filtering on the second basic variation result to obtain a second advanced variation result;
and performing functional filtration on the second advanced mutation result to obtain second somatic single nucleotide and short insertion deletion mutation data.
By adopting this technical scheme, for tumor/control paired samples, germ line mutations can be filtered out based on the control sample during detection to obtain somatic mutations. Basic filtering is then performed on the coverage depth of the mutation site, the number of sequences supporting the mutation site, the allele frequency of the mutation site and the allele frequency ratio of the mutation site, and advanced filtering is performed according to a population frequency database, a self-built germ line mutation database and a self-built background noise database, yielding the corresponding second somatic single nucleotide and short insertion deletion mutation data.
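The basic and advanced filtering criteria named above can be pictured with a short Python sketch. All thresholds (`min_depth`, `min_alt_reads`, `min_vaf`, `max_population_af`), the record fields and the function names are hypothetical; the application does not state concrete values, and the allele frequency ratio criterion is omitted here.

```python
def passes_basic_filter(variant, min_depth=100, min_alt_reads=5, min_vaf=0.01):
    """Check coverage depth, supporting read count and allele frequency."""
    depth = variant["depth"]
    vaf = variant["alt_reads"] / depth if depth else 0.0
    return (
        depth >= min_depth
        and variant["alt_reads"] >= min_alt_reads
        and vaf >= min_vaf
    )


def passes_advanced_filter(variant, noise_sites, max_population_af=0.01):
    """Reject known background-noise sites and common population variants.

    noise_sites: set of (contig, pos, ref, alt) keys from a background
    noise database (hypothetical representation).
    """
    key = (variant["contig"], variant["pos"], variant["ref"], variant["alt"])
    return (
        key not in noise_sites
        and variant.get("population_af", 0.0) <= max_population_af
    )
```

A variant would be retained only if it passes both functions; functional filtering would follow as a separate stage.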
The present application may be further configured in a preferred example to: performing mutation detection according to the consistency sequence to obtain a corresponding detection result, wherein the gene fusion detection specifically comprises the following steps:
extracting split sequences, inconsistent sequences and the insert fragment length distribution from the consistency sequence;
respectively carrying out structural variation detection on the split sequences and the inconsistent sequences and the insert fragment length distribution to obtain a structural variation detection original result;
extracting structural variation breakpoint information and corresponding supporting breakpoint sequence data from the structural variation detection original result;
counting the coverage depth result of the structural variation breakpoint information, and acquiring and annotating the gene and exon or intron information thereof from the structural variation breakpoint information to obtain a breakpoint annotation result;
performing basic filtering on the breakpoint supporting sequence data, the coverage depth result and the breakpoint annotation result to obtain candidate gene fusion sequences;
classifying the candidate gene fusion sequences according to preset gene fusion types, and then performing hierarchical filtering on each class to obtain a gene fusion filtering result corresponding to each gene fusion type;
And merging the gene fusion filtering results of each class to obtain a gene fusion result.
By adopting this technical scheme, structural variations that do not meet preset values are removed according to the coverage depth of the breakpoints, the number of sequences supporting the structural variation, the distance between the breakpoints and whether the breakpoints are located within a gene interval, so that possible gene fusion data are obtained; corresponding fusion filtering is then performed according to the different gene fusion types, so that the gene fusion result better matches clinical importance and targeted treatment of the patient is more accurate.
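A minimal Python sketch of the basic breakpoint filtering described above follows. The thresholds, the record fields and the rule that both breakpoints must lie within an annotated gene are assumptions made for illustration only; the application speaks of preset values without stating them.

```python
def is_candidate_fusion(sv, min_depth=50, min_support=3, min_distance=1000):
    """Apply basic filters to one structural variation record.

    sv: dict with 'contig1', 'pos1', 'gene1', 'contig2', 'pos2', 'gene2',
    'depth' (coverage at the breakpoints) and 'support' (supporting reads).
    A gene field of None means the breakpoint lies outside any gene interval.
    """
    same_contig = sv["contig1"] == sv["contig2"]
    far_apart = (not same_contig) or abs(sv["pos2"] - sv["pos1"]) >= min_distance
    in_genes = sv["gene1"] is not None and sv["gene2"] is not None
    return (
        sv["depth"] >= min_depth
        and sv["support"] >= min_support
        and far_apart
        and in_genes
    )
```

Candidates that pass could then be grouped by fusion type and filtered per group with type-specific thresholds.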
The present application may be further configured in a preferred example to: the targeted gene second-generation sequencing data somatic mutation detection method further comprises the following steps:
acquiring analysis step nodes, judging whether the analysis step nodes have corresponding detection checkpoints, if yes, reading checkpoint information, and if no, initializing the detection checkpoints;
judging whether the analysis step is finished from the check point information, if yes, ending the analysis step, entering a next analysis step node, and if not, initializing the detection check point;
and after initializing the detection check point, monitoring the analysis step node, and writing the analysis step completion message or the analysis step interrupt message into the detection check point when the analysis step completion message or the analysis step interrupt message is acquired.
By adopting this technical scheme, after each analysis step is started, it is first judged whether a checkpoint corresponding to the analysis step exists. If not, the checkpoint is initialized. If so, the information in the checkpoint is read, and whether the analysis step was successfully completed is judged from that information: if it was, the analysis step is skipped and ended, and the next analysis step is entered; if it was not, the checkpoint is initialized just as when no corresponding checkpoint exists. After the checkpoint is initialized, the analysis step is started and its operation is monitored; when the analysis step completes or is interrupted, the successfully completed or unsuccessfully completed state is written into the checkpoint and the analysis step ends. If the step completed successfully, the next analysis step is entered; if not, the analysis flow ends without entering the next analysis step. By setting checkpoints in this way, when the analysis flow is restarted after being interrupted by an abnormal factor, the analysis steps that were already successfully completed can be skipped and operation resumes directly from the interrupted analysis step, thereby saving computing resources and time.
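The checkpoint logic described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the application: storing each checkpoint as a small JSON file, the file naming, the status strings and the function names are all assumptions.

```python
import json
from pathlib import Path


def run_step(name, func, checkpoint_dir):
    """Run one analysis step unless its checkpoint records success.

    Returns True if the step is (or already was) completed, False if it
    was interrupted by an exception.
    """
    path = Path(checkpoint_dir) / f"{name}.checkpoint.json"
    if path.exists():
        if json.loads(path.read_text()).get("status") == "completed":
            return True  # completed in an earlier run: skip this step
    # No checkpoint, or the step did not complete: (re)initialize it.
    path.write_text(json.dumps({"status": "initialized"}))
    try:
        func()
    except Exception as exc:
        path.write_text(json.dumps({"status": "interrupted", "error": str(exc)}))
        return False
    path.write_text(json.dumps({"status": "completed"}))
    return True


def run_pipeline(steps, checkpoint_dir):
    """Run (name, func) steps in order, stopping at the first failure."""
    Path(checkpoint_dir).mkdir(parents=True, exist_ok=True)
    for name, func in steps:
        if not run_step(name, func, checkpoint_dir):
            return False
    return True
```

Calling `run_pipeline` a second time with the same `checkpoint_dir` skips every step whose checkpoint reads "completed" and resumes at the step that was interrupted.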
The second object of the present invention is achieved by the following technical solutions:
a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the targeted gene second generation sequencing data somatic mutation detection method described above when the computer program is executed.
The third object of the present invention is achieved by the following technical solutions:
a computer readable storage medium storing a computer program which when executed by a processor performs the steps of the targeted gene second-generation sequencing data somatic mutation detection method described above.
In summary, the present application includes at least one of the following beneficial technical effects:
By providing the targeted gene second-generation sequencing data somatic mutation detection method, mutations such as single nucleotide mutation, short insertion deletion mutation, copy number mutation, gene fusion, tumor mutation load and microsatellite instability can be detected rapidly and comprehensively, more accurate tumor treatment target information can be mined, and more help is provided for patients in selecting potentially beneficial targeted drugs. The invention supports both the tumor single-sample and the tumor/control paired-sample modes, and can rapidly and accurately detect various mutations for a tumor sample lacking a normal paired sample. It supports both data containing a unique molecular marker (UMI) and data not containing one, and for data containing a unique molecular marker it applies an error-correction method appropriate to the type of the unique molecular marker. It executes multiple methods in parallel to make full use of computing resources and shorten the time required to complete the data analysis. In addition, it supports resuming from a breakpoint by setting a checkpoint for each analysis step, so that when the analysis is restarted, operation can resume directly from the interrupted analysis step, saving computing resources and time.
Drawings
FIG. 1 is a flow chart of somatic mutation detection using targeted gene second generation sequencing data in one embodiment of the present application;
FIG. 2 is a schematic diagram of parallel execution of variation detection according to the present application;
FIG. 3 is a flowchart showing the implementation of step S10 in somatic mutation detection of targeted gene second-generation sequencing data in an embodiment of the present application;
FIG. 4 is a flowchart showing the implementation of step S20 in somatic mutation detection of targeted gene second-generation sequencing data in an embodiment of the present application;
FIG. 5 is a flowchart showing the implementation of step S30 in somatic mutation detection of targeted gene second-generation sequencing data in an embodiment of the present application;
FIG. 6 is a first implementation flowchart of step S40 in somatic mutation detection of targeted gene second generation sequencing data in an embodiment of the present application;
FIG. 7 is a second implementation flowchart of step S40 in somatic mutation detection of targeted gene second generation sequencing data in an embodiment of the present application;
FIG. 8 is a third implementation flowchart of step S40 in somatic mutation detection of targeted gene second generation sequencing data in an embodiment of the present application;
FIG. 9 is a flow chart of another implementation in somatic mutation detection of targeted gene second-generation sequencing data in an embodiment of the present application;
FIG. 10 is a schematic view of an apparatus in an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
In one embodiment, as shown in fig. 1, the application discloses a method for detecting somatic mutation of targeted gene second-generation sequencing data, which specifically comprises the following steps:
S10: acquiring data to be sequenced, preprocessing the data to be sequenced, and carrying out unique molecular marking on the preprocessing result to obtain standard sequencing data.
In this embodiment, the data to be sequenced refers to the gene sequence extracted from the sample and detected. Standard sequencing data refers to the gene sequence required for tumor mutation detection.
Specifically, when a malignant tumor patient receives targeted therapy, tumor gene detection must be performed on the patient to detect cancerous cells, so that corresponding targeted therapy can be applied to them, either preventing the cancerous cells from expressing genes or precisely killing the cancerous cells while reducing the effect on other normal cells. The cancerous cells of the patient and their gene sequences therefore need to be obtained in order to select the corresponding targeted drug for the therapy. After the surface cells of the patient are extracted, the sequence to be tested is obtained by PCR-based sequencing, preset processing is performed to remove redundant gene sequences and reduce the generation of unnecessary intermediate files, and a unique molecular marker is added to each identifiable DNA fragment by UMI (Unique Molecular Identifiers) technology, thereby obtaining the standard sequencing data.
S20: obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result.
Specifically, a corresponding reference genome is obtained, and the standard sequencing data are mapped to the reference genome according to the positions of the gene fragments in the reference genome; if unique molecular markers are present in the standard sequencing data, the sequenced gene fragments containing them are marked, yielding the corresponding comparison result.
S30: performing de-duplication on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistency sequence.
Specifically, after the comparison result of the standard sequencing data and the reference genome is obtained, redundant repeated sequences are removed according to whether a gene sequence fragment carries the unique molecular marker and according to the type of the unique molecular marker, and error information in the sequence is corrected, thereby obtaining the consistency sequence.
S40: performing mutation detection according to the consistency sequence to obtain a corresponding detection result, wherein the mutation detection comprises single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection and microsatellite instability detection.
Specifically, referring to fig. 2, after obtaining the consensus sequence, the consensus sequence is subjected to gene mutation detection, wherein the mutation detection includes single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection, and microsatellite instability detection, respectively.
Copy number variation detection: before analysis, a baseline is established in advance with "CNVkit reference" from a plurality of copy-number-variation-negative samples (verified by other analysis platforms); during analysis, "CNVkit batch" is used to perform copy number variation detection based on the tumor sample (namely the consensus sequence) and the established baseline, genes with amplified copy numbers are screened according to the detection results, the copy numbers of these genes are calculated, and a genome-wide copy number variation plot is drawn;
Tumor mutation load detection: after the single nucleotide mutation and short insertion deletion mutation detection has produced the final high-confidence somatic mutations, the tumor mutation load (TMB) is calculated according to the following rule: mutations in the coding region of the target region with a mutation frequency greater than or equal to 5% are selected, driver mutations (mutations present in a hotspot mutation database, the TCGA_PCDM driver mutation database and the CHASMPlus driver mutation database) are removed, and the number of remaining mutations is divided by the size (Mb) of the coding region in the target region to obtain the tumor mutation load; the tumor mutation load state of the sample is then judged as tumor mutation load low (TMB-low) or tumor mutation load high (TMB-high) according to a threshold value;
Microsatellite instability detection: microsatellite instability detection is performed on the comparison result of the tumor sample using MSISensor2 and its pre-trained machine learning model, and the microsatellite instability state of the sample is then judged from the detection result, according to threshold values, as microsatellite stable (MSS), microsatellite instability-low (MSI-L) or microsatellite instability-high (MSI-H).
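The tumor mutation load rule above can be illustrated with a minimal Python sketch. The `Variant` record, its field names and the `threshold` argument are hypothetical and introduced only for illustration; the description does not state the threshold value, and treating a value equal to the threshold as TMB-high is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not from the description.
    vaf: float        # mutation frequency as a fraction (0.05 means 5%)
    in_coding: bool   # True if the site lies in the coding region of the target region
    is_driver: bool   # True if present in any hotspot/driver mutation database

def tumor_mutation_load(variants, coding_size_mb, min_vaf=0.05):
    """Count qualifying non-driver coding mutations per Mb of targeted coding region."""
    kept = [v for v in variants
            if v.in_coding and v.vaf >= min_vaf and not v.is_driver]
    return len(kept) / coding_size_mb

def tmb_status(tmb, threshold):
    # Assumption: values at or above the threshold are called TMB-high.
    return "TMB-high" if tmb >= threshold else "TMB-low"
```

For example, three qualifying mutations over a 1.5 Mb coding region give a load of 2.0 mutations per Mb.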
In the quality control of the consensus sequence, quality control can be performed with the Picard data analysis software, and the on-target rate, the duplication rate, the coverage depth of the target region and the uniformity of the consensus sequence are calculated to screen out the sequences meeting the requirements, wherein the specific quality control method comprises the following steps:
(1) Computing alignment statistics using "Picard CollectAlignmentSummaryMetrics";
(2) Counting the number of sequences in the target region using "samtools view", and then calculating the on-target rate based on the result;
(3) If the sequencing data does not contain a unique molecular marker (UMI), calculating the duplication rate according to the duplicate sequence statistics output by "Picard MarkDuplicates"; if the sequencing data contains a unique molecular marker (UMI), calculating the duplication rate according to the total sequence number of the sequencing data and the sequence number after de-duplication;
(4) Using "bedtools coverage" to calculate the coverage depth of each position within the target region, and then calculating the average coverage depth and the uniformity of the target region based on this result;
(5) If both a tumor sample and a control sample are present, using "somalier" to verify whether the tumor sample and the control sample come from the same individual.
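The quality control metrics listed above can be sketched in Python as follows. The function names are hypothetical. Two points are assumptions, because the description does not define them: the on-target rate is taken relative to the total number of aligned sequences, and uniformity is taken as the fraction of target positions covered at no less than 20% of the average depth.

```python
def on_target_rate(on_target_reads, total_aligned_reads):
    # Assumption: the denominator is the total number of aligned sequences.
    return on_target_reads / total_aligned_reads if total_aligned_reads else 0.0

def duplication_rate_umi(total_reads, deduplicated_reads):
    """Duplication rate for UMI data, from counts before and after de-duplication."""
    return 1.0 - deduplicated_reads / total_reads if total_reads else 0.0

def mean_depth(depths):
    """Average coverage depth over per-position depths of the target region."""
    return sum(depths) / len(depths) if depths else 0.0

def uniformity(depths, fraction=0.2):
    # Assumption: uniformity = share of positions with depth >= fraction * mean depth.
    avg = mean_depth(depths)
    if avg == 0:
        return 0.0
    return sum(1 for d in depths if d >= fraction * avg) / len(depths)
```

The per-position depths would come from the output of the coverage step; parsing that output is omitted here.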
S50: generating a mutation detection analysis report according to the detection result.
Specifically, according to the quality evaluation result of sequencing data, the quality control result of a consistent sequence, the detection result of single nucleotide variation and short insertion deletion variation, the detection result of copy number variation, the detection result of gene fusion, the detection result of tumor mutation load and the detection result of microsatellite instability, an analysis result report is generated;
the report generation step specifically comprises:
(1) Generating a quality control report according to the quality evaluation result of the sequencing data and the quality control result of the comparison result, and respectively generating an analysis result report for each mutation type according to the single nucleotide mutation and short insertion deletion mutation detection result, the copy number mutation detection result, the gene fusion detection result, the tumor mutation load detection result and the microsatellite instability detection result;
(2) Integrating the quality control report and the analysis result reports of each mutation type to form a complete analysis result report.
In this embodiment, data containing a unique molecular marker can be error-corrected adaptively according to the type of the unique molecular marker, and the standard sequencing data is de-duplicated, so that duplicate fragment sequences introduced by PCR during sequencing are removed and the agreement between the obtained consensus sequence and the actual sequence of the sampled individual is improved, thereby improving detection precision. In addition, single nucleotide variation and short insertion deletion variation detection supports both tumor samples with a control sample and single tumor samples, so that various mutations can be detected rapidly and accurately even for tumor samples lacking a matched normal sample. Combined with the detection of other mutations and markers such as copy number variation, gene fusion, tumor mutation load and microsatellite instability, more accurate tumor treatment target information can be mined, providing more help for patients in selecting potentially beneficial targeted drugs.
In one embodiment, as shown in fig. 3, in step S10, data to be sequenced is obtained, the data to be sequenced is preprocessed, and unique molecular marking is performed on the preprocessed result to obtain standard sequencing data, which specifically includes:
S11: splitting the data to be sequenced according to a preset instruction to obtain pieces of data to be marked of the same size.
Specifically, in the corresponding biological sequence analysis software, the data to be sequenced is divided by the corresponding instruction into a plurality of parts of equal size, so as to obtain the data to be marked.
S12: identifying each piece of data to be marked; if a unique molecular marker is identified, adding sequence annotation information to the data to be marked to obtain annotation data, and otherwise taking the data to be marked as un-annotated data.
Specifically, each piece of data to be marked is identified, and whether the library-construction adapter contains a UMI is determined according to the experiment; if so, the data to be marked is read using the "python map" package, the unique molecular marker is extracted from it and added to the first line of each sequence in the sequencing data as sequence annotation information (fastq comment), giving annotation data; otherwise, the data to be marked is taken as un-annotated data.
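The annotation step above can be sketched for a single FASTQ record as follows. This is an illustrative sketch only: the function name is hypothetical, and it assumes that the UMI occupies the first `umi_len` bases of the read and that the comment is written as the SAM-style tag `RX:Z:<UMI>`, a common convention that lets an aligner copy the comment into its output unchanged. Neither detail is stated in the description.

```python
def annotate_umi(header, sequence, quality, umi_len):
    """Move the UMI from the start of a read into the FASTQ header comment.

    Assumptions (not from the description): the UMI is the first `umi_len`
    bases of the read, and any existing header comment is discarded.
    """
    umi = sequence[:umi_len]
    read_name = header.split()[0]          # keep only the read name
    new_header = f"{read_name} RX:Z:{umi}"  # UMI stored as a SAM-style tag
    # Trim the UMI bases and their quality values from the read itself.
    return new_header, sequence[umi_len:], quality[umi_len:]
```

Applied to the record `@read1 1:N:0:ACGT` with sequence `AACCGGTTAC` and a 4-base UMI, the header becomes `@read1 RX:Z:AACC` and the remaining sequence is `GGTTAC`.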
S13: cleaning the un-annotated data and the annotation data, removing redundant sequence data, and combining the cleaned annotation data and un-annotated data to obtain standard sequencing data.
Specifically, "fastp" is used to evaluate the quality of the annotation data and the un-annotated data, counting the sequencing data yield, the number of sequences, Q20, Q30 and GC content, and removing adapter sequences, low-quality sequences and sequences with too high a proportion of unknown bases from the annotation data and the un-annotated data; this step is performed in parallel for each piece of annotation data and un-annotated data; after completion, the processed pieces of sequencing data are combined using the "cat" command to obtain the standard sequencing data, which is passed to the next analysis step through standard output to avoid generating unnecessary intermediate files.
In one embodiment, as shown in fig. 4, in step S20, a reference genome is obtained, and standard sequencing data is compared with the reference genome according to a unique molecular marker to obtain a corresponding comparison result, which specifically includes:
S21: comparing the standard sequencing data with the reference genome to obtain a first comparison result, and adding the sequence annotation information to the first comparison result to obtain a second comparison result.
Specifically, the preprocessed standard sequencing data is aligned to the reference genome using "bwa mem" to obtain a first comparison result, with the multithreading parameter (-t) set according to the available computing resources; if the standard sequencing data contains a unique molecular marker (UMI), the parameter "-C" is added so that the sequence annotation information (fastq comment) in the standard sequencing data is appended to the first comparison result, and MC and MQ tags are added to the comparison result using samblaster, thereby obtaining a second comparison result.
S22: acquiring coordinate information of the reference genome, and sorting the second comparison result according to the coordinate information to obtain the comparison result.
Specifically, the coordinate position of each gene sequence in the reference genome is obtained, the second comparison result is sorted by reference genome coordinates using "samtools sort", and after sorting is completed an index file is built using "samtools index", so that the comparison result is obtained.
In one embodiment, as shown in fig. 5, in step S30, the standard sequencing data is subjected to de-duplication processing according to the unique molecular markers and the comparison result, so as to obtain a consensus sequence, which specifically includes:
S31: acquiring a first start-stop position of the un-annotated data on the reference genome, and marking first repeat sequences according to the first start-stop position.
Specifically, for the un-annotated data, sequences aligned to the same start-stop position on the reference genome are marked as first repeat sequences using "Picard MarkDuplicates" in the Picard data analysis software.
S32: acquiring a second start-stop position of the annotation data on the reference genome, acquiring the sequence annotation information of the sequences aligned at the second start-stop position, and comparing it with the corresponding sequence annotation information.
Specifically, for the annotation data, "fgbio GroupReadsByUmi" in the fgbio analysis software is used: according to the second start-stop position to which each sequence in the sequencing data is aligned on the reference genome, the annotation data aligned to the same position is placed into one group, and the unique molecular markers within each group are compared.
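The grouping by start-stop position and unique molecular marker can be illustrated with a minimal Python sketch. The dictionary keys and the function name are hypothetical, and exact matching of UMI strings is an assumption made for brevity; a production tool may also tolerate mismatches between UMIs.

```python
from collections import defaultdict

def group_reads(reads):
    """Group read names by (chromosome, start, end, UMI).

    Each read is a dict with the illustrative keys
    'name', 'chrom', 'start', 'end' and 'umi'.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["umi"])
        groups[key].append(read["name"])
    return dict(groups)
```

Reads that share a position but carry different UMIs fall into different groups, so each group can then be handled separately in the following steps.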
S33: if the comparison is inconsistent, marking the annotation data as a second repeat sequence and removing the first repeat sequence and the second repeat sequence.
Specifically, if the sequences are aligned to the same second start-stop position on the reference genome but the corresponding unique molecular markers are not the same, the annotation data is marked as a second repeat sequence, and the first repeat sequence and the second repeat sequence are removed.
S34: if the comparison is consistent, taking the annotation data and the corresponding sequences on the reference genome as a data set to be processed, determining the unique molecular marker type of each data set to be processed, and performing consensus processing according to the unique molecular marker type to obtain a consensus sequence.
Specifically, if the comparison is consistent, i.e., the sequences are aligned to the same second start-stop position on the reference genome and the corresponding unique molecular markers are also the same, the annotation data is grouped together with the corresponding reference genome position; further, if the unique molecular marker type is single-end UMI, each group of sequences is examined base by base using "fgbio CallMolecularConsensusReads", a likelihood model is built for each base, and each base in the sequence and its quality value are then determined based on the model to obtain a consensus sequence; if the unique molecular marker type is double-end (duplex) UMI, each group of sequences is examined base by base using "fgbio CallDuplexConsensusReads", a likelihood model is built for each base, and each base in the sequence and its quality value are then determined based on the model to obtain a consensus sequence.
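The base-by-base likelihood idea can be illustrated with a deliberately simplified Python sketch. It is not the model used by fgbio: it assumes independent reads, a uniform prior over the four bases, error probabilities taken directly from Phred quality values, and reads of equal length that are already aligned to one another. The function names and the quality cap of 60 are hypothetical.

```python
import math

BASES = "ACGT"

def consensus_base(observations, max_qual=60):
    """Pick the most likely base for one column of a read group.

    observations: list of (base, phred_quality) pairs, one per read.
    """
    log_lik = {}
    for candidate in BASES:
        total = 0.0
        for base, qual in observations:
            # Error probability from the Phred score, capped at 0.75 so that
            # a quality of 0 is treated as uninformative.
            err = min(10 ** (-qual / 10), 0.75)
            total += math.log(1 - err) if base == candidate else math.log(err / 3)
        log_lik[candidate] = total
    best = max(BASES, key=lambda b: log_lik[b])
    # Posterior error under a uniform prior, computed relative to the best base.
    others = sum(math.exp(log_lik[b] - log_lik[best]) for b in BASES if b != best)
    p_err = others / (1 + others)
    qual = max_qual if p_err <= 0 else min(max_qual, round(-10 * math.log10(p_err)))
    return best, qual

def consensus_read(reads):
    """reads: list of (sequence, qualities) pairs of equal length."""
    length = len(reads[0][0])
    calls = [consensus_base([(seq[i], quals[i]) for seq, quals in reads])
             for i in range(length)]
    return "".join(b for b, _ in calls), [q for _, q in calls]
```

In this sketch a column where the reads disagree still yields the majority base, but with a lower consensus quality than a column where all reads agree.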
In one embodiment, as shown in fig. 6, in step S40, mutation detection is performed according to the consensus sequence to obtain a corresponding detection result, where the single nucleotide mutation and short insertion deletion mutation detection specifically includes:
S411: performing single-sample mutation detection on the consensus sequence to obtain a mutation detection original result, and standardizing the mutation detection original result to obtain a first mutation standardization result.
Specifically, according to the comparison result of the tumor sample, single nucleotide mutation and short insertion deletion mutation detection is performed in single-sample analysis mode to obtain a mutation detection original result; the mutation detection original result is then annotated to obtain a mutation annotation result, and the mutations are left-aligned and normalized to obtain the first mutation standardization result.
S412: performing basic filtering on the first mutation standardization result to remove false positive mutations and germline mutations, so as to obtain a first basic mutation result.
Specifically, basic filtering is performed on the first mutation standardization result to remove false positive mutations and germline mutations, including:
1) Filtering according to the coverage depth of the mutation site, such as coverage depth >= 100X;
2) Filtering according to the number of supporting sequences of the mutation site, such as total supporting sequences >= 5, forward supporting sequences >= 2 and reverse supporting sequences >= 2;
3) Filtering according to the allele frequency of the mutation site, such as single nucleotide hotspot mutation allele frequency >= 0.5%, short insertion deletion hotspot mutation allele frequency >= 2.0%, single nucleotide non-hotspot mutation allele frequency >= 1.0%, and short insertion deletion non-hotspot mutation allele frequency >= 5.0%.
After basic filtering, the first basic mutation result is obtained.
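The three basic filtering criteria can be combined in a short Python sketch using the example thresholds given above. The function and argument names are hypothetical, and allele frequencies are expressed as fractions (0.005 means 0.5%).

```python
# Minimum allele frequency by (mutation kind, is_hotspot), from the example thresholds.
MIN_ALLELE_FREQUENCY = {
    ("snv", True): 0.005,
    ("indel", True): 0.02,
    ("snv", False): 0.01,
    ("indel", False): 0.05,
}

def passes_basic_filter(depth, support_total, support_forward, support_reverse,
                        allele_frequency, kind, is_hotspot):
    """Return True if a mutation site passes depth, support and frequency checks."""
    return (depth >= 100
            and support_total >= 5
            and support_forward >= 2
            and support_reverse >= 2
            and allele_frequency >= MIN_ALLELE_FREQUENCY[(kind, is_hotspot)])
```

For instance, a single nucleotide mutation at 0.6% allele frequency passes only if it is a hotspot mutation, since the non-hotspot threshold is 1.0%.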
S413: screening somatic mutation data from the first basic mutation result, and performing mutation site strand preference filtering, filtering based on genome repeat regions and problem regions, low allele frequency mutation filtering and filtering based on a self-built background noise database on the somatic mutation data to obtain a first advanced mutation result.
Specifically, somatic mutations and germline mutations are distinguished within the first basic mutation result obtained after basic filtering in the previous step: first, known germline mutations are identified according to the population frequency of the mutation in population frequency databases and according to a self-built germline mutation database; then, the remaining mutations are classified as germline or somatic based on the result of mutation type prediction:
1) Identifying known germline mutations based on population frequency databases: if the population frequency of the mutation site is >= 0.5% in any population frequency data such as 1000G, ExAC, gnomAD or ChinaMAP, the mutation is identified as a known germline mutation;
2) Identifying known germline mutations based on the self-built germline mutation database: if the mutation site exists in the self-built germline mutation database, the mutation is identified as a known germline mutation;
3) Distinguishing germline mutations from somatic mutations based on the result of mutation type prediction, such as using the "PureCN" software to predict the mutation type and classifying the mutation as germline or somatic according to the prediction result.
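The three-stage germline/somatic decision can be sketched in Python as follows. The function name, the shape of the inputs and the label strings are hypothetical; the `predicted_type` argument stands in for the output of a separate prediction tool, which is not reproduced here.

```python
def classify_origin(variant_key, population_frequencies, in_house_germline,
                    predicted_type, max_population_frequency=0.005):
    """Classify a mutation as 'germline' or 'somatic'.

    population_frequencies: dict mapping a database name to the population
    frequency of this mutation (as a fraction); missing databases are omitted.
    in_house_germline: set of mutation keys from a self-built germline database.
    predicted_type: 'germline' or 'somatic', from an external predictor.
    """
    # 1) Known germline by population frequency in any database.
    if any(f >= max_population_frequency for f in population_frequencies.values()):
        return "germline"
    # 2) Known germline by presence in the self-built database.
    if variant_key in in_house_germline:
        return "germline"
    # 3) Otherwise rely on the predicted mutation type.
    return predicted_type
```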
Further, advanced filtering, comprising mutation site strand preference filtering, filtering based on genome repeat regions and problem regions, low allele frequency mutation filtering and filtering based on the self-built background noise database, is performed on the somatic mutations to further remove false positive mutations and germline mutations from the analysis results, including:
(1) Filtering according to the strand preference (strand bias) of the mutation site, wherein strand preference means that the numbers of forward and reverse sequences among the supporting sequences of the mutation site are unbalanced, with far more sequences in one direction than in the other; strand preference can cause errors in mutation detection results, and the higher the strand preference value, the more likely the mutation is a false positive, so possible false positive mutations are filtered out by the strand preference value, for example a strand preference value greater than 1.5;
(2) Filtering according to whether the mutation site lies in a repeat region or problem region of the genome, wherein such regions produce more errors than other regions during sequencing or alignment, and these errors can cause errors in mutation detection results, so mutation sites in these regions need to be filtered out;
(3) Filtering low allele frequency mutations (allele frequency < 2.0%) according to the strand preference of the mutation site and the average number of base mismatches of its supporting sequences: for low allele frequency mutation sites, the distributions of strand preference and of the average number of base mismatches of the supporting sequences differ between true positives and false positives, so the two are combined to filter low allele frequency mutations, for example strand preference Fisher test p value < 0.1 and average number of base mismatches of the supporting sequences >= 2.0;
(4) Filtering according to the self-built background noise database, for example establishing a baseline cumulative mutation frequency for every site in the detection region using a plurality of control samples, and filtering out the mutation as a false positive if the ratio of the allele frequency of the mutation site to the corresponding baseline cumulative mutation frequency is < 3, to obtain the first advanced mutation result.
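The four advanced filters can be sketched as a single Python function that returns the reasons for which a mutation would be removed. The dictionary keys and reason labels are hypothetical. The strand preference value and the Fisher test p value are treated as precomputed inputs because the description does not give their formulas, and reading the example thresholds as removal conditions is an interpretation.

```python
def advanced_filter_reasons(variant):
    """Return a list of reasons to filter a mutation; an empty list means it passes.

    `variant` is a dict with the illustrative keys: 'strand_bias',
    'in_problem_region', 'af', 'sb_fisher_p', 'mean_mismatches', 'baseline_af'.
    """
    reasons = []
    # (1) Strand preference value above the example threshold.
    if variant["strand_bias"] > 1.5:
        reasons.append("strand_bias")
    # (2) Site lies in a genome repeat region or problem region.
    if variant["in_problem_region"]:
        reasons.append("problem_region")
    # (3) Low allele frequency combined with strand preference and mismatches.
    if (variant["af"] < 0.02 and variant["sb_fisher_p"] < 0.1
            and variant["mean_mismatches"] >= 2.0):
        reasons.append("low_allele_frequency")
    # (4) Allele frequency less than 3x the baseline cumulative frequency.
    if variant["baseline_af"] > 0 and variant["af"] / variant["baseline_af"] < 3:
        reasons.append("background_noise")
    return reasons
```

Returning reasons rather than a boolean is a design choice of this sketch; it makes it easier to audit why a given site was removed.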
S414: performing functional filtering on the first advanced mutation result to obtain first somatic single nucleotide and short insertion deletion mutation data.
Specifically, functional filtering is performed on the first advanced mutation result obtained after advanced filtering in the previous step, wherein the functional filtering mainly filters according to the gene region where the mutation site is located and the functional type of the mutation, and filters out mutations other than the following 4 types:
(1) Variation within the coding region of the gene (except for synonymous mutations);
(2) Variation in splicing regions of genes;
(3) Variation within the TERT gene promoter region;
(4) Variation within introns 13 and 14 of the MET gene.
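The functional filter can be sketched in Python as follows. The region and consequence labels ('coding', 'splice', 'promoter', 'intron', 'synonymous') are hypothetical encodings of an annotation result and would need to be mapped from whichever annotation tool is used.

```python
def passes_functional_filter(gene, region, consequence=None, intron_number=None):
    """Keep only the four mutation types listed above; everything else is filtered."""
    if region == "coding":
        return consequence != "synonymous"
    if region == "splice":
        return True
    if region == "promoter":
        return gene == "TERT"
    if region == "intron":
        return gene == "MET" and intron_number in (13, 14)
    return False
```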
In one embodiment, as shown in fig. 7, in step S40, mutation detection is performed according to the consensus sequence to obtain a corresponding detection result, where the single nucleotide mutation and short insertion deletion mutation detection specifically includes:
S421: obtaining a control sample sequence corresponding to the consensus sequence, performing paired mutation detection on the consensus sequence and the control sample sequence to obtain a mutation detection original result, and standardizing the mutation detection original result to obtain a second mutation standardization result.
Specifically, for a tumor sample and a corresponding control sample sequence, performing single nucleotide mutation and short insertion deletion mutation detection by using a pairing analysis mode to obtain a mutation detection original result, performing mutation annotation on the mutation detection original result to obtain a mutation annotation result, and performing left alignment standardization on the mutation to obtain a standardized second mutation standardization result.
S422: performing basic filtering on the second mutation standardization result to remove false positive mutations and germline mutations, so as to obtain a second basic mutation result.
Specifically, the basic filtering mainly comprises 4 aspects of filtering:
(1) Filtering according to the coverage depth of the mutation site, such as coverage depth >= 100X;
(2) Filtering according to the number of supporting sequences of the mutation site, such as total supporting sequences >= 5, forward supporting sequences >= 2 and reverse supporting sequences >= 2;
(3) Filtering according to the allele frequency of the mutation site, such as single nucleotide hotspot mutation allele frequency >= 0.5%, short insertion deletion hotspot mutation allele frequency >= 2.0%, single nucleotide non-hotspot mutation allele frequency >= 1.0%, and short insertion deletion non-hotspot mutation allele frequency >= 5.0%;
(4) Filtering according to the ratio of the allele frequency of the mutation site in the tumor sample to its allele frequency in the control sample, such as allele frequency ratio >= 3.0,
thereby obtaining the second basic mutation result.
S423: performing advanced filtering on the second basic mutation result to obtain a second advanced mutation result.
Specifically, the advanced filtering of the second basic mutation result includes the following steps to obtain the second advanced mutation result:
(1) Filtering according to the strand preference (strand bias) of the mutation site, wherein strand preference means that the numbers of forward and reverse sequences among the supporting sequences of the mutation site are unbalanced, with far more sequences in one direction than in the other; strand preference can cause errors in mutation detection results, and the higher the strand preference value, the more likely the mutation is a false positive, so possible false positive mutations are filtered out by the strand preference value, for example a strand preference value greater than 1.5;
(2) Filtering according to whether the mutation site lies in a repeat region or problem region of the genome, wherein such regions produce more errors than other regions during sequencing or alignment, and these errors can cause errors in mutation detection results, so mutation sites in these regions need to be filtered out;
(3) Filtering low allele frequency mutations (allele frequency < 2.0%) according to the strand preference of the mutation site and the average number of base mismatches of its supporting sequences: for low allele frequency mutation sites, the distributions of strand preference and of the average number of base mismatches of the supporting sequences differ between true positives and false positives, so the two are combined to filter low allele frequency mutations, for example strand preference Fisher test p value < 0.1 and average number of base mismatches of the supporting sequences >= 2.0;
(4) Filtering according to the coverage depth, allele frequency and quality values of the mutation site, for example where the coverage depth of the mutation site is < 6, the average alignment quality value of the supporting sequences is < 55.0 and the average number of base mismatches of the supporting sequences is > 1.0, or where the average alignment quality value of the supporting sequences is < 60.0 and the average number of base mismatches of the supporting sequences is > 2.0;
(5) Filtering according to the population frequency of the mutation site in population frequency databases: if the population frequency of the mutation site is >= 0.5% in any population frequency data such as 1000G, ExAC, gnomAD or ChinaMAP, the mutation site is filtered out as a possible false positive or germline mutation;
(6) Filtering according to whether the mutation site exists in the self-built germline mutation database, which is established from high-confidence germline mutations detected in a plurality of control samples; if the mutation site exists in the database, it is filtered out as a germline mutation;
(7) Filtering according to the self-built background noise database, for example establishing a baseline cumulative mutation frequency for every site in the detection region using a plurality of control samples, and filtering out the mutation as a false positive if the ratio of the allele frequency of the mutation site to the corresponding baseline cumulative mutation frequency is < 3.
S424: performing functional filtering on the second advanced mutation result to obtain second somatic single nucleotide and short insertion deletion mutation data.
Specifically, the second advanced mutation result is subjected to functional filtration, wherein the functional filtration is mainly carried out according to the region of the gene where the mutation site is located and the functional type of mutation, and the mutation outside the following 4 types is filtered out, so that second somatic single nucleotide and short insertion deletion mutation data are obtained:
1) Variation within the coding region of the gene (except for synonymous mutations);
2) Variation in splicing regions of genes;
3) Variation within the TERT gene promoter region;
4) Variation within introns 13 and 14 of the MET gene.
In one embodiment, as shown in fig. 8, in step S40, mutation detection is performed according to the consensus sequence to obtain a corresponding detection result, where the gene fusion detection specifically includes:
s431: split and non-identical sequences and insert length distributions are extracted from the identical sequences.
Specifically, the split (SA-tagged) and non-identical (insert length out of the expected range) sequences in the alignment were extracted using "lumpy_filter" and the distribution of insert lengths was estimated, resulting in split and non-identical sequences, as well as the distribution of insert lengths.
S432: and respectively carrying out structural variation detection on the split and inconsistent sequences and the insert fragment length distribution to obtain the original structural variation detection result.
Specifically, the "lumpy-sv" was used to detect structural variations in the split and non-identical sequences and insert length distributions, respectively, to obtain the original results of the structural variations detection.
S433: and extracting structural variation breakpoint information and corresponding supporting breakpoint sequence data from the structural variation detection original result.
Specifically, a breakpoint (break point) corresponding to each structural mutation is annotated with "VEP", so as to obtain structural mutation breakpoint information and corresponding supporting breakpoint sequence data.
S434: and counting the coverage depth result of the structural variation breakpoint information, and acquiring and annotating the gene and the exon or intron information thereof from the structural variation breakpoint information to obtain a breakpoint annotation result.
Specifically, using "samtools depth" to count the coverage depth result of the breakpoint corresponding to each structural variation, and counting the coverage depth result of the structural variation breakpoint information, and obtaining and annotating the gene and the exon or intron information from the structural variation breakpoint information to obtain the breakpoint annotation result.
S435: performing basic filtering on the breakpoint-supporting sequence data, the coverage depth result and the breakpoint annotation result to obtain candidate gene fusion sequences.
Specifically, basic filtering is performed based on the supporting sequences of each structural variation breakpoint, the gene and exon or intron information corresponding to the breakpoint, and the coverage depth of the breakpoint, to remove obvious false positive structural variations and obtain possible candidate gene fusions, wherein the basic filtering mainly comprises 4 aspects of filtering:
1) Filtering based on the coverage depth of the breakpoint, such as coverage depth >= 100X;
2) Filtering based on the number of supporting sequences of the breakpoint, such as total supporting sequences >= 3, of which split sequences >= 1 and inconsistent sequences >= 1;
3) Filtering based on the distance between breakpoints, such as distance between breakpoints > 2000, or breakpoints located on different chromosomes;
4) Filtering based on the genomic regions corresponding to the breakpoint positions: a structural variation (gene-gene or gene-intergenic region) can be a candidate gene fusion only if at least one breakpoint is located within a gene region.
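The four basic filters for candidate gene fusions can be sketched in Python using the example thresholds above. The function name and the breakpoint tuple layout are hypothetical, and taking the total supporting sequence count as the sum of split and inconsistent sequences is an assumption.

```python
def passes_fusion_basic_filter(breakpoint1, breakpoint2, depth,
                               split_reads, discordant_reads):
    """Basic filter for one structural variation.

    Each breakpoint is a tuple (chromosome, position, gene_or_None), where the
    gene is None when the breakpoint lies in an intergenic region.
    """
    # 1) Coverage depth at the breakpoint.
    if depth < 100:
        return False
    # 2) Supporting sequences (assumption: total = split + inconsistent).
    if split_reads + discordant_reads < 3 or split_reads < 1 or discordant_reads < 1:
        return False
    # 3) Breakpoints must be > 2000 bases apart or on different chromosomes.
    same_chromosome = breakpoint1[0] == breakpoint2[0]
    if same_chromosome and abs(breakpoint1[1] - breakpoint2[1]) <= 2000:
        return False
    # 4) At least one breakpoint must fall within a gene region.
    return breakpoint1[2] is not None or breakpoint2[2] is not None
```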
S436: classifying the candidate gene fusion sequences according to preset gene fusion classes, and then performing tiered filtering to obtain the gene fusion filtering result corresponding to each gene fusion class.
Specifically, the candidate gene fusions obtained in the previous step are classified into three classes according to the clinical importance of the gene fusion and whether it is a known gene fusion in a gene fusion database, and each class is then filtered with its own threshold. A relatively loose threshold is set for gene fusions of high clinical importance and for known gene fusions, and a relatively strict threshold is set for unknown gene fusions that do not exist in the database, so that relatively high detection sensitivity is achieved for clinically important or known gene fusions and relatively high detection accuracy is achieved for unknown gene fusions:
1) Class I gene fusion: diagnostic biomarkers, drug-related gene fusions, hotspot gene fusions and the like are classified as class I gene fusions. Filtering threshold for class I gene fusion: total supporting sequences >= 3, of which split sequences >= 1 and inconsistent sequences >= 1; the diagnostic biomarkers, drug-related gene fusions, hotspot gene fusions and the like are collected from NCCN guidelines, biomarkers corresponding to FDA-approved drugs, the OncoKB database, the COSMIC database and the like;
2) Class II gene fusion: gene fusions in the gene fusion database other than class I gene fusions are classified as class II gene fusions. Filtering threshold for class II gene fusion: total supporting sequences >= 5, of which split sequences >= 2 and inconsistent sequences >= 2; the gene fusion database is collected and curated from the FusionGDB database, the TCGA gene fusion database and the like;
3) Class III gene fusion: unknown gene fusions that do not exist in the database are classified as class III gene fusions. Filtering threshold for class III gene fusion: total supporting sequences >= 9, of which split sequences >= 4 and inconsistent sequences >= 4.
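The classification and tiered thresholds can be sketched in Python as follows. The function names and the representation of a gene fusion as a pair of gene symbols are hypothetical; the two input sets stand in for the curated class I list and the known gene fusion database.

```python
# (total, split, inconsistent) minimum supporting sequences per class.
TIER_THRESHOLDS = {"I": (3, 1, 1), "II": (5, 2, 2), "III": (9, 4, 4)}

def fusion_class(gene_pair, class1_fusions, known_fusions):
    """Assign class I, II or III to a gene pair such as ('EML4', 'ALK')."""
    if gene_pair in class1_fusions:
        return "I"
    if gene_pair in known_fusions:
        return "II"
    return "III"

def passes_tier_filter(tier, total, split, discordant):
    """Apply the threshold of the given class to the supporting sequence counts."""
    min_total, min_split, min_discordant = TIER_THRESHOLDS[tier]
    return total >= min_total and split >= min_split and discordant >= min_discordant
```

With these thresholds, a fusion supported by 2 split and 3 inconsistent sequences passes as class I or class II but is removed as class III.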
S437: combining the gene fusion filtering results of each class to obtain a gene fusion result.
Specifically, the results obtained by filtering the 3 classes of gene fusions separately are combined; the gene fusions that pass all filtering constitute the final high-confidence gene fusion result.
In one embodiment, as shown in fig. 9, the targeted gene second generation sequencing data somatic mutation detection method further comprises:
S60: acquiring an analysis step node and judging whether the analysis step node has a corresponding detection checkpoint; if yes, reading the checkpoint information, and if not, initializing the detection checkpoint.
Specifically, steps S10 to S50 comprise, in sequence, sequencing data preprocessing, reference genome comparison, deduplication and error correction, quality control, variation detection, and report generation. A corresponding analysis step node is set for each of these flow steps. When each analysis step starts, it is judged whether the analysis step node has a corresponding detection checkpoint; if yes, the checkpoint information is read, and otherwise the detection checkpoint is initialized.
S70: judging from the checkpoint information whether the analysis step has been completed; if yes, ending the analysis step and entering the next analysis step node, and if not, initializing the detection checkpoint.
Specifically, whether the analysis step was successfully completed is judged from the checkpoint information. If so, the analysis step is skipped and ended, and the next analysis step begins; if not, the detection checkpoint is initialized and the running of the analysis step is monitored.
S80: after initializing the detection checkpoint, monitoring the analysis step node, and, when an analysis step completion message or an analysis step interrupt message is acquired, writing that message into the detection checkpoint.
Specifically, when the analysis step completes or is interrupted, the state (successfully completed or not successfully completed) is written into the checkpoint and the analysis step ends. If the analysis step completed successfully, the next analysis step is entered; if not, the analysis flow ends without entering the next analysis step. With checkpoints in place, when the analysis flow is restarted after being interrupted by an abnormal factor, the successfully completed analysis steps are skipped and operation resumes directly from the interrupted analysis step.
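The checkpoint logic of S60 to S80 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: storing one JSON file per analysis step, the status strings `"running"`, `"completed"`, and `"interrupted"`, and the function names `read_checkpoint`, `write_checkpoint`, and `run_pipeline` are all hypothetical choices.

```python
import json
from pathlib import Path


def read_checkpoint(path):
    """Return the stored checkpoint information, or None if none exists."""
    return json.loads(path.read_text()) if path.exists() else None


def write_checkpoint(path, step, status):
    """Write the state of one analysis step into its checkpoint file."""
    path.write_text(json.dumps({"step": step, "status": status}))


def run_pipeline(steps, checkpoint_dir):
    """Run (name, callable) steps in order, skipping completed ones.

    Returns the names of the steps executed in this run and a flag that
    is True only if the whole flow finished.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        path = checkpoint_dir / f"{name}.checkpoint.json"
        info = read_checkpoint(path)                      # S60: read if present
        if info is not None and info.get("status") == "completed":
            continue                                      # S70: skip finished step
        write_checkpoint(path, name, "running")           # initialize checkpoint
        try:
            func()                                        # S80: monitored step
        except Exception:
            write_checkpoint(path, name, "interrupted")
            return executed, False                        # end the flow here
        write_checkpoint(path, name, "completed")
        executed.append(name)
    return executed, True
```

Calling `run_pipeline` a second time with the same checkpoint directory resumes from the step that was interrupted, because every earlier step already has a `"completed"` checkpoint.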
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application in any way.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a targeted gene second-generation sequencing data somatic mutation detection method.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
the method comprises the steps of obtaining data to be sequenced, preprocessing the data to be sequenced, and carrying out unique molecular marking on the preprocessed result to obtain standard sequencing data;
obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result;
performing de-duplication treatment on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistent sequence;
performing mutation detection according to the consistent sequence to obtain a corresponding detection result, wherein the mutation detection comprises single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection and microsatellite instability detection;
and generating a mutation detection analysis report according to the result.
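The de-duplication step, which uses the unique molecular marker together with the comparison result to obtain a consistent sequence, can be illustrated with a simplified Python sketch. It assumes that reads sharing the same reference coordinates and the same unique molecular marker form one family, that all reads in a family have the same length, and that a per-position majority vote yields the consistent (consensus) sequence. The `Read` type and the function names are hypothetical, and the embodiment's handling of different unique molecular marker types is not modeled here.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical aligned read carrying its unique molecular marker."""
    chrom: str
    start: int
    end: int
    umi: str
    seq: str


def consensus(seqs):
    """Column-wise majority vote over equal-length sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def collapse_by_umi(reads):
    """Group reads by (coordinates, marker) and collapse each family."""
    families = defaultdict(list)
    for read in reads:
        families[(read.chrom, read.start, read.end, read.umi)].append(read.seq)
    return {key: consensus(seqs) for key, seqs in families.items()}
```

Collapsing each family to a single sequence removes amplification duplicates and suppresses isolated sequencing errors, since an error present in only a minority of the reads in a family is outvoted.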
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
the method comprises the steps of obtaining data to be sequenced, preprocessing the data to be sequenced, and carrying out unique molecular marking on the preprocessed result to obtain standard sequencing data;
obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result;
performing de-duplication treatment on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consistent sequence;
performing mutation detection according to the consistent sequence to obtain a corresponding detection result, wherein the mutation detection comprises single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection and microsatellite instability detection;
and generating a mutation detection analysis report according to the result.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method for detecting somatic mutation in targeted gene second-generation sequencing data, characterized by comprising the following steps:
acquiring data to be sequenced, preprocessing the data to be sequenced, and performing unique molecular marking on the preprocessing result to obtain standard sequencing data;
obtaining a reference genome, and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result;
performing de-duplication processing on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consensus sequence;
performing mutation detection according to the consensus sequence to obtain a corresponding detection result, wherein the mutation detection comprises single nucleotide mutation and short insertion deletion mutation detection, copy number mutation detection, gene fusion detection, tumor mutation load detection and microsatellite instability detection;
and generating a mutation detection analysis report according to the detection result.
2. The method for detecting somatic mutation of targeted gene second-generation sequencing data according to claim 1, wherein the step of acquiring the data to be sequenced, preprocessing the data to be sequenced, and performing unique molecular marking on the preprocessing result to obtain standard sequencing data comprises:
splitting the data to be sequenced according to a preset instruction to obtain pieces of data to be marked of the same size;
identifying each piece of data to be marked; if the unique molecular marker is identified, annotating the data to be marked with sequence annotation information to obtain annotation data, and otherwise taking the data to be marked as non-annotation data;
and cleaning the non-annotation data and the annotation data to remove redundant sequence data, and merging the cleaned annotation data and non-annotation data to obtain the standard sequencing data.
3. The method for detecting somatic mutation in targeted gene second-generation sequencing data according to claim 2, wherein the step of obtaining a reference genome and comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a corresponding comparison result comprises:
comparing the standard sequencing data with the reference genome according to the unique molecular marker to obtain a first comparison result, and attaching the sequence annotation information to the first comparison result to obtain a second comparison result;
and acquiring coordinate information of the reference genome, and sorting the second comparison result according to the coordinate information to obtain the comparison result.
4. The method for detecting somatic mutation in targeted gene second-generation sequencing data according to claim 2, wherein the step of performing de-duplication processing on the standard sequencing data according to the comparison result of the unique molecular marker to obtain a consensus sequence specifically comprises:
obtaining a first start and end position of the non-annotation data on the reference genome, and marking a first repeated sequence according to the first start and end position;
acquiring a second start and end position of the annotation data on the reference genome, acquiring comparison sequence annotation information on the reference genome according to the second start and end position, and comparing the comparison sequence annotation information with the corresponding sequence annotation information;
if the comparison is inconsistent, marking the annotation data as a second repeated sequence, and removing the first repeated sequence and the second repeated sequence;
and if the comparison is consistent, taking the sequences corresponding to the annotation data and the reference genome as a data set to be processed, determining the unique molecular marker type of each data set to be processed, and performing consensus processing according to the unique molecular marker type to obtain the consensus sequence.
5. The method for detecting somatic mutation in targeted gene second-generation sequencing data according to claim 1, wherein, in performing mutation detection according to the consensus sequence to obtain a corresponding detection result, the single nucleotide mutation and short insertion deletion mutation detection specifically comprises:
performing single-sample mutation detection on the consensus sequence to obtain a mutation detection original result, and performing standardization processing on the mutation detection original result to obtain a first variation standardized result;
performing basic filtering on the first variation standardized result to remove false positive variations and germline variations from the first variation standardized result, so as to obtain a first basic variation result;
screening somatic mutation data from the first basic variation result, and performing, on the somatic mutation data, strand bias filtering of mutation sites, filtering based on genome repeat regions and problem regions, low allele frequency mutation filtering, and filtering based on a self-built background noise database, to obtain a first advanced variation result;
and performing functional filtering on the first advanced variation result to obtain first somatic single nucleotide and short insertion deletion mutation data.
6. The method for detecting somatic mutation in targeted gene second-generation sequencing data according to claim 1, wherein, in performing mutation detection according to the consensus sequence to obtain a corresponding detection result, the single nucleotide mutation and short insertion deletion mutation detection specifically comprises:
obtaining a control sample sequence corresponding to the consensus sequence, performing paired mutation detection on the consensus sequence and the control sample sequence to obtain a mutation detection original result, and performing standardization processing on the mutation detection original result to obtain a second variation standardized result;
performing basic filtering on the second variation standardized result to remove false positive variations and germline variations from the second variation standardized result, so as to obtain a second basic variation result;
performing advanced filtering on the second basic variation result to obtain a second advanced variation result;
and performing functional filtering on the second advanced variation result to obtain second somatic single nucleotide and short insertion deletion mutation data.
7. The method for detecting somatic mutation of targeted gene second-generation sequencing data according to claim 1, wherein, in performing mutation detection according to the consensus sequence to obtain a corresponding detection result, the gene fusion detection specifically comprises:
extracting split sequences, inconsistent sequences, and an insert fragment length distribution from the consensus sequence;
performing structural variation detection on the split sequences, the inconsistent sequences, and the insert fragment length distribution to obtain a structural variation detection original result;
extracting structural variation breakpoint information and corresponding breakpoint supporting sequence data from the structural variation detection original result;
counting the coverage depth result of the structural variation breakpoint information, and obtaining and annotating, from the structural variation breakpoint information, the gene and its exon or intron information, to obtain a breakpoint annotation result;
performing basic filtering on the breakpoint supporting sequence data, the coverage depth result and the breakpoint annotation result to obtain candidate gene fusion sequences;
classifying the candidate gene fusion sequences according to preset gene fusion classes, and then performing hierarchical filtering to obtain a gene fusion filtering result corresponding to each gene fusion class;
and merging the gene fusion filtering results of each class to obtain a gene fusion result.
8. The method for detecting somatic mutation in targeted gene second-generation sequencing data according to any one of claims 1 to 7, wherein the method for detecting somatic mutation in targeted gene second-generation sequencing data further comprises:
acquiring analysis step nodes, judging whether the analysis step nodes have corresponding detection checkpoints, if yes, reading checkpoint information, and if no, initializing the detection checkpoints;
judging from the checkpoint information whether the analysis step has been completed; if yes, ending the analysis step and entering the next analysis step node, and if not, initializing the detection checkpoint;
and after initializing the detection check point, monitoring the analysis step node, and writing the analysis step completion message or the analysis step interrupt message into the detection check point when the analysis step completion message or the analysis step interrupt message is acquired.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for detecting somatic mutation in targeted gene second-generation sequencing data according to any one of claims 1 to 8.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor performs the steps of the targeted gene second-generation sequencing data somatic mutation detection method of any one of claims 1 to 8.
CN202310520121.8A 2023-05-10 2023-05-10 Method, terminal and medium for detecting somatic mutation of targeted gene second-generation sequencing data Active CN116312780B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310520121.8A CN116312780B (en) 2023-05-10 2023-05-10 Method, terminal and medium for detecting somatic mutation of targeted gene second-generation sequencing data

Publications (2)

Publication Number Publication Date
CN116312780A true CN116312780A (en) 2023-06-23
CN116312780B CN116312780B (en) 2023-07-25

Family

ID=86803452

Country Status (1)

Country Link
CN (1) CN116312780B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108588194A (en) * 2018-05-28 2018-09-28 北京诺禾致源科技股份有限公司 Utilize the method and device of high-flux sequence Data Detection Tumor mutations load
CN109817279A (en) * 2019-01-18 2019-05-28 臻悦生物科技江苏有限公司 Detection method, device, storage medium and the processor of Tumor mutations load
CN110060733A (en) * 2019-04-28 2019-07-26 上海宝藤生物医药科技股份有限公司 Tumour somatic variation detection device is sequenced in two generations based on single sample
CN110570904A (en) * 2019-08-27 2019-12-13 深圳百诺精准医疗科技有限公司 tumor mutation analysis method, system, terminal and readable storage medium
CN112029861A (en) * 2020-09-07 2020-12-04 臻悦生物科技江苏有限公司 Tumor mutation load detection device and method based on capture sequencing technology
CN113362889A (en) * 2021-06-25 2021-09-07 广州燃石医学检验所有限公司 Genome structure variation annotation method
US20220259646A1 (en) * 2019-03-04 2022-08-18 King Abdullah University Of Science And Technology Compositions and methods of labeling nucleic acids and sequencing and analysis thereof
CN115961034A (en) * 2022-10-24 2023-04-14 南京艾迪康医学检验所有限公司 UMI technology-based method for detecting and analyzing gene mutation of peripheral blood of lung cancer patient

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116646006A (en) * 2023-07-27 2023-08-25 华测检测认证集团股份有限公司 Tumor related gene system mutation detection method and device based on high-throughput sequencing and Gaussian mixture model
CN116646006B (en) * 2023-07-27 2023-11-14 华测检测认证集团股份有限公司 Tumor related gene system mutation detection method and device based on high-throughput sequencing and Gaussian mixture model
CN117253546A (en) * 2023-10-11 2023-12-19 北京博奥医学检验所有限公司 Method, system and storable medium for reducing targeted second-generation sequencing background noise
CN117253546B (en) * 2023-10-11 2024-05-28 北京博奥医学检验所有限公司 Method, system and storable medium for reducing targeted second-generation sequencing background noise
CN117409856A (en) * 2023-10-25 2024-01-16 北京博奥医学检验所有限公司 Mutation detection method, system and storable medium based on single sample to be detected targeted gene region second generation sequencing data
CN117409856B (en) * 2023-10-25 2024-03-29 北京博奥医学检验所有限公司 Mutation detection method, system and storable medium based on single sample to be detected targeted gene region second generation sequencing data
CN117711487A (en) * 2024-02-05 2024-03-15 广州嘉检医学检测有限公司 Identification method and system for embryo SNV and InDel variation and readable storage medium
CN117711487B (en) * 2024-02-05 2024-05-17 广州嘉检医学检测有限公司 Identification method and system for embryo SNV and InDel variation and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant