CN117577178A - Method and system for detecting precise breakpoint information of structural variation, and application thereof

Info

Publication number
CN117577178A
CN117577178A CN202410056493.4A CN202410056493A CN117577178A CN 117577178 A CN117577178 A CN 117577178A CN 202410056493 A CN202410056493 A CN 202410056493A CN 117577178 A CN117577178 A CN 117577178A
Authority
CN
China
Prior art keywords
sequence
structural variation
genome
region
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410056493.4A
Other languages
Chinese (zh)
Other versions
CN117577178B (en)
Inventor
高媛
高明
王丽娟
高选
陈子江
马金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202410056493.4A
Publication of CN117577178A
Application granted
Publication of CN117577178B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the technical field of biological detection and analysis, and in particular relates to a method and system for detecting precise breakpoint information of structural variation, and to applications thereof. Specifically, when structural variation is analyzed from third-generation sequencing data, the invention identifies structural variation in segmental duplication (fragment repetition) regions of the genome by means of specific sites, distinguishes the source of each sequence fragment, and accurately determines the breakpoints within the segmental duplication regions. The invention can improve the accuracy and sensitivity of structural variation analysis, does not require error correction (polishing) of the sequence data containing the structural variation fragments, and can therefore substantially reduce the computing resources consumed. Once the source of the sequence fragments in the segmental duplication regions has been resolved, the invention can also extend the assembled fragments of the genome. The extended assembled fragments add haplotype SNP site information, enlarge the typing blocks, and support subsequent embryo haplotype typing.

Description

Method and system for detecting precise breakpoint information of structural variation, and application thereof
Technical Field
The invention belongs to the technical field of biological detection and analysis, and in particular relates to a method and system for detecting precise breakpoint information of structural variation, and to applications thereof.
Background
Current clinical studies show that about 3% of couples with recurrent spontaneous abortion (RSA) carry chromosomal structural variation, and that more than 50% of embryonic tissues from embryo arrest and spontaneous abortion carry chromosomal structural variation. Structural variation (structural variant, SV) generally refers to variation of DNA fragments 50 bp or more in length, and can be classified into deletion, duplication, insertion, inversion, translocation, and so on. When a chromosomal translocation occurs without loss of any chromosomal fragment, it is called a balanced translocation. Balanced translocations occur in about 0.2% of the general population. Studies have shown that in about 3-6% of RSA cases, one member of the couple carries a balanced chromosomal translocation. A Robertsonian translocation occurs between two acrocentric chromosomes, whose long arms fuse to form one larger chromosome. In the general population, the rate of Robertsonian translocation is about 1/1000. Couples in which one partner carries a translocation are prone to spontaneous abortion when conceiving, with a probability as high as 50%-80% and a delivery rate of generally 20%-50%. At present, the main means of addressing such genetic defects in human reproductive health is assisted reproductive technology, namely preimplantation genetic testing (PGT).
Preimplantation genetic testing means that, during in vitro fertilization (IVF), embryos cultured in vitro are genetically tested to determine whether they carry chromosomal abnormalities or familial pathogenic gene variants, and embryos with normal chromosomes or without the familial variant are selected for transfer to the uterus according to the test results. This improves the success rate of IVF, blocks the vertical transmission of familial genetic diseases, and addresses the reproductive genetic problem at its root.
Detecting carriers of structural variation is important for two reasons. On the one hand, the structural variation breakpoints can be obtained by testing the carrier couple, and embryos cultured in vitro can then be genetically diagnosed from the breakpoint information by preimplantation genetic diagnosis (PGD) to determine whether they carry the structural variation. On the other hand, SNP information upstream and downstream of the carrier's breakpoints, together with SNP information in the same region from relatives of the carrier who carry the related structural variation, can be used to construct haplotypes for the family; an integrated PGT assay then performs chromosome aneuploidy screening and haplotype typing on embryos cultured in vitro to determine whether they carry the structural variation. Both approaches require accurate structural variation breakpoint information.
Current techniques for detecting structural variation include chromosome karyotyping, chromosomal microarray, second-generation short-read sequencing, third-generation long-read sequencing, optical mapping, and others. Karyotyping can detect copy number variation (CNV) and chromosomal translocations larger than 5 Mb, and chromosomal microarrays can detect CNVs larger than 4 Mb (depending on the technical parameters). Second-generation short-read and third-generation long-read sequencing can detect all types of structural variation, such as CNV, chromosomal translocation, and regions of homozygosity (ROH), and can determine the precise breakpoints of a structural variation. Optical mapping can detect all types of structural variation but, limited by the detection technique, cannot determine breakpoints precisely (its positional accuracy is on the order of 10 kb or coarser).
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a method and system for detecting precise breakpoint information of structural variation based on third-generation sequencing data, and applications thereof. Specifically, the invention addresses the difficulty of identifying breakpoints of structural variation in segmental duplication (fragment repetition) regions, and at the same time extends the fragment length of the genome assembly and the haplotype typing blocks. The invention has been completed on the basis of these results.
To achieve this technical purpose, the invention adopts the following technical scheme:
A method for detecting precise breakpoint information of structural variation, the method comprising:
S1, aligning the sequencing data to a reference genome and building an index;
S2, analyzing the alignment results and identifying structural variation to obtain structural variation results;
S3, extracting characteristic difference sites from the results, identifying the sequence fragments that differ between segmental duplication (fragment repetition) regions, and determining the breakpoints within the segmental duplication regions;
S4, assembling the genome sequence, extending the assembly according to the analysis results for the segmental duplication regions, analyzing complex or longer structural variation, and outputting the structural variation results;
S5, typing the structural variation results according to the genome assembly results.
Wherein the sequencing data are third-generation sequencing data.
In another embodiment of the invention, the third-generation sequencing data comprise high-base-quality single-molecule real-time sequencing data (PacBio HiFi) or high-base-quality single-molecule nanopore sequencing data.
In another embodiment of the invention, the breakpoints within the segmental duplication (fragment repetition) regions are determined as follows:
S3-1, traversing the segmental duplication regions of the whole genome, screening for regions that are longer than the average sequencing read length and have a sequencing coverage depth of at least 5x, and recording them as set A (A1 … AN);
S3-2, from the screening result of step S3-1, selecting a segmental duplication region A1 and designating it R1; extracting all sequences in set A and comparing their similarity with R1; selecting the segmental duplication region sequences (R1 … RN) with a similarity greater than 90% as one set S1; aligning all sequences in set S1 and extracting the differing bases as characteristic difference sites P1, P2, P3 … PN; and recording the relation between every characteristic difference site P and the corresponding segmental duplication region sequence (R1 … RN), denoted PiRj (i is the index of P, j is the index of R);
S3-3, repeating step S3-2 until the segmental duplication regions of the whole genome have all been grouped into sets;
S3-4, extracting the sequence fragments aligned to the characteristic difference sites, and assigning each fragment to a segmental duplication region according to the site-to-region relation of the characteristic difference sites from step S3-3, to obtain the number of differential sequence fragments Dij (i is the index of P, j is the index of R);
S3-5, determining whether a breakpoint exists from the cases in which parts of a single sequence fragment are assigned to different segmental duplication regions, and from the ratio between the characteristic sites and the sequencing depth;
S3-6, repeating steps S3-2 to S3-5 for set A of step S3-1 until the breakpoints of all segmental duplication regions have been analyzed, and outputting the breakpoint information and the structural variation results.
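Steps S3-1 and S3-2 can be illustrated with a minimal Python sketch. It assumes the candidate copies have already been aligned to each other, so that every sequence has the same gapped length. The names `SdRegion`, `screen_regions`, `identity`, and `characteristic_difference_sites`, as well as the toy inputs, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SdRegion:
    """A segmental duplication region with its mean sequencing depth (hypothetical type)."""
    name: str
    chrom: str
    start: int
    end: int
    depth: float

    @property
    def length(self) -> int:
        return self.end - self.start


def screen_regions(regions, mean_read_length, min_depth=5.0):
    """S3-1: keep regions longer than the mean read length with depth >= 5x (set A)."""
    return [r for r in regions
            if r.length > mean_read_length and r.depth >= min_depth]


def identity(a, b):
    """Fraction of identical columns between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def characteristic_difference_sites(aligned):
    """S3-2: alignment columns where the copies disagree.

    `aligned` maps region name -> gapped sequence. The result maps a column
    index (site P) to {region name: base}, i.e. the PiRj relation.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("aligned sequences must have equal length")
    sites = {}
    for col in range(length):
        bases = {name: aligned[name][col] for name in names}
        if len(set(bases.values())) > 1:
            sites[col] = bases
    return sites
```

In practice the 90% similarity cut-off of step S3-2 would be applied with `identity` before calling `characteristic_difference_sites`; how the patent computes similarity is not specified, so the column-wise identity used here is only one possible choice.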
In another embodiment of the invention, determining whether a breakpoint exists, from the assignment of parts of a single sequence fragment to different segmental duplication regions and from the ratio of sequence counts at the characteristic sites, further comprises:
S3-5-1, when parts of a single sequence fragment are assigned to different segmental duplication regions, filtering out sequence fragments with low mapping quality (MAPQ < 5, where MAPQ is the Mapping Quality and represents the quality of the alignment);
S3-5-2, for characteristic difference site P1, defining RR = D11/D12 as the ratio of the sequence count D11 to the sequence count D12. When the alignment is free of errors, RR is approximately 1 (allowed range 0.8 ≤ RR ≤ 1.2), so when 0.8 ≤ RR ≤ 1.2 it is determined that no structural variation has occurred. When the P1 site of region R1 is mutated, misalignment to the P1 position of region R2 increases, so when 0.5 ≤ RR < 0.8 it is determined that a mutation has occurred at characteristic difference site P1. If several P sites are mutated (the gene fusion case), most sequences from the fusion gene are aligned to the similar region R2 because of sequence similarity, producing more misalignments (the D12 value increases). Because the two homologous chromosomes together carry two copies of each gene, the normal copy number ratio is 2:2 (an RR value of 1:1); when gene fusion occurs, the copy number ratio becomes approximately 1:3 (an RR value of 1/3), but not all fusion gene sequences are aligned to region R2, so the RR value in this case lies in the range 1/3 ≤ RR < 1/2. When 1/3 ≤ RR < 1/2, it is determined that gene conversion or gene fusion has occurred between the R1 sequence and the R2 sequence, and the characteristic difference site at which this occurs is recorded as Px1;
S3-5-3, after the ratio of D11 to D12 has been evaluated, when set S1 contains more than two sequences, comparing the sequences pairwise until the ratios for all sequences in set S1 have been evaluated;
S3-5-4, after the ratios for all sequences in set S1 have been evaluated, taking the P sites recorded in the RR ≈ 1/3 case (1/3 ≤ RR < 1/2); when two or more such sites (Px1, Px2 … PxN) are present, recording the chromosome region they span as the region in which the fusion gene arose, and recording the chromosome positions of Px1 and PxN as the breakpoints of the fusion gene.
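The thresholds of steps S3-5-1 to S3-5-4 can be written down as a small decision function. The following Python sketch is an illustration only. The function names and the tuple layouts are hypothetical, an RR of exactly 0.8 is treated as normal, ratios outside the stated ranges are returned as "unclassified", and "two or more" flagged sites is used as the reading of the site-count condition in S3-5-4.

```python
def filter_low_mapq(reads, min_mapq=5):
    """S3-5-1: drop reads whose mapping quality is below 5.

    `reads` is a list of (read_name, mapq) tuples (hypothetical layout).
    """
    return [(name, mapq) for name, mapq in reads if mapq >= min_mapq]


def classify_rr(d11, d12):
    """S3-5-2: classify a characteristic difference site from RR = D11/D12."""
    if d12 == 0:
        raise ValueError("D12 must be non-zero to form the ratio")
    rr = d11 / d12
    if 0.8 <= rr <= 1.2:
        return "no_structural_variation"
    if 0.5 <= rr < 0.8:
        return "site_mutation"
    if 1 / 3 <= rr < 0.5:
        return "conversion_or_fusion"
    return "unclassified"  # outside the ranges given in the text


def fusion_breakpoints(site_counts):
    """S3-5-4: first and last flagged site positions, or None.

    `site_counts` is a list of (position, d11, d12) tuples; at least two
    sites must fall in the 1/3 <= RR < 1/2 range (assumed reading).
    """
    flagged = sorted(pos for pos, d11, d12 in site_counts
                     if classify_rr(d11, d12) == "conversion_or_fusion")
    if len(flagged) < 2:
        return None
    return flagged[0], flagged[-1]
```

For example, sites with counts (8, 20), (9, 21), and (7, 20) at positions 250, 400, and 900 all fall in the fusion range, so the sketch reports positions 250 and 900 as the two breakpoints.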
In another embodiment of the invention, step S4 specifically comprises:
S4-1, performing de novo genome assembly on the third-generation sequencing data, that is, merging long DNA sequences that share similar sequence information into longer contiguous sequences, namely contigs, and numbering each contig;
S4-2, aligning the assembly result of step S4-1 to the reference genome to obtain the regions of the reference genome covered by the contigs;
S4-3, extending the contigs over the chromosome regions of step S4-2 according to the identified sequence fragments that differ between segmental duplication regions;
S4-4, analyzing complex or longer structural variation from the result of step S4-3 and outputting the structural variation results.
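The patent does not spell out how the extension in step S4-3 is carried out. One possible reading, sketched below in Python purely as an assumption, is that two neighbouring contigs on the same chromosome are joined when the gap between them lies entirely inside a segmental duplication region whose read origins have already been resolved in step S3. The function names and all coordinates are hypothetical.

```python
def _bridgeable(chrom, gap_start, gap_end, resolved_regions):
    """True if the gap is empty or lies inside a resolved duplication region."""
    if gap_end <= gap_start:
        return True  # contigs overlap or touch
    return any(c == chrom and s <= gap_start and gap_end <= e
               for c, s, e in resolved_regions)


def extend_contigs(contigs, resolved_regions):
    """Merge contig coverage intervals across resolved duplication regions.

    `contigs` and `resolved_regions` are lists of (chrom, start, end) tuples
    in reference coordinates, as obtained in step S4-2.
    """
    merged = []
    for chrom, start, end in sorted(contigs):
        if merged:
            prev_chrom, prev_start, prev_end = merged[-1]
            if prev_chrom == chrom and _bridgeable(chrom, prev_end, start,
                                                   resolved_regions):
                merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
                continue
        merged.append((chrom, start, end))
    return merged
```

A real assembler would join the underlying sequences rather than only their coordinate intervals; the interval view is used here only to show how longer contigs yield larger typing blocks.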
In another embodiment of the invention, in step S5, typing the structural variation results according to the genome assembly results comprises: obtaining contig sequences from the genome assembly; if a contig sequence carries the breakpoint information of a segmental duplication region, it is determined to be the structural variation type; otherwise, it is determined to be the normal type.
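As a hedged illustration of the typing rule in step S5, the Python sketch below treats a contig as "carrying the breakpoint information" when two consecutive aligned blocks of the contig end and start at the two breakpoints of a deletion. The block representation, the 50 bp tolerance, and the function names are assumptions made for this example. The breakpoint coordinates in the usage note are those reported in Example 1.

```python
def carries_junction(blocks, left, right, tolerance=50):
    """True if consecutive aligned blocks join the two breakpoints.

    `blocks` is the contig's alignment to the reference as an ordered list of
    (chrom, ref_start, ref_end); `left` and `right` are (chrom, position).
    """
    for (c1, _, e1), (c2, s2, _) in zip(blocks, blocks[1:]):
        if (c1 == left[0] and c2 == right[0]
                and abs(e1 - left[1]) <= tolerance
                and abs(s2 - right[1]) <= tolerance):
            return True
    return False


def type_contigs(contig_blocks, left, right):
    """S5: label each contig as structural variation type or normal type."""
    return {
        name: ("structural variation" if carries_junction(blocks, left, right)
               else "normal")
        for name, blocks in contig_blocks.items()
    }
```

With `left = ("chr17", 14097887)` and `right = ("chr17", 15470903)`, a contig aligned as two blocks ending at 14097887 and resuming at 15470903 is typed as the structural variation haplotype, while a contig aligned as one continuous block across the region is typed as normal.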
In another embodiment of the invention, the method for detecting precise breakpoint information of structural variation further comprises extracting a DNA sample from the target sample.
In another embodiment of the invention, the DNA sample is extracted as follows: the genomic DNA of the sample is extracted and the integrity of the DNA fragments is checked; the main peak of the DNA fragment size distribution must be above 30 kb for the sample to qualify as long-fragment DNA.
The invention also provides a system for detecting precise breakpoint information of structural variation, comprising:
an input module, configured to align the sequencing data to a reference genome and build an index;
an analysis module, configured to analyze the alignment results of the input module and identify structural variation to obtain structural variation results; to extract characteristic difference sites from the results, identify the sequence fragments that differ between segmental duplication regions, and determine the breakpoints within the segmental duplication regions; and to assemble the genome sequence, extend the assembly according to the analysis results for the segmental duplication regions, and analyze complex or longer structural variation;
an output module, configured to type the structural variation results obtained by the analysis module according to the genome assembly results.
Wherein the sequencing data are third-generation sequencing data.
Specifically, the detection system operates according to the above method for detecting precise breakpoint information of structural variation.
The present invention also provides a computer readable storage medium having stored thereon a program which, when executed by a processor, performs the steps of the method according to the present invention.
The invention also provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, which processor, when executing the program, performs the steps of the method according to the invention.
The invention also provides application of the detection method, the detection system, the computer readable storage medium or the electronic equipment in embryo haplotype typing.
The beneficial technical effects of one or more of the technical schemes are as follows:
When structural variation is analyzed from third-generation sequencing data, the technical scheme identifies structural variation in segmental duplication (fragment repetition) regions of the genome by means of specific sites, distinguishes the source of each sequence fragment, and accurately determines the breakpoints within the segmental duplication regions, thereby solving the difficulty of identifying such breakpoints. The technical scheme can improve the accuracy and sensitivity of structural variation analysis, does not require error correction (polishing) of the sequence data containing the structural variation fragments, and can substantially reduce the computing resources consumed. Once the source of the sequence fragments in the segmental duplication regions has been resolved, the invention can extend the assembled fragments of the genome and improve the accuracy of complex or longer structural variation results. The extended assembled fragments add haplotype SNP site information, enlarge the typing blocks, and support subsequent embryo haplotype typing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic flow chart of the method for detecting precise breakpoint information of structural variation based on third-generation sequencing data.
FIG. 2 is a schematic diagram of the characteristic difference sites of segmental duplication regions obtained with the method in Example 1 of the invention.
FIG. 3 is a schematic diagram of the breakpoint determination performed with the method in Example 1 of the invention.
FIG. 4 is a schematic diagram of the genome assembly extension performed with the method in Example 1 of the invention.
FIG. 5 is a diagram of the typing results obtained with the method in Example 1 of the invention.
FIG. 6 shows the breakpoint information detected by a conventional structural variation analysis method in Comparative Example 1 of the invention.
FIG. 7 is a structural diagram of the system for detecting precise breakpoint information of structural variation based on third-generation sequencing data in Example 2 of the invention.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof. It is to be understood that the scope of the invention is not limited to the specific embodiments described below; it is also to be understood that the terminology used in the examples of the invention is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the invention.
The application provides a method and system for detecting precise breakpoint information of structural variation based on third-generation sequencing data, and applications thereof. It effectively addresses the difficulty of identifying breakpoints of structural variation in segmental duplication regions, while extending the fragment length of the genome assembly and the haplotype typing blocks. FIG. 1 is a flow chart of the steps of the method.
The invention is further illustrated by the following examples, which are not to be construed as limiting the invention. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention.
Example 1
This example applies the method for detecting precise breakpoint information of structural variation to third-generation sequencing data and combines it with SNP linkage analysis of embryos.
A family of a structural variation carrier receiving assisted reproduction (with an existing structural variation test result) was studied; the family variation information is shown in Table 1. 5 mL of peripheral blood was drawn from the circular chromosome carrier and from the spouse and stored in EDTA anticoagulant blood collection tubes. Whole-genome amplification products of the couple's embryo biopsy samples were obtained at the same time. Genomic DNA was extracted from the family's peripheral blood samples following a high-molecular-weight DNA extraction method known in the field.
TABLE 1 patient family variation information
(1) Long fragment library-building sequencing of structural variation carrier samples
The male genomic DNA sample of the structural variation carrier was subjected to library construction and sequencing following the instructions of the long-fragment sequencing platform.
(2) Female sample whole genome pool sequencing
The female genomic DNA sample was subjected to library construction and sequencing following the instructions of the sequencing platform.
(3) Embryo sample low depth whole genome detection
Embryos were sequenced following the library construction and sequencing instructions of a low-depth whole-genome sequencing platform; the assay contains more than 80 thousand SNPs on average and comprehensively covers the 23 pairs of human chromosomes. Low-depth whole-genome detection was performed on the WGA products of trophectoderm biopsy cells from the patient's 4 embryos, with the specific experimental procedure following the instructions.
(4) Structural variation carrier accurate breakpoint detection
(a) Aligning the sequencing data to a reference genome and building an index;
(b) Analyzing the alignment results and identifying structural variation to obtain structural variation results;
(c) Extracting characteristic difference sites from the results, as shown in FIG. 2; identifying the sequence fragments that differ between segmental duplication regions and determining the breakpoints within the segmental duplication regions, which gave the precise breakpoint chr17:14097887, with the breakpoint at the other end at chr17:15470903, as shown in FIG. 3;
(d) Assembling the genome sequence, extending the assembly according to the analysis results for the segmental duplication regions, analyzing complex or longer structural variation, and outputting the structural variation results, as shown in FIG. 4;
(e) Haplotype typing of the structural variation results according to the genome assembly results, confirming the haplotype on which the male carries the structural variation.
(5) Embryo SNP linkage analysis.
The haplotypes were color-coded according to the typing result for the circular chromosome, and SNP linkage analysis was performed on the embryos to obtain a scatter plot of the typing result for each chromosome strand of each embryo, as shown in FIG. 5. Two chromosome SNP strands are defined for the male: the left strand is the structural variation carrying type (F1) and the right strand is the normal haplotype (F0). Of the two chromosome SNP strands of an embryo, the left strand is the chromosome inherited from the mother. When the SNP typing color of an embryo at position 17p12 matches the color of the father's structural variation carrying haplotype (for example, embryo 1), the embryo is a structural variation carrier (F1); when the SNP typing color of an embryo at position 17p12 matches the color of the mother's normal haplotype (for example, embryo 2), the embryo is normal type (M0).
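The patent does not describe the linkage computation itself. The Python sketch below shows one generic way such an inference could be made, under simple Mendelian assumptions: at SNPs where the father's two haplotypes differ and the mother is homozygous, the embryo's non-maternal allele shows which paternal haplotype (F1 or F0) it inherited. All names, the record layout, and the `min_sites` threshold are hypothetical. Allele drop-out after whole-genome amplification and recombination near the breakpoints are ignored apart from skipping inconsistent sites.

```python
def paternal_haplotype_votes(snps):
    """Count informative SNPs supporting the F1 or F0 paternal haplotype.

    Each SNP is a dict with keys 'f1' and 'f0' (father's allele on the
    carrier and normal haplotype), 'mother' and 'embryo' (two-letter
    genotypes such as 'AG').
    """
    votes = {"F1": 0, "F0": 0}
    for snp in snps:
        f1, f0 = snp["f1"], snp["f0"]
        mother, embryo = snp["mother"], snp["embryo"]
        if f1 == f0 or len(set(mother)) != 1:
            continue  # not informative for the paternal haplotype
        alleles = list(embryo)
        if mother[0] not in alleles:
            continue  # inconsistent with inheritance, e.g. allele drop-out
        alleles.remove(mother[0])  # what remains is the paternal allele
        if alleles[0] == f1:
            votes["F1"] += 1
        elif alleles[0] == f0:
            votes["F0"] += 1
    return votes


def call_embryo(votes, min_sites=3):
    """Majority call; 'no call' when evidence is too thin or tied."""
    f1, f0 = votes["F1"], votes["F0"]
    if f1 + f0 < min_sites or f1 == f0:
        return "no call"
    return "F1 (carrier haplotype)" if f1 > f0 else "F0 (normal haplotype)"
```

A production workflow would restrict the SNPs to a window around the 17p12 breakpoints and would model genotyping error explicitly; the majority vote here is only meant to show the logic.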
Comparative example 1
When structural variation analysis is performed on third-generation sequencing data in the conventional way, a clustering algorithm is used: starting from the alignment of the raw data, structural variation breakpoints are identified by clustering the sequence fragments that show differences. For the male sample of Example 1, because the structural variation lies in a segmental duplication region, the reliability of the alignment drops sharply and a large number of mismatches appear in the latter half of the aligned sequences. As shown in FIG. 6, the software can then only estimate the approximate position of the structural variation from coverage depth when calculating the breakpoint (del(17)(p12), seq[GRCh37/hg19] (14050001-15500000)×1), so a precise breakpoint cannot be obtained.
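The difference between the two results can be made concrete with a short calculation. The Python snippet below compares the precise breakpoints reported in Example 1 with the coarse interval of Comparative Example 1. It assumes both coordinate sets refer to the same reference build; the build is stated only for the coarse interval (GRCh37/hg19). The variable names are illustrative.

```python
# Precise breakpoints from Example 1 (chr17) and the coarse interval
# from Comparative Example 1; same reference build is an assumption.
precise_left, precise_right = 14097887, 15470903
coarse_left, coarse_right = 14050001, 15500000

precise_span = precise_right - precise_left    # 1373016 bp between breakpoints
coarse_span = coarse_right - coarse_left + 1   # 1450000 bp, inclusive interval
left_offset = precise_left - coarse_left       # 47886 bp
right_offset = coarse_right - precise_right    # 29097 bp

print(precise_span, coarse_span, left_offset, right_offset)
```

Under that assumption, the depth-based estimate places each end of the deletion tens of kilobases away from the breakpoint determined by the method of Example 1.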
Example 2
FIG. 7 is a structural diagram of a system for detecting precise breakpoint information of structural variation based on third-generation sequencing data according to an embodiment of the invention, comprising an input module 10, an analysis module 20, and an output module 30;
wherein the input module 10 is configured to align the sequencing data to a reference genome and build an index;
the analysis module 20 is configured to analyze the alignment results of the input module and identify structural variation to obtain structural variation results; to extract characteristic difference sites from the results, identify the sequence fragments that differ between segmental duplication regions, and determine the breakpoints within the segmental duplication regions; and to assemble the genome sequence, extend the assembly according to the analysis results for the segmental duplication regions, and analyze complex or longer structural variation;
the output module 30 is configured to type the structural variation results obtained by the analysis module according to the genome assembly results.
It should be noted that the method by which the system of Example 2 detects precise breakpoint information of structural variation has been described in the Disclosure of Invention and in Example 1.
Example 3
An electronic device includes a memory, a processor, and computer instructions stored in the memory and executable on the processor. When executed by the processor, the computer instructions carry out the operations of Examples 1 and 2; for brevity, these are not described again here.
The electronic device may be a mobile terminal or a non-mobile terminal. Non-mobile terminals include desktop computers; mobile terminals include smartphones (such as Android phones and iOS phones), smart glasses, smart watches, smart bracelets, tablet computers, notebook computers, personal digital assistants, and other mobile internet devices capable of wireless communication.
It should be appreciated that in this embodiment the processor may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of the method disclosed in connection with the present embodiment may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein. Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the embodiments disclosed herein, i.e., the algorithm steps, can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interface, device, or unit, and may be electrical, mechanical, or of another form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In summary, when analyzing structural variation in third-generation sequencing data, the invention first resolves most simple structural variations quickly. For structural variations occurring in fragment repetition regions of the genome, it identifies the differential sequence fragments between the fragment repetition regions through specific sites, determines the source of each sequence fragment, and accurately determines the breakpoints within the fragment repetition regions, thereby solving the problem that the breakpoints of such structural variations are difficult to identify.
The invention can improve the accuracy and sensitivity of structural variation analysis results. It does not require correction (polishing) of the sequence data containing the structural variation fragments, and therefore avoids the large amount of computing resources that such correction occupies.
Once the source of the sequence fragments in the fragment repetition regions has been resolved, the invention can extend the assembled segments of the genome and improve the accuracy of complex or longer structural variation results. The extended assembled segments add haplotype SNP site information, increase the size of the typing blocks, and prepare for subsequent embryo haplotype typing.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and does not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for detecting structural variation accurate fracture information, characterized by comprising the following steps:
S1, aligning the sequencing data to be analyzed to a reference genome and building an index;
S2, analyzing the alignment result and identifying structural variations to obtain a structural variation result;
S3, extracting characteristic difference sites from the result, identifying the differential sequence fragments between fragment repetition regions, and determining the breakpoints in the fragment repetition regions;
S4, assembling the genome sequence, extending the assembly according to the analysis result of the fragment repetition regions, analyzing complex or longer structural variation results, and outputting the structural variation result;
S5, typing the structural variation result according to the genome assembly result;
wherein the sequencing data is third-generation gene sequencing data.
2. The method for detecting structural variation accurate fracture information according to claim 1, wherein the third-generation gene sequencing data comprises high-base-quality single-molecule real-time sequencing data or high-base-quality single-molecule nanopore sequencing data;
the specific method for determining the breakpoints in the fragment repetition regions comprises the following steps:
S3-1, traversing the fragment repetition regions of the whole genome, screening for regions whose size is larger than the average sequencing read length and whose sequencing coverage depth is greater than or equal to 5x, and recording them as a set A, namely A1 … AN;
S3-2, selecting a fragment repetition region A1 according to the screening result of step S3-1 and setting it as R1; extracting all sequences in the set A and comparing their similarity with R1; selecting the fragment repetition region sequences R1 … RN whose similarity is greater than 90% as a set S1; aligning all sequences in the set S1 and extracting the differing bases as characteristic difference sites P1, P2, P3 … PN; and recording the relation between each characteristic difference site P and the corresponding fragment repetition region sequence R1 … RN as Dij, wherein i is the number of P and j is the number of R;
S3-3, completing the construction and merging of the sets for the fragment repetition regions of the whole genome according to step S3-2;
S3-4, extracting the sequence fragments aligned to the characteristic difference sites, and assigning each sequence fragment a fragment repetition region sequence number according to the correspondence between characteristic difference sites and sequences from step S3-3, to obtain the number of differential sequence fragments Dij, wherein i is the number of P and j is the number of R;
S3-5, determining whether a breakpoint exists according to whether partial regions of a single sequence fragment are assigned to different fragment repetition region sequence numbers and according to the ratio of the characteristic sites to the sequencing depth;
S3-6, based on steps S3-2 to S3-5, completing the breakpoint analysis of all fragment repetition regions in the set A from step S3-1, and outputting the breakpoint information and the structural variation result.
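Steps S3-1, S3-2, S3-4 and S3-5 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the similar region sequences have already been aligned to a common coordinate system of equal length, that each read is expressed in the same coordinates, and that a region is described by a small dictionary; the function names, field names and toy sequences are hypothetical.

```python
def screen_regions(regions, mean_read_length, min_depth=5):
    """S3-1 (sketch): keep regions longer than the mean read length
    and covered at a depth of at least min_depth."""
    return [r for r in regions
            if r["size"] > mean_read_length and r["depth"] >= min_depth]


def difference_sites(aligned):
    """S3-2 (sketch): columns at which the pre-aligned region sequences differ.

    aligned maps a region name (e.g. "R1") to its aligned sequence.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("region sequences must be aligned to the same length")
    length = lengths.pop()
    return [i for i in range(length)
            if len({seq[i] for seq in aligned.values()}) > 1]


def assign_sites(read, aligned, sites):
    """S3-4 (sketch): at each difference site, name the single region whose
    base matches the read, or None when no region or several regions match."""
    calls = []
    for i in sites:
        matches = [name for name, seq in aligned.items() if seq[i] == read[i]]
        calls.append((i, matches[0] if len(matches) == 1 else None))
    return calls


def candidate_breakpoint(calls):
    """S3-5 (sketch): report the first place where one read switches from
    one region to another, as (last_site, next_site, region_a, region_b)."""
    informative = [(i, region) for i, region in calls if region is not None]
    for (i1, r1), (i2, r2) in zip(informative, informative[1:]):
        if r1 != r2:
            return (i1, i2, r1, r2)
    return None


if __name__ == "__main__":
    # Toy sequences that differ at columns 1, 4 and 9.
    aligned = {"R1": "ACGTACGTACGT", "R2": "ATGTCCGTAAGT"}
    sites = difference_sites(aligned)
    # A read that follows R1 at columns 1 and 4 and R2 at column 9.
    calls = assign_sites("ACGTACGTAAGT", aligned, sites)
    print(sites, calls, candidate_breakpoint(calls))
```

In this toy case the candidate breakpoint is bracketed by columns 4 and 9, the last site supporting R1 and the first site supporting R2. A real pipeline would also apply the 90% similarity screen and the depth-ratio test before accepting it.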
3. The method for detecting structural variation accurate fracture information according to claim 2, wherein determining whether a breakpoint exists according to the assignment of partial regions of a single sequence fragment to different fragment repetition region sequence numbers and the ratio of the characteristic sites to the number of sequences specifically comprises:
S3-5-1, when partial regions of a single sequence fragment are assigned to different fragment repetition region sequence numbers, filtering out sequence fragments of low alignment quality, namely those whose MAPQ value is less than 5;
S3-5-2, for the characteristic difference site P1, setting the ratio of the number of sequences D11 to the number of sequences D12 as RR = D11/D12, the RR value approaching 1 when no alignment error occurs; when 0.8 ≤ RR ≤ 1.2, judging that no structural variation has occurred; when the P1 site of the R1 region is mutated, misalignments at the P1 position of the R2 region increase, so when 0.5 ≤ RR ≤ 0.8, judging that a mutation has occurred at the characteristic difference site P1; if a plurality of P sites are mutated, most sequences of the fusion gene are aligned to the similar region R2 because of sequence similarity, and more misalignments occur; since two gene copies exist on the two chromosomes, the copy number ratio is 2:2 under normal conditions and approximately 1:3 when gene fusion occurs, but not all fusion gene sequences are aligned to the R2 region at the same time, so the RR value in this case lies in the range 1/3 ≤ RR < 1/2; when 1/3 ≤ RR < 1/2, judging that gene conversion or gene fusion has occurred between the R1 sequence and the R2 sequence, and recording the characteristic difference site Px1 at which it occurs;
S3-5-3, after the ratio judgment of D11 and D12 is completed, when more than two sequences exist in the set S1, comparing the sequences pairwise until the ratio judgment of all sequences in the set S1 is completed;
S3-5-4, after the ratio judgment of all sequences in the set S1 is completed, according to the P site information recorded when the RR value is equal to 1/3, namely Px1, Px2 … PxN when more than two P sites are recorded, recording the chromosome region spanned by these P sites as the region in which the fusion gene occurs, and recording the chromosome positions corresponding to Px1 and PxN as the breakpoints of the fusion gene.
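The RR thresholds of step S3-5-2 and the breakpoint selection of step S3-5-4 can be sketched in Python as follows. This is an illustration under stated assumptions and not the patented implementation. The ranges in the claim meet at 0.8, so the sketch gives the "no structural variation" range precedence there. It also treats two or more flagged sites as sufficient to report a fusion region. The function names, labels and read counts are hypothetical.

```python
NORMAL = "no structural variation"
SITE_MUTATION = "mutation at the characteristic difference site"
FUSION = "gene conversion or gene fusion"
UNCLASSIFIED = "outside the ranges given in the claim"


def classify_rr(d_r1, d_r2):
    """Classify one characteristic difference site from the number of
    sequences assigned to R1 (d_r1) and to R2 (d_r2), using RR = d_r1 / d_r2."""
    if d_r2 == 0:
        raise ValueError("no sequences assigned to the second region")
    rr = d_r1 / d_r2
    if 0.8 <= rr <= 1.2:
        return rr, NORMAL
    if 0.5 <= rr < 0.8:
        return rr, SITE_MUTATION
    if 1 / 3 <= rr < 0.5:
        # Expected when the 2:2 copy ratio shifts toward 1:3 after fusion.
        return rr, FUSION
    return rr, UNCLASSIFIED


def fusion_breakpoints(site_counts):
    """Return the first and last chromosome positions of the sites flagged
    as fusion, or None when fewer than two sites are flagged.

    site_counts is an iterable of (position, d_r1, d_r2) tuples.
    """
    flagged = sorted(pos for pos, d_r1, d_r2 in site_counts
                     if classify_rr(d_r1, d_r2)[1] == FUSION)
    if len(flagged) < 2:
        return None
    return flagged[0], flagged[-1]


if __name__ == "__main__":
    # Hypothetical read counts at four characteristic difference sites.
    counts = [(1000, 20, 21), (1500, 8, 20), (1800, 9, 22), (2400, 19, 20)]
    for pos, d_r1, d_r2 in counts:
        print(pos, classify_rr(d_r1, d_r2))
    print(fusion_breakpoints(counts))
```

With these made-up counts, the sites at positions 1500 and 1800 fall in the 1/3 to 1/2 range and are reported as the two ends of the candidate fusion region.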
4. The method for detecting structural variation accurate fracture information according to claim 1, wherein the specific method of step S4 comprises:
S4-1, performing de novo genome assembly on the third-generation gene sequencing data, namely merging a plurality of long DNA sequence fragments with similar sequence information to generate longer continuous sequences, namely contigs, and numbering each contig;
S4-2, aligning the assembly result of step S4-1 to the reference genome to obtain the coverage region information of each contig in the reference genome;
S4-3, extending the contigs of the chromosome regions in step S4-2 according to the identified differential sequence fragments between the fragment repetition regions;
S4-4, analyzing complex or longer structural variation results according to the result of step S4-3, and outputting the structural variation result.
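The extension in step S4-3 can be pictured at the level of reference coordinates with the short Python sketch below. It is a simplification under stated assumptions, not the patented procedure. It assumes each contig and each differential sequence fragment has already been reduced to a (start, end) interval on the same chromosome, and it extends a contig by absorbing every fragment interval that overlaps it. The function name and coordinates are hypothetical, and real assembly extension works on sequences rather than intervals.

```python
def extend_contig(contig, fragments):
    """Extend a contig interval with overlapping fragment intervals.

    contig is a (start, end) tuple; fragments is a list of (start, end)
    tuples for fragments already assigned to a fragment repetition region.
    Fragments are absorbed repeatedly until none overlaps the interval.
    """
    start, end = contig
    remaining = sorted(fragments)
    changed = True
    while changed:
        changed = False
        for frag in list(remaining):
            f_start, f_end = frag
            if f_start <= end and f_end >= start:
                start, end = min(start, f_start), max(end, f_end)
                remaining.remove(frag)
                changed = True
    return start, end


if __name__ == "__main__":
    # Two fragments chain onto the contig; the third is too far away.
    print(extend_contig((100, 200), [(180, 260), (250, 300), (400, 450)]))
```

The loop repeats because absorbing one fragment can bring a previously non-overlapping fragment within reach.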
5. The method for detecting structural variation accurate fracture information according to claim 1, wherein in step S5, typing the structural variation result according to the genome assembly result comprises: assembling the genome to obtain a contig sequence; if the contig sequence carries the breakpoint information of a fragment repetition region, judging that the genome carries the structural variation; and if it does not, judging that the genome is normal.
6. The method for detecting structural variation accurate fracture information according to claim 1, wherein the method further comprises extracting a DNA sample from the target sample;
the method for extracting the DNA sample from the target sample comprises: extracting the genomic DNA of the sample and performing quality inspection of the integrity of the DNA fragments, wherein the main peak of the DNA fragment size distribution is required to be greater than 30 kb, such DNA being the long-fragment DNA molecules that meet the requirement.
7. A system for detecting structural variation accurate fracture information, the system comprising:
an input module, configured to align the sequencing data to be analyzed to a reference genome and to build an index;
an analysis module, configured to analyze the alignment result of the input module and identify structural variations to obtain a structural variation result; to extract characteristic difference sites from the result, identify the differential sequence fragments between fragment repetition regions, and determine the breakpoints in the fragment repetition regions; and to assemble the genome sequence, extend the assembly according to the analysis result of the fragment repetition regions, and analyze complex or longer structural variation results;
and an output module, configured to type the structural variation result obtained by the analysis module according to the genome assembly result;
wherein the sequencing data is third generation gene sequencing data;
the detection system operates according to the method for detecting structural variation accurate fracture information according to any one of claims 1 to 6.
8. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, performs the steps of the method for detecting structural variation accurate fracture information according to any one of claims 1 to 6.
9. An electronic device comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method for detecting structural variation accurate fracture information according to any one of claims 1 to 6.
10. The use of the method for detecting structural variation accurate fracture information of any one of claims 1 to 6, the system for detecting structural variation accurate fracture information of claim 7, the computer-readable storage medium of claim 8 or the electronic device of claim 9 in embryo haplotype typing.
CN202410056493.4A 2024-01-16 2024-01-16 Detection method and system for structural variation accurate fracture information and application of detection method and system Active CN117577178B (en)

Publications (2)

Publication Number Publication Date
CN117577178A true CN117577178A (en) 2024-02-20
CN117577178B CN117577178B (en) 2024-03-26

Family

ID=89884846

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant