CN109074429B - Genome variation detection method, device and terminal - Google Patents

Genome variation detection method, device and terminal

Info

Publication number
CN109074429B
CN109074429B CN201680084673.7A CN201680084673A CN109074429B CN 109074429 B CN109074429 B CN 109074429B CN 201680084673 A CN201680084673 A CN 201680084673A CN 109074429 B CN109074429 B CN 109074429B
Authority
CN
China
Prior art keywords
sequence
variation
sequencing
fragments
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201680084673.7A
Other languages
Chinese (zh)
Other versions
CN109074429A (en)
Inventor
何俊
张旸
张洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN109074429A
Application granted
Publication of CN109074429B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A genome variation detection method, device and terminal are provided. The method comprises: performing double-sequence comparison of each of a plurality of sequencing sequences of a genome against a reference sequence to obtain double-sequence comparison results (201); determining potential variation regions of the genome from the double-sequence comparison results (202); extracting sequencing sequence fragments from all sequencing sequences according to the potential variation regions (203); extracting reference sequence fragments from the reference sequence according to the potential variation regions (204); performing multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result (205); and determining a variation detection result of the genome from the multi-sequence comparison result (206). Because the reference sequence fragments and all sequencing sequence fragments are compared together, sequencing sequence fragments with the same variation type are gathered and aligned together, which improves the accuracy of the genome variation detection result.

Description

Genome variation detection method, device and terminal
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method, a device and a terminal for detecting genome variation.
Background
At the molecular level, genome variation refers to a change in the composition or order of base pairs in the genome, and mainly includes SNPs (Single Nucleotide Polymorphisms) and indels (short insertions or deletions of small fragments). As the cost of genome sequencing continues to fall, the volume of sequencing data produced by high-throughput sequencers is growing explosively, but obtaining high-quality genome variation detection results from that data remains a challenging task.
In conventional genome variation detection, a reference sequence of the genome is used as the baseline. Each of a plurality of sequencing sequences of the genome is subjected to double-sequence comparison (pairwise alignment) with the reference sequence, yielding for each sequencing sequence a double-sequence comparison result that records in detail its matches, mismatches, insertions and deletions relative to the reference sequence. The variation detection result of the genome is then determined from the double-sequence comparison results of all the sequencing sequences. Here, the reference sequence is the base sequence of the genome when no variation has occurred, and the sequencing sequence is a base sequence of the genome under test.
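To make the notion of a double-sequence comparison result concrete, the following Python sketch aligns one sequencing sequence against a reference and lists the per-position operations. It uses a Needleman-Wunsch-style global alignment with a linear gap penalty; the algorithm choice, the scoring values and the function name `pairwise_align` are assumptions for illustration, since the text does not specify which alignment algorithm is used.

```python
def pairwise_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Globally align read to ref and return the list of operations.

    Scoring values are illustrative assumptions, not taken from the application.
    """
    n, m = len(read), len(ref)
    # score[i][j] = best score for aligning read[:i] with ref[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and read[i - 1] == ref[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            ops.append("match" if read[i - 1] == ref[j - 1] else "mismatch")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ops.append("insertion")  # base present in the read only
            i -= 1
        else:
            ops.append("deletion")   # base present in the reference only
            j -= 1
    return ops[::-1]


snp_ops = pairwise_align("ACTT", "ACGT")
indel_ops = pairwise_align("ACGGT", "ACGT")
```

When several placements of an insertion score equally, the traceback picks one of them arbitrarily, which is one way the same variation can end up represented differently across sequencing sequences.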
However, in the course of implementing the present application, the applicant found that the prior art has at least the following problem: because conventional genome variation detection only performs double-sequence comparison of each sequencing sequence against the reference sequence and determines the variation detection result from those results alone, the sequencing sequences may not be aligned consistently with one another, so that a single type of variation is easily misaligned into several different types of variation, which makes the genome variation detection result inaccurate.
Disclosure of Invention
The application provides a method, a device and a terminal for detecting genome variation, which aim to solve the problem of inaccurate detection result of genome variation in the prior art.
In a first aspect, the present embodiments provide a method for detecting genomic variation, the method including: performing double-sequence comparison on a plurality of sequencing sequences of a genome with a reference sequence to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome; determining a potential variation region of the genome according to the double sequence alignment result, wherein the potential variation region is a base coding region of the genome in which the potential variation occurs; extracting sequencing sequence fragments from all sequencing sequences according to the potential variation regions; extracting reference sequence fragments from the reference sequence according to the potential variation regions; performing multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result; and determining the variation detection result of the genome according to the multi-sequence alignment result. By adopting the implementation mode, sequencing sequence fragments with the same variation type can be gathered together and aligned, the sequencing sequence alignment is accurate, and the variation belonging to one type is prevented from being wrongly compared into variations of different types, so that the accuracy of the genome variation detection result is improved.
With reference to the first aspect, in a first possible implementation manner of the first aspect, after performing multiple sequence alignment on the reference sequence fragment and all sequencing sequence fragments to obtain multiple sequence alignment results, the method further includes: determining the variation types of all sequencing sequence fragments according to the multi-sequence comparison result; according to the variation types of all sequencing sequence fragments, converging all the sequencing sequence fragments into at least one sequencing sequence cluster, wherein the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same; respectively performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster; performing double-sequence comparison on each characteristic sequence and the reference sequence fragment to obtain the variation type of each characteristic sequence; and correcting the multi-sequence comparison result according to the variation type of each characteristic sequence, wherein the variation type of each sequencing sequence fragment in the corrected multi-sequence comparison result is the same as the variation type of the characteristic sequence corresponding to each sequencing sequence fragment. By adopting the implementation mode, firstly, the characteristic sequence is corrected through double sequence comparison of the characteristic sequence and the reference sequence fragment; and then, the sequencing sequence fragment corresponding to the characteristic sequence is corrected according to the corrected characteristic sequence, so that the problem that part of the sequencing sequence fragment deviates relative to the reference sequence fragment in the multi-sequence comparison result is solved, and the accuracy of the genome variation detection result is improved.
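The clustering and union steps of this implementation can be sketched in Python as follows. This is only an illustrative sketch under several assumptions: each fragment is given as an offset in a shared coordinate frame, its bases, and a variation-type label; fragments in the same cluster agree wherever they overlap; and the union is formed by overlaying the fragments at their offsets. The names `cluster_by_variation` and `union_sequence` are hypothetical, and the subsequent correction step is not shown.

```python
from collections import defaultdict


def cluster_by_variation(fragments):
    """Group (offset, sequence, variation_type) fragments by variation type."""
    clusters = defaultdict(list)
    for offset, seq, vtype in fragments:
        clusters[vtype].append((offset, seq))
    return dict(clusters)


def union_sequence(members):
    """Overlay positioned fragments into one characteristic sequence.

    Assumes overlapping fragments carry identical bases and that together
    they cover a contiguous span; raises ValueError otherwise.
    """
    start = min(off for off, _ in members)
    end = max(off + len(seq) for off, seq in members)
    merged = [None] * (end - start)
    for off, seq in members:
        for i, base in enumerate(seq):
            pos = off - start + i
            if merged[pos] is None:
                merged[pos] = base
            elif merged[pos] != base:
                raise ValueError("fragments disagree at position %d" % (start + pos))
    if None in merged:
        raise ValueError("fragments do not cover a contiguous span")
    return start, "".join(merged)


# Two fragments share the label "ins_T"; one fragment carries no variation.
fragments = [(0, "ACGTT", "ins_T"), (2, "GTTAC", "ins_T"), (1, "CGAC", "none")]
features = {vtype: union_sequence(members)
            for vtype, members in cluster_by_variation(fragments).items()}
```

Each characteristic sequence is longer than the individual fragments it was built from, which is what makes its double-sequence comparison against the reference sequence fragment more reliable than that of any single fragment.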
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, after performing union processing on all sequencing sequence fragments in each sequencing sequence cluster respectively to obtain a feature sequence of each sequencing sequence cluster, the method further includes: performing double-sequence comparison on any two characteristic sequences in the characteristic sequences of each sequencing sequence cluster; judging whether an overlapping region of two characteristic sequences is completely matched, wherein the variation position of at least one characteristic sequence is completely positioned in the overlapping region; when the overlapping regions of the two characteristic sequences are completely matched and the variation position of at least one characteristic sequence is completely positioned in the overlapping region, merging the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and merging the two characteristic sequences to obtain the characteristic sequence of the merged sequencing sequence cluster. By adopting the implementation mode, the sequencing sequence clusters which accord with the merging condition are further merged, so that the length of the characteristic sequence is increased, and the accuracy of the double-sequence comparison result of the characteristic sequence and the reference sequence fragment is improved.
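The merging condition described above can be illustrated with a short Python sketch. As a simplification, the overlap between two characteristic sequences is derived from known start offsets in a shared coordinate frame rather than from a double-sequence comparison of the two sequences; that substitution, the dictionary layout and the names `can_merge` and `merge_features` are assumptions made for illustration.

```python
def can_merge(a, b):
    """Check the merge condition for two characteristic sequences.

    Each argument is a dict with 'start' (offset), 'seq' (bases) and
    'var' (half-open [start, end) span of its variation position).
    """
    lo = max(a["start"], b["start"])
    hi = min(a["start"] + len(a["seq"]), b["start"] + len(b["seq"]))
    if lo >= hi:
        return False  # no overlapping region at all
    a_part = a["seq"][lo - a["start"]:hi - a["start"]]
    b_part = b["seq"][lo - b["start"]:hi - b["start"]]
    overlap_matches = a_part == b_part
    # At least one variation must lie completely inside the overlap.
    variation_inside = any(lo <= f["var"][0] and f["var"][1] <= hi for f in (a, b))
    return overlap_matches and variation_inside


def merge_features(a, b):
    """Merge two mergeable characteristic sequences into one (start, seq)."""
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    consumed = first["start"] + len(first["seq"]) - second["start"]
    return first["start"], first["seq"] + second["seq"][consumed:]


a = {"start": 0, "seq": "ACGTTAC", "var": (3, 5)}
b = {"start": 2, "seq": "GTTACGG", "var": (3, 5)}
c = {"start": 6, "seq": "CGGA", "var": (8, 9)}
merged = merge_features(a, b) if can_merge(a, b) else None
```

In this example `a` and `b` overlap on `GTTAC` with the variation inside the overlap, so they merge; `a` and `c` overlap on a single matching base but neither variation lies inside that overlap, so they do not.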
With reference to the first aspect, in a third possible implementation manner of the first aspect, determining potential variation regions of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome; determining the variation types of all sequencing sequences according to the double-sequence comparison result; sequentially counting the probability distribution values of the sequencing sequences of different variation types in each coding interval; calculating the information entropy of each coding interval according to the probability distribution value; sequentially judging whether the information entropy of each coding interval is larger than a first threshold value; and when the information entropy of one coding interval is larger than the first threshold, judging the coding interval as a potential variation region. With this implementation, potential variant regions of the genome are determined by entropy of information.
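As an illustration of the information-entropy criterion, the following Python sketch computes the entropy of the variation-type distribution in each coding interval and flags the intervals that exceed the first threshold. The base-2 logarithm, the label strings, the input layout and the threshold value are assumptions chosen for the example; the text does not fix them here.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Shannon entropy (base 2, an assumed choice) of the variation-type labels in one interval."""
    total = len(variation_types)
    if total == 0:
        return 0.0
    counts = Counter(variation_types)
    # p = c / total is the probability distribution value of each variation type.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def potential_regions_by_entropy(intervals, first_threshold):
    """intervals maps an interval id to the variation-type labels of the
    sequencing sequences covering it; returns the ids above the threshold."""
    return [iid for iid, types in intervals.items()
            if interval_entropy(types) > first_threshold]


intervals = {
    0: ["match"] * 8,                                        # all reads agree
    1: ["match"] * 4 + ["insertion"] * 2 + ["deletion"] * 2,  # mixed evidence
}
flagged = potential_regions_by_entropy(intervals, first_threshold=1.0)
```

An interval in which all sequencing sequences show the same type has entropy 0, while an interval with conflicting types has higher entropy, so a high value signals a region where the double-sequence comparison results disagree.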
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, determining potential variation regions of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome; sequentially counting the number of sequencing sequences with variation in each coding interval; judging whether the number of sequencing sequences with variation in each coding interval is larger than a second threshold value; and when the number of sequencing sequences with variation in a coding interval is larger than the second threshold value, judging the coding interval as a potential variation region. With this implementation, the potential variation regions of the genome are determined by the number of sequencing sequences with variation within each coding interval.
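The counting criterion can be sketched in a few lines of Python. The sketch assumes fixed-size coding intervals and that each sequencing sequence is summarized by the reference positions at which it varies; both the input layout and the name `potential_regions_by_count` are hypothetical.

```python
def potential_regions_by_count(read_variant_positions, interval_size, second_threshold):
    """Return the indices of intervals whose count of varying reads exceeds the threshold.

    read_variant_positions holds one list of variant positions per sequencing sequence.
    """
    counts = {}
    for positions in read_variant_positions:
        # The set ensures a read is counted at most once per interval.
        for idx in {p // interval_size for p in positions}:
            counts[idx] = counts.get(idx, 0) + 1
    return sorted(idx for idx, c in counts.items() if c > second_threshold)


reads = [[12], [13, 14], [15], [31], []]
regions = potential_regions_by_count(reads, interval_size=10, second_threshold=2)
```

Here three sequencing sequences vary inside interval 1 (positions 10 to 19) and only one varies inside interval 3, so only interval 1 is reported.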
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: extracting the intersection part of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
With reference to the first aspect, in a sixth possible implementation manner of the first aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: for each sequencing sequence that intersects the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.
With reference to the first aspect, in a seventh possible implementation manner of the first aspect, extracting a reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the intersection of the reference sequence and the potential variation region as the reference sequence fragment.
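The three extraction variants above (intersection part of a sequencing sequence, the sequencing sequence itself, and intersection part of the reference sequence) can be sketched in Python as follows. The sketch assumes half-open reference coordinates and, as a simplification, that each sequencing sequence is placed on the reference without gaps; the function names and the `whole_read` flag are hypothetical.

```python
def intersect(start, end, region_start, region_end):
    """Return the half-open overlap of two intervals, or None if they are disjoint."""
    lo, hi = max(start, region_start), min(end, region_end)
    return (lo, hi) if lo < hi else None


def extract_read_fragment(read_start, read_seq, region, whole_read=False):
    """Extract a fragment from a read placed at read_start on the reference.

    whole_read=False returns only the intersection part; whole_read=True
    returns the entire read whenever it intersects the region.
    """
    overlap = intersect(read_start, read_start + len(read_seq), *region)
    if overlap is None:
        return None
    if whole_read:
        return read_seq
    lo, hi = overlap
    return read_seq[lo - read_start:hi - read_start]


def extract_reference_fragment(reference, region):
    """Return the part of the reference that lies inside the region."""
    start, end = region
    return reference[max(start, 0):min(end, len(reference))]


reference = "ACGTACGTACGT"
region = (4, 8)
ref_fragment = extract_reference_fragment(reference, region)
clipped = extract_read_fragment(2, "GTACGT", region)
full = extract_read_fragment(2, "GTACGT", region, whole_read=True)
outside = extract_read_fragment(9, "CGT", region)
```

Keeping the entire read preserves flanking bases that can help the later multi-sequence comparison, while clipping to the intersection keeps every fragment within the same span as the reference fragment.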
With reference to the first aspect, in an eighth possible implementation manner of the first aspect, determining a variation detection result of the genome according to the multiple sequence alignment result includes: determining variation positions in the potential variation region according to the multi-sequence alignment result; extracting variation information of all sequencing sequence fragments at the variation positions from the multi-sequence comparison result; converging all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation positions in the same sequencing sequence set is the same; sequentially judging whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold value; and when the number of the sequencing sequence fragments in one sequencing sequence set is greater than the third threshold value, determining that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
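The final decision step can be sketched in Python as below. The sketch assumes the multi-sequence comparison result is available as equal-length rows, with `-` marking gaps, and that every fragment row spans the whole aligned region; the row layout, the tuple used as variation information and the name `call_variants` are assumptions for illustration.

```python
from collections import defaultdict


def call_variants(reference_row, fragment_rows, third_threshold):
    """Group fragments by their variation information and keep well-supported groups.

    Returns a dict mapping variation information, a tuple of
    (column, reference base, fragment base), to the indices of the fragments
    that carry it.
    """
    # Variation positions: columns where at least one fragment differs from the reference.
    variant_cols = [c for c in range(len(reference_row))
                    if any(row[c] != reference_row[c] for row in fragment_rows)]
    groups = defaultdict(list)
    for idx, row in enumerate(fragment_rows):
        info = tuple((c, reference_row[c], row[c])
                     for c in variant_cols if row[c] != reference_row[c])
        groups[info].append(idx)
    # Drop the reference-identical group and any set at or below the threshold.
    return {info: members for info, members in groups.items()
            if info and len(members) > third_threshold}


reference_row = "ACGTAC"
fragment_rows = ["ACTTAC", "ACTTAC", "ACTTAC", "ACGTAC", "ACGTAC", "ACG-AC"]
calls = call_variants(reference_row, fragment_rows, third_threshold=2)
```

In this example the G-to-T substitution in column 2 is supported by three fragments and is reported, while the single-fragment deletion in column 3 falls below the threshold and is treated as noise.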
In a second aspect, embodiments of the present application further provide an apparatus for detecting genomic variations, the apparatus including: the first double-sequence comparison unit is used for performing double-sequence comparison on a plurality of sequencing sequences of a genome and a reference sequence respectively to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome; a potential variation region determination unit, configured to determine a potential variation region of the genome according to the double-sequence alignment result, where the potential variation region is a base coding region in which a potential variation occurs in the genome; a sequencing sequence fragment extraction unit for extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region; a reference sequence fragment extracting unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region; the multi-sequence comparison unit is used for carrying out multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result; and a variation detection result determining unit, configured to determine a variation detection result of the genome according to the multiple sequence comparison result.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the apparatus further includes: a variation type determining unit, configured to determine variation types of all sequencing sequence fragments according to the multiple sequence comparison result; the sequencing sequence cluster converging unit is used for converging all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments, wherein the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same; the union processing unit is used for respectively carrying out union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster; a second double-sequence comparison unit, configured to perform double-sequence comparison on each feature sequence and the reference sequence fragment to obtain a variation type of each feature sequence; and the correcting unit is used for correcting the multi-sequence comparison result according to the variation type of each characteristic sequence, wherein the variation type of each sequencing sequence fragment in the corrected multi-sequence comparison result is the same as the variation type of the characteristic sequence corresponding to each sequencing sequence fragment.
With reference to the second aspect, in a second possible implementation manner of the second aspect, the apparatus further includes: a third double-sequence comparison unit, configured to perform double-sequence comparison on any two obtained feature sequences in the feature sequences of each sequencing sequence cluster; an overlap region judging unit, configured to judge whether there is a complete match between the overlap regions of the two feature sequences, and a variation position of at least one feature sequence is completely within the overlap region; and a merging unit, configured to, when there is a complete match between overlapping regions of two feature sequences and a variation position of at least one feature sequence is completely within the overlapping region, merge sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and merge the two feature sequences to obtain a feature sequence of the merged sequencing sequence cluster.
With reference to the second aspect, in a third possible implementation manner of the second aspect, the potential variation region determining unit includes: a first coding region dividing unit for dividing the genome into a plurality of coding regions according to the base coding order of the genome; a variation type determining subunit, configured to determine variation types of all sequencing sequences according to the double-sequence comparison result; a probability distribution value statistic subunit, configured to sequentially count probability distribution values of sequencing sequences of different variation types in each coding interval; an information entropy calculating subunit, configured to calculate an information entropy of each of the coding sections according to the probability distribution value; a first threshold judgment subunit, configured to sequentially judge whether the information entropy of each coding interval is greater than a first threshold; a first potential variation region determining subunit, configured to determine that a coding section is a potential variation region when the information entropy of the coding section is greater than the first threshold.
With reference to the second aspect, in a fourth possible implementation manner of the second aspect, the potential variation region determining unit includes: a second coding region dividing unit for dividing the genome into a plurality of coding regions according to the base coding order of the genome; a variation quantity counting subunit, configured to count, in sequence, the number of the sequencing sequences that have undergone variation in each of the coding intervals; a second threshold judgment subunit, configured to judge whether the number of the mutated sequencing sequences in each coding interval is greater than a second threshold; and a second potential variation region determining subunit, configured to determine a coding region as a potential variation region when the number of the sequenced sequences with variation within the coding region is greater than the second threshold.
With reference to the second aspect, in a fifth possible implementation manner of the second aspect, the sequencing sequence fragment extracting unit is specifically configured to extract an intersection portion of each of the sequencing sequences and the potential variation region as the sequencing sequence fragment.
With reference to the second aspect, in a sixth possible implementation manner of the second aspect, the sequencing sequence fragment extracting unit is specifically configured to, when an intersection judging subunit judges that there is an intersection between a sequencing sequence and the potential variation region, extract the sequencing sequence as the sequencing sequence fragment.
With reference to the second aspect, in a seventh possible implementation manner of the second aspect, the reference sequence fragment extracting unit is specifically configured to extract an intersection part of the reference sequence and the potential variation region as the reference sequence fragment.
With reference to the second aspect, in an eighth possible implementation manner of the second aspect, the variation detection result determining unit includes: a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract variation information of all the sequencing sequence fragments at the variation position from the multiple sequence alignment result; a sequencing sequence set aggregation subunit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation position in the same sequencing sequence set is the same; a third threshold judgment subunit, configured to sequentially judge whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and a variation detection result determining subunit, configured to determine, when the number of the sequencing sequence fragments in one of the sequencing sequence sets is greater than the third threshold, that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
In a third aspect, an embodiment of the present application further provides a genomic variation detection terminal, including: a processor; a memory for storing instructions for execution by the processor; wherein the processor is configured to perform the steps of: performing double-sequence comparison on a plurality of sequencing sequences of a genome with a reference sequence to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome; determining a potential variation region of the genome according to the double sequence alignment result, wherein the potential variation region is a base coding region of the genome in which the potential variation occurs; extracting sequencing sequence fragments from all sequencing sequences according to the potential variation regions; extracting reference sequence fragments from the reference sequence according to the potential variation regions; performing multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result; and determining the variation detection result of the genome according to the multi-sequence alignment result.
With reference to the third aspect, in a first possible implementation manner of the third aspect, after performing multiple sequence alignment on the reference sequence fragment and all sequencing sequence fragments to obtain multiple sequence alignment results, the method further includes: determining the variation types of all sequencing sequence fragments according to the multi-sequence comparison result; according to the variation types of all sequencing sequence fragments, converging all the sequencing sequence fragments into at least one sequencing sequence cluster, wherein the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same; respectively performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster; performing double-sequence comparison on each characteristic sequence and the reference sequence fragment to obtain the variation type of each characteristic sequence; and correcting the multi-sequence comparison result according to the variation type of each characteristic sequence, wherein the variation type of each sequencing sequence fragment in the corrected multi-sequence comparison result is the same as the variation type of the characteristic sequence corresponding to each sequencing sequence fragment.
With reference to the third aspect, in a second possible implementation manner of the third aspect, after performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster, the method further includes: performing double-sequence comparison on any two characteristic sequences in the characteristic sequences of each sequencing sequence cluster; judging whether an overlapping region of two characteristic sequences is completely matched, wherein the variation position of at least one characteristic sequence is completely positioned in the overlapping region; when the overlapping regions of the two characteristic sequences are completely matched and the variation position of at least one characteristic sequence is completely positioned in the overlapping region, merging the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and merging the two characteristic sequences to obtain the characteristic sequence of the merged sequencing sequence cluster.
With reference to the third aspect, in a third possible implementation manner of the third aspect, determining potential variation regions of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome; determining the variation types of all sequencing sequences according to the double-sequence comparison result; sequentially counting the probability distribution values of the sequencing sequences of different variation types in each coding interval; calculating the information entropy of each coding interval according to the probability distribution value; sequentially judging whether the information entropy of each coding interval is larger than a first threshold value; and when the information entropy of one coding interval is larger than the first threshold, judging the coding interval as a potential variation region.
With reference to the third aspect, in a fourth possible implementation manner of the third aspect, determining potential variation regions of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome; sequentially counting the number of sequencing sequences with variation in each coding interval; judging whether the number of sequencing sequences with variation in each coding interval is larger than a second threshold value; and when the number of sequencing sequences with variation in a coding interval is larger than the second threshold value, judging the coding interval as a potential variation region.
With reference to the third aspect, in a fifth possible implementation manner of the third aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: extracting the intersection part of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
With reference to the third aspect, in a sixth possible implementation manner of the third aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: for each sequencing sequence that intersects the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.
With reference to the third aspect, in a seventh possible implementation manner of the third aspect, extracting a reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the intersection of the reference sequence and the potential variation region as the reference sequence fragment.
With reference to the third aspect, in an eighth possible implementation manner of the third aspect, determining a variation detection result of the genome according to the multiple sequence alignment result includes: determining variation positions in the potential variation region according to the multi-sequence alignment result; extracting variation information of all sequencing sequence fragments at the variation positions from the multi-sequence comparison result; converging all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation positions in the same sequencing sequence set is the same; sequentially judging whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold value; and when the number of the sequencing sequence fragments in one sequencing sequence set is greater than the third threshold value, determining that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
In a fourth aspect, embodiments of the present application further provide a storage medium that stores a program; when executed, the program may perform some or all of the steps in the embodiments of the genomic variation detection method provided in the present application.
By adopting the genome variation detection method, the device, the terminal and the like provided by the embodiment of the application, multi-sequence comparison is carried out on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result; and determining the variation detection result of the genome according to the multiple sequence comparison result. Because the multiple sequence comparison tends to preferentially gather and align the sequences with higher similarity, the reference sequence fragments and all sequencing sequence fragments are put together for multiple sequence comparison, the sequencing sequence fragments with the same variation type can be gathered and aligned, the sequencing sequence fragments are aligned more accurately, the variation belonging to one type is prevented from being erroneously aligned into the variation of different types, and the accuracy of the genome variation detection result is improved.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
FIG. 1A is a diagram illustrating a two-sequence alignment of a sequencing sequence with a reference sequence according to an embodiment of the present disclosure;
FIG. 1B is a schematic diagram of the alignment of the sequenced sequences in FIG. 1A after alignment correction according to the embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for detecting genomic variation according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the partition of the coding region of a genome provided in the embodiments of the present application;
FIG. 4A is a schematic diagram of a sequencing sequence fragment and a reference sequence fragment extraction process provided in the examples of the present application;
FIG. 4B is a schematic diagram of another extraction process of sequencing sequence fragments and reference sequence fragments provided in the examples of the present application;
FIG. 5A is a diagram illustrating a multiple sequence alignment of a sequencing sequence fragment and a reference sequence fragment according to an embodiment of the present disclosure;
FIG. 5B is a diagram illustrating a multiple sequence alignment of another sequenced sequence fragment and a reference sequence fragment as provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of another method for detecting genomic variation according to the present disclosure;
FIG. 7 is a diagram illustrating a convergent result of a sequencing sequence cluster provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a union processing procedure provided in an embodiment of the present application;
FIG. 9A is a schematic diagram of a signature sequence obtained by performing union processing on the sequencing sequence clusters in FIG. 7 according to the embodiment of the present application;
FIG. 9B is a diagram illustrating the result of the double-sequence alignment performed by the embodiment of the present application on the signature sequence of FIG. 9A and the reference sequence fragment;
FIG. 9C is a schematic diagram illustrating a corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation type of the signature sequence in FIG. 9B according to the embodiment of the present application;
FIG. 10A is a diagram illustrating a multiple sequence alignment of a reference sequence fragment with another sequenced sequence fragment provided in an embodiment of the present application;
FIG. 10B is a schematic diagram of a characteristic sequence obtained by performing union processing on the sequencing sequence cluster in FIG. 10A according to the embodiment of the present application;
FIG. 11 is a schematic flow chart of another method for detecting genomic variation according to the present disclosure;
FIG. 12A is a diagram illustrating a merging process for merging the signature sequences in FIG. 10B according to an embodiment of the present application;
FIG. 12B is a diagram illustrating a merging process for merging the sequenced sequence clusters in FIG. 10A according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a first genomic variation detection apparatus according to an embodiment of the present disclosure;
FIG. 14 is a schematic structural view of a second genomic variation detection apparatus according to an embodiment of the present disclosure;
FIG. 15 is a schematic structural view of a third genomic variation detection apparatus according to the present embodiment;
FIG. 16 is a schematic structural diagram of a genomic variation detection terminal according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the embodiments of the present application better understood, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 1A and FIG. 1B, FIG. 1A is a schematic diagram of the double-sequence alignment state of sequencing sequences and a reference sequence provided in an embodiment of the present application, and FIG. 1B is a schematic diagram of the alignment state after the alignment of the sequencing sequences in FIG. 1A has been corrected. In FIG. 1A and FIG. 1B, the identical base sequences at the top and bottom represent the reference sequence, the dotted lines represent the sequencing sequences, and mismatches and deletions of a sequencing sequence relative to the reference sequence are represented by base letters and dots in the sequencing sequence, respectively.
Comparing FIG. 1A with FIG. 1B: in FIG. 1A, some sequencing sequences have both G->A (a mismatch from base G to base A) and A->G (a mismatch from base A to base G), while other sequencing sequences have a deletion of TTTG (deletion of the base fragment TTTG). In FIG. 1B, the alignment of the sequencing sequences has been corrected, and the sequences with both G->A and A->G have been corrected to sequences with the TTTG deletion. That is, in FIG. 1A, because the sequencing sequences are not aligned against one another, some sequencing sequences that actually carry the TTTG deletion are erroneously aligned as sequences with G->A and A->G. One type of variation is thus erroneously aligned as different types of variation, which easily makes the genome variation detection result inaccurate when the variation types of the genome are subsequently counted.
In order to improve the accuracy of the genome variation detection result, the genome variation detection method, device and terminal provided by the embodiments of the present application put the reference sequence fragment and all sequencing sequence fragments together for multi-sequence comparison. Because multi-sequence comparison tends to preferentially gather and align sequences with higher similarity, sequencing sequence fragments with the same variation type are gathered and aligned together, so that the sequencing sequence fragments are aligned more accurately. This avoids variations belonging to one type being erroneously aligned as variations of different types and improves the accuracy of the genome variation detection result.
Referring to fig. 2, a schematic flow chart of a method for detecting genomic variation provided in the embodiments of the present application is shown, the method including the following steps:
Step 201: Performing double-sequence comparison on each of the multiple sequencing sequences of the genome and the reference sequence to obtain double-sequence comparison results.
In the embodiments of the present application, the reference sequence is the base sequence of the genome when no variation has occurred and represents the correct order of bases in the genome, while a sequencing sequence is a base sequence of the genome to be detected. The variation of a sequencing sequence can therefore be judged against the reference sequence: when the base order of the sequencing sequence is consistent with that of the reference sequence, the sequencing sequence has no variation; when it is inconsistent, the sequencing sequence has a variation. Variations of a sequencing sequence mainly include mismatch, insertion and deletion of bases.
Generally, a sequencing sequence is a short sequence fragment. The more sequencing sequences there are, the more raw data is available for genome variation detection, and the more accurate the result is when the variation detection result is statistically analyzed in the subsequent steps. By performing double-sequence comparison of each of the multiple sequencing sequences of the genome with the reference sequence, each sequencing sequence can be positioned at the corresponding position of the reference sequence, and detailed variation information of each sequencing sequence relative to the reference sequence is obtained, including matches, mismatches, insertions and deletions.
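As an illustration of the variation information described above, the following minimal Python sketch reads matches, mismatches, insertions and deletions from a pair of sequences that have already been aligned. It does not perform the double-sequence comparison itself. The function name `variation_info`, the use of `-` as a gap character, and the convention that an insertion is reported at the coordinate of the next reference base are assumptions made for this example and are not taken from the application.

```python
def variation_info(aligned_ref, aligned_read, ref_start):
    """List (coordinate, kind, detail) events for one aligned read.

    aligned_ref and aligned_read are equal-length strings in which '-'
    marks a gap (an assumed convention). ref_start is the coordinate of
    the first reference base in the alignment.
    """
    events = []
    pos = ref_start
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == "-":
            # Extra base in the read: insertion before reference position pos.
            events.append((pos, "insertion", read_base))
        elif read_base == "-":
            # Reference base missing from the read: deletion.
            events.append((pos, "deletion", ref_base))
            pos += 1
        elif ref_base != read_base:
            events.append((pos, "mismatch", ref_base + ">" + read_base))
            pos += 1
        else:
            pos += 1  # match, nothing to record
    return events


# A read that lacks the base fragment TTTG relative to the reference.
print(variation_info("ACGTTTGACC", "ACG----ACC", ref_start=100))
```

A read that matches the reference perfectly yields an empty list, which corresponds to "no variation".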
Step 202: Determining potential variation regions of the genome according to the double-sequence alignment results.
In the field of genome detection technology, in order to locate bases in a genome, each base in the genome is assigned with a code, and then a single code represents one base pair in the genome, and consecutive code intervals represent a base segment in the genome.
In the embodiment of the application, firstly, the genome is divided into a plurality of coding intervals according to the coding sequence of the genome, and then whether each coding interval is a potential variation region is sequentially judged, so that the preliminary screening of the variation position of the genome is realized, and the detection efficiency is improved.
In an optional embodiment of the present application, the genome is divided into continuous coding intervals with equal length, and whether each coding interval is a potential variation region is sequentially determined according to the arrangement order of the coding intervals until the whole genome is traversed, so as to avoid omission of detection regions. The length of the coding region can be adjusted according to actual needs, for example, any length in the range of 50-300bp (bp represents base pair) can be selected, which is not limited in this application.
Referring to FIG. 3, which is a schematic diagram of the division of the coding intervals of the genome provided in an embodiment of the present application: since the reference sequence is the base sequence of the genome when no variation has occurred, the coding intervals of the base pairs in the reference sequence are the coding intervals of the genome, and this embodiment can therefore be described by representing the coding intervals of the genome with those of the reference sequence. As shown in FIG. 3, the genome is divided along its coding order into coding intervals with a length of 100 bp, forming in sequence a first coding interval (1510531, 1510630), a second coding interval (1510631, 1510730), a third coding interval (1510731, 1510830), a fourth coding interval (1510831, 1510930), and so on.
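The equal-length division described above can be sketched as follows. This is only an illustration: the function name `split_into_intervals` is hypothetical, and the intervals are treated as inclusive at both ends, which is inferred from the 100 bp intervals listed for FIG. 3.

```python
def split_into_intervals(start, end, length=100):
    """Split the inclusive range [start, end] into consecutive inclusive
    intervals of `length` positions; the last one may be shorter."""
    intervals = []
    pos = start
    while pos <= end:
        intervals.append((pos, min(pos + length - 1, end)))
        pos += length
    return intervals


for interval in split_into_intervals(1510531, 1510930, length=100):
    print(interval)
```

With these inputs the sketch reproduces the four coding intervals given for FIG. 3.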
After the coding regions are divided, whether each coding region is a potential variation region is sequentially judged, and the potential variation regions of the genome are screened out from all the coding regions. It should be noted that the number of potential variation regions in a genome may be one or more than one, and the present application is not limited thereto.
There are various methods for determining whether a coding interval is a potential variation region. For example, information entropy reflects the degree of disorder of the sequences: the larger the information entropy, the more disordered the sequences are and the more likely it is that variation has occurred in the sequencing sequences. Thus, in one possible implementation of the present application, a potential variation region can be determined by the information entropy. As another example, the greater the number of sequencing sequences with variation within a coding interval, the more likely the coding interval is a potential variation region. Thus, in another possible implementation of the present application, a potential variation region can be determined by the number of sequencing sequences with variation within the coding interval.
The method for determining the potential variation region through the information entropy specifically comprises the following steps:
First, the variation types of all sequencing sequences are determined according to the double-sequence comparison results. Because the double-sequence comparison result of a sequencing sequence and the reference sequence includes detailed information such as matches, mismatches, insertions and deletions of the sequencing sequence relative to the reference sequence, the variation type of the sequencing sequence can be determined directly from the double-sequence comparison result. Herein, sequencing sequences of the same variation type are sequencing sequences having completely identical variation information relative to the reference sequence, wherein the sequence of the same variation type is also one of the variation types.
After the variation types of all sequencing sequences are determined, the probability distribution values of the sequencing sequences of the different variation types are counted according to the variation type information of the sequencing sequences. Specifically, for each variation type, the ratio of the number of sequencing sequences of that variation type within the coding interval to the total number of sequencing sequences is calculated in turn, giving the probability distribution values of the sequencing sequences of the different variation types, denoted $p_i$.
If two variation types exist in the coding interval, namely a first variation type and a second variation type, the numbers of sequencing sequences corresponding to the first variation type and to the second variation type are counted respectively. The number of sequencing sequences corresponding to the first variation type is divided by the total number of sequencing sequences to obtain the probability value $p_1$ of the first variation type; the number of sequencing sequences corresponding to the second variation type is divided by the total number of sequencing sequences to obtain the probability value $p_2$ of the second variation type. Here $p_1$ and $p_2$ are the probability distribution values of the sequencing sequences of the different variation types within the coding interval.
The information entropy of the coding interval is then calculated from the probability distribution values. Specifically, the probability distribution values $p_i$ are substituted into the information entropy formula $H(U) = E[-\log p_i] = -\sum_i p_i \log p_i$ to obtain the information entropy $H(U)$ of the coding interval.
Judging whether the information entropy H (U) of the coding interval is larger than a preset first threshold value or not, and judging the coding interval as a potential variation area when the information entropy H (U) is larger than the first threshold value.
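The entropy-based screening above can be sketched in a few lines of Python. The function names are hypothetical, each sequencing sequence is represented simply by a label for its variation type, and the logarithm base (2 here) and the threshold value are assumptions, since the application does not fix them.

```python
import math
from collections import Counter


def interval_entropy(variation_types, base=2):
    """Information entropy H(U) = -sum(p_i * log p_i) of the variation
    types observed among the reads of one coding interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total, base)
                for n in counts.values())


def is_potential_variation_region(variation_types, first_threshold):
    """True when the entropy exceeds the preset first threshold."""
    return interval_entropy(variation_types, base=2) > first_threshold


# Two variation types with 30 and 10 reads: p_1 = 0.75, p_2 = 0.25.
reads = ["type 1"] * 30 + ["type 2"] * 10
print(interval_entropy(reads))
print(is_potential_variation_region(reads, first_threshold=0.5))
```

An interval in which every read has the same variation type has entropy zero, so it never exceeds a positive threshold.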
In addition, the method for determining the potential variation region through the number of the sequencing sequences with variation in the coding region specifically comprises the following steps:
First, the number of sequencing sequences with variation within the coding interval is counted. A sequencing sequence is regarded as having variation, that is, as a sequencing sequence with a mismatch, insertion or deletion, whenever it cannot be perfectly matched to the reference sequence.
And judging whether the number of the mutated sequencing sequences is greater than a second threshold or not according to the statistical result, and judging that the coding region is a potential mutation region when the number of the mutated sequencing sequences is greater than the second threshold.
For example, in one possible implementation of the present application, if the second threshold is set to 50, the coding interval is determined to be a potential variation region when the number of sequencing sequences with variation within the coding interval is greater than 50; otherwise, the coding interval is determined not to be a potential variation region. Those skilled in the art may adjust the size of the second threshold according to actual needs, which is not limited in this application.
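The count-based screening can be sketched as below. The function name is hypothetical, and each sequencing sequence is reduced to a Boolean flag indicating whether it has any mismatch, insertion or deletion relative to the reference sequence; how that flag is derived is left to the double-sequence comparison.

```python
def is_potential_region_by_count(read_has_variation, second_threshold=50):
    """True when the number of reads with variation in the coding
    interval is strictly greater than the second threshold."""
    varied = sum(1 for flag in read_has_variation if flag)
    return varied > second_threshold


# 51 varied reads out of 80 exceed a threshold of 50; exactly 50 do not.
print(is_potential_region_by_count([True] * 51 + [False] * 29))
print(is_potential_region_by_count([True] * 50 + [False] * 30))
```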
Step 203: Extracting sequencing sequence fragments from all sequencing sequences according to the potential variation regions.
After the potential variation region is determined, the sequencing sequence fragment in the potential variation region needs to be extracted from the sequencing sequence for analysis and processing in subsequent steps.
In the embodiment of the present application, to facilitate description of an extraction process of a sequencing sequence fragment, a coding region corresponding to an intersection of a sequencing sequence and a reference sequence in a double-sequence alignment result of the sequencing sequence and the reference sequence is used as a coding region of the sequencing sequence.
In one possible implementation of the present application, an intersection portion of each sequencing sequence and the potential variation region is extracted as a sequencing sequence fragment. For example, when the coding region of the sequenced sequence is completely within the coding region of the potential variation region, the sequenced sequence is taken as a sequenced sequence fragment; when the coding region of the potential variation region intersects with the coding region of the sequencing sequence, extracting the intersection part of the sequencing sequence and the potential variation region as a sequencing sequence fragment; when there is no intersection between the coding region of the potential variation region and the coding region of the sequencing sequence, the sequencing sequence is discarded.
Referring to FIG. 4A, which is a schematic diagram of an extraction process of sequencing sequence fragments and a reference sequence fragment provided in an embodiment of the present application, the extraction process of sequencing sequence fragments is illustrated with three different kinds of sequencing sequences. The coding interval of the potential variation region is (1510531, 1510630), the coding interval of the first sequencing sequence is (1510541, 1510590), the coding interval of the second sequencing sequence is (1510521, 1510570), and the coding interval of the third sequencing sequence is (1510651, 1510700).
For the first sequencing sequence, its coding interval (1510541, 1510590) lies completely within the coding interval (1510531, 1510630) of the potential variation region, so the whole first sequencing sequence is extracted as a sequencing sequence fragment. For the second sequencing sequence, its coding interval (1510521, 1510570) partially intersects the coding interval (1510531, 1510630) of the potential variation region; the coding interval of the intersection is (1510531, 1510570), so the part of the second sequencing sequence with coding interval (1510531, 1510570) is extracted as a sequencing sequence fragment. For the third sequencing sequence, its coding interval (1510651, 1510700) does not intersect the coding interval (1510531, 1510630) of the potential variation region, so the third sequencing sequence is discarded. The extracted sequencing sequence fragments are therefore the whole first sequencing sequence and the part of the second sequencing sequence with coding interval (1510531, 1510570).
As can be seen from the above examples, when there is a partial intersection between the coding region of the sequencing sequence and the coding region of the potential variation region, the sequencing sequence is interrupted during the extraction of the sequencing sequence fragment, and the intersection between the sequencing sequence and the potential variation region is extracted as the sequencing sequence fragment. The sequencing sequence is broken, so that the integrity of the sequencing sequence is lost, partial information of the sequencing sequence is lost, and the accuracy of a genome variation detection result is influenced.
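The intersection-based extraction can be sketched on coding intervals alone. The function name `clip_to_region` is hypothetical and intervals are treated as inclusive. For the third sequencing sequence the sketch assumes an end coordinate of 1510700, that is, a 50-base read like the other two.

```python
def clip_to_region(read_interval, region):
    """Return the part of the read's coding interval that lies inside
    the potential variation region, or None when they do not intersect."""
    read_start, read_end = read_interval
    region_start, region_end = region
    lo = max(read_start, region_start)
    hi = min(read_end, region_end)
    return (lo, hi) if lo <= hi else None


region = (1510531, 1510630)
reads = [(1510541, 1510590), (1510521, 1510570), (1510651, 1510700)]
for read in reads:
    print(read, "->", clip_to_region(read, region))
```

The second read is cut at the region boundary, which is the interruption of the sequencing sequence described above.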
In another possible implementation manner of the present application, it is first determined whether an intersection exists between each sequencing sequence and the potential variation region; when the sequencing sequence intersects with the potential variation region, the sequencing sequence is extracted as a sequencing sequence fragment. The method is equivalent to that when the coding region of the potential variation region and the coding region of the sequencing sequence have partial intersection, the coding region of the sequencing sequence is used as a reference to expand the potential variation region, so that the sequencing sequence is prevented from being broken in the extraction process of the sequencing sequence fragment, and the integrity of the sequencing sequence is ensured.
Referring to FIG. 4B, which is a schematic diagram of another extraction process of sequencing sequence fragments and a reference sequence fragment provided in an embodiment of the present application, the extraction process in FIG. 4B is substantially similar to that in FIG. 4A. The difference concerns the second sequencing sequence: because its coding interval partially intersects the coding interval of the potential variation region, the union (1510521, 1510630) of the coding interval (1510521, 1510570) of the second sequencing sequence and the coding interval (1510531, 1510630) of the potential variation region is used as the coding interval of the expanded potential variation region, and the sequencing sequence fragment is then extracted from the second sequencing sequence according to the potential variation region (which at this point has been updated to the expanded potential variation region). Since the coding interval of the second sequencing sequence now falls completely within the coding interval of the potential variation region, the entire second sequencing sequence is extracted as a sequencing sequence fragment. That is, in this implementation, if a sequencing sequence intersects the potential variation region, the entire sequencing sequence is extracted as a sequencing sequence fragment.
It should be noted that the above-mentioned extension of the potential variation region is only a specific implementation shown in the embodiment of the present application, and those skilled in the art can make corresponding modifications according to actual needs, which should fall into the protection scope of the present application. For example, before extracting the sequenced sequence fragments, the coding region of the potential variation region may be merged with the coding regions of all sequenced sequences within the potential variation region (the sequenced sequences within the potential variation region include sequenced sequences that partially intersect with the potential variation region and sequenced sequences that completely fall within the potential variation region), and the merged result is used as the coding region of the expanded potential variation region.
In the embodiment of the application, the sequencing sequence is not interrupted in the extraction process of the sequencing sequence fragment, so that the integrity of the sequencing sequence can be ensured, and the accuracy of genome variation detection is improved.
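The whole-read extraction with region expansion can be sketched as follows. The function name is hypothetical and intervals are inclusive. The sketch expands the region with all intersecting reads at once, which corresponds to the variant in which the region is merged with the coding intervals of all sequencing sequences within it. The third read's end coordinate is again assumed to be 1510700.

```python
def extract_whole_reads(read_intervals, region):
    """Keep every read that intersects the region, uncut, and return the
    region expanded to the union of itself and the kept reads."""
    region_start, region_end = region
    kept = [(s, e) for s, e in read_intervals
            if s <= region_end and e >= region_start]
    expanded = (min([region_start] + [s for s, _ in kept]),
                max([region_end] + [e for _, e in kept]))
    return kept, expanded


reads = [(1510541, 1510590), (1510521, 1510570), (1510651, 1510700)]
kept, expanded = extract_whole_reads(reads, (1510531, 1510630))
print(kept)
print(expanded)
```

For the reads of FIG. 4B the expanded region is (1510521, 1510630), and both the first and second reads are kept in full.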
Step 204: extracting reference sequence fragments from the reference sequence according to the potential variation regions.
In the embodiments of the present application, the reference sequence fragment is extracted from the reference sequence according to the coding interval of the potential variation region. For example, in FIG. 4A, the coding interval of the potential variation region is (1510531, 1510630), so the part of the reference sequence with coding interval (1510531, 1510630) is extracted as the reference sequence fragment; in FIG. 4B, the coding interval of the expanded potential variation region is (1510521, 1510630), so the part of the reference sequence with coding interval (1510521, 1510630) is extracted as the reference sequence fragment.
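Extracting the reference sequence fragment amounts to slicing the reference sequence by coordinates. In this sketch the function name, the representation of the reference as a Python string, and the `reference_start` parameter (the coordinate of the first base in that string) are assumptions, and the short sequence is made up for illustration.

```python
def extract_reference_fragment(reference, reference_start, region):
    """Return the bases of `reference` covered by the inclusive
    coordinate interval `region`."""
    start, end = region
    return reference[start - reference_start:end - reference_start + 1]


# Toy reference whose first base has coordinate 100.
print(extract_reference_fragment("ACGTACGTAC", 100, (102, 105)))
```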
Step 205: Performing multi-sequence comparison on the reference sequence fragment and all sequencing sequence fragments to obtain a multi-sequence comparison result.
In the embodiment of the present application, the reference sequence fragment and all sequencing sequence fragments are put together for multiple sequence alignment, and the multiple sequence alignment process includes:
establishing a distance matrix: respectively calculating the distance between every two sequences (including the distance between a reference sequence fragment and any one sequencing sequence fragment and the distance between any two sequencing sequence fragments), and establishing a distance matrix between every two sequences;
constructing a clustering tree: firstly, gathering two sequences with the shortest distance in a distance matrix, then updating the distance matrix, gathering the two sequences or two types of sequences with the shortest distance in the updated distance matrix, and repeating the steps until all the sequences are gathered together to obtain a clustering tree of a reference sequence segment and a sequencing sequence segment;
aligning the sequences: according to the clustering hierarchy of the sequencing sequence and the reference sequence in the clustering tree, firstly, aligning two sequences at the innermost layer, and then, aligning all sequencing sequence fragments and reference sequence fragments.
In the process of constructing the clustering tree, sequences with high similarity are preferentially gathered according to the distance between the sequences (the distance between the sequences represents the similarity between the sequences, the smaller the distance is, the higher the similarity is), so that the reference sequence fragments and all sequencing sequence fragments are put together for multi-sequence comparison, the sequencing sequences with the same variation type can be gathered together and aligned while the variation type of the sequencing sequence fragments is obtained, the variation belonging to one type can be prevented from being wrongly compared into the variation of different types, and the accuracy of the genome variation detection result is improved.
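The first two stages described above, the distance matrix and the clustering tree, can be sketched as follows. The application does not specify the distance measure or how the distance matrix is updated after a merge, so edit distance and average linkage are assumptions here, as are the function names and the toy sequences. The final stage, aligning the sequences along the tree, is not shown.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two base strings."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, 1):
        cur = [i]
        for j, base_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (base_a != base_b)))  # substitution
        prev = cur
    return prev[-1]


def build_cluster_tree(seqs):
    """Merge the two closest clusters repeatedly until one remains.
    Leaves are indices into seqs; inner nodes are (left, right) tuples."""
    dist = [[edit_distance(a, b) for b in seqs] for a in seqs]
    clusters = [(i, [i]) for i in range(len(seqs))]  # (tree, member indices)

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


# Index 0 is the reference fragment; reads 2 and 3 lack the fragment TTTG.
seqs = ["ACGTTTGACC", "ACGTTTGACC", "ACGACC", "ACGACC", "ACGTTTGACC"]
print(build_cluster_tree(seqs))
```

In this toy input the two reads carrying the deletion are merged with each other before either is merged with the reference fragment, which is the behaviour the multi-sequence comparison relies on.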
Referring to FIG. 5A, which is a schematic diagram of a multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment provided in an embodiment of the present application, three different variation types exist among the sequencing sequence fragments. After all the sequencing sequence fragments and the first reference sequence fragment are put together for multi-sequence comparison, the sequencing sequence fragments of each of the three variation types are gathered together and aligned. In addition, because the sequencing sequence fragments within the same haplotype of a diploid or polyploid generally have the same variation type, putting the reference sequence fragment and all the sequencing sequence fragments together for multi-sequence comparison gathers together the sequencing sequence fragments belonging to the same haplotype, thereby enabling genome variation detection for diploids or polyploids.
Step 206: Determining the variation detection result of the genome according to the multi-sequence alignment result.
Because the multiple sequence alignment result has detailed variation information of the sequencing sequence fragment, including variation position of the sequencing sequence fragment and mismatch, insertion or deletion information at the variation position, the variation detection result of the genome can be determined according to the multiple sequence alignment result.
In the embodiment of the present application, firstly, the variation position in the potential variation region is determined according to the multi-sequence alignment result; then extracting variation information of all sequencing sequence fragments at the variation positions from the multi-sequence comparison result; converging all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation positions in the same sequencing sequence set is the same; sequentially judging whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold value; and when the number of the sequencing sequence fragments in one sequencing sequence set is greater than the third threshold value, determining the variation information of the sequencing sequence fragments in the sequencing sequence set as the genome variation detection result.
For example, in FIG. 5A, the variation position in the potential variation region is determined from the multi-sequence alignment result to be at code 1510581. The variation information of all sequencing sequence fragments at code 1510581 is extracted, and there are three kinds of variation information: no variation, insertion of the base segment CCT, and deletion of the base segment CCT. According to the variation information, all sequencing sequence fragments are gathered into three sequencing sequence sets, namely a first sequencing sequence set (variation information: no variation; 11 sequencing sequence fragments), a second sequencing sequence set (variation information: insertion of the base segment CCT; 7 sequencing sequence fragments) and a third sequencing sequence set (variation information: deletion of the base segment CCT; 8 sequencing sequence fragments). It is then judged in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold.
If the third threshold is 6, the number of sequencing sequence fragments in each of the three sequencing sequence sets is greater than the third threshold, so the variation detection result of the genome at code 1510581 is: no variation; insertion of the base segment CCT; and deletion of the base segment CCT. This can also indicate that the variation detection results of the three haplotypes of a triploid at code 1510581 are, respectively: no variation; insertion of the base segment CCT; and deletion of the base segment CCT.
If the third threshold is 10, the number of sequenced sequence fragments in only the first sequenced sequence set in the above three sequenced sequence sets is greater than the third threshold, so as to obtain the variation detection result of the genome at code 1510581 as follows: no variation was present.
It should be noted that the size of the third threshold is only an exemplary illustration in the embodiment of the present application, and those skilled in the art can adjust the size of the third threshold accordingly according to actual needs, which all fall within the protection scope of the present application.
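The grouping and thresholding in the FIG. 5A example can be sketched as follows. The function name is hypothetical and each sequencing sequence fragment is represented only by a label for its variation information at the variation position; the fragment counts (11, 7 and 8) and the thresholds (6 and 10) are those of the example above.

```python
from collections import Counter


def call_variants(variation_info_per_fragment, third_threshold):
    """Group fragments by their variation information at one variation
    position and keep each kind supported by more than third_threshold
    fragments."""
    counts = Counter(variation_info_per_fragment)
    return [info for info, n in counts.items() if n > third_threshold]


fragments = (["no variation"] * 11
             + ["CCT insertion"] * 7
             + ["CCT deletion"] * 8)
print(call_variants(fragments, third_threshold=6))
print(call_variants(fragments, third_threshold=10))
```

With a threshold of 6 all three kinds of variation information are reported; with a threshold of 10 only "no variation" remains.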
It can be seen from the above embodiments that, by performing multiple sequence alignment of the reference sequence fragments and all sequencing sequence fragments, sequencing sequence fragments having the same variation type can be aligned together, the sequencing sequence fragments are aligned more accurately, and the variation belonging to one type is prevented from being erroneously aligned into variations of different types, thereby improving the accuracy of the genomic variation detection result.
However, the method used to construct the clustering tree during multiple sequence alignment has some defects, so the multiple sequence alignment result may contain sequencing sequence fragments that are shifted as a whole relative to the reference sequence fragment.
Referring to fig. 5B, which is another schematic diagram of the multiple sequence alignment state of the sequencing sequence fragments and the reference sequence fragment provided in the embodiments of the present application: in the multiple sequence alignment result of the first reference sequence fragment and the sequencing sequence fragments, the sequencing sequence fragments having the same variation type have been gathered together and aligned, but some of them are shifted as a whole relative to the reference sequence fragment. Such a shift can change the variation type of a sequencing sequence fragment relative to the reference sequence fragment, and thus affects the accuracy of genome variation detection. Therefore, after the multiple sequence alignment of the sequencing sequence fragments and the reference sequence fragment, the variation types of the sequencing sequence fragments relative to the reference sequence fragment need to be corrected.
Referring to fig. 6, which is a schematic flow chart of another genomic variation detection method provided in the embodiments of the present application, on the basis of the embodiment shown in fig. 2, the method may further include the following steps after step 205:
Step 601: Determine the variation types of all sequencing sequence fragments according to the multiple sequence alignment result.
In the embodiment of the present application, after the reference sequence fragment and all the sequencing sequence fragments are put together for multi-sequence alignment, the sequencing sequence fragments with the same variation type in the sequencing sequence fragments can be gathered together and aligned, and the variation types of all the sequencing sequence fragments relative to the reference sequence fragment can be obtained, as shown in fig. 5A and 5B. Since in FIG. 5B, the partially sequenced fragments are shifted as a whole from the reference fragments, it is necessary to correct the variation type of the sequenced fragments that are shifted in FIG. 5B in order to obtain the multiple sequence alignment shown in FIG. 5A.
Step 602: According to the variation types of all the sequencing sequence fragments, aggregate all the sequencing sequence fragments into at least one sequencing sequence cluster.
In the embodiment of the present application, all sequencing sequence fragments are classified according to their variation types, and the sequencing sequence fragments with the same variation type are gathered into the same sequencing sequence cluster, so as to correct their variation types.
Referring to fig. 7, which is a schematic diagram of an aggregation result of sequencing sequence clusters provided in the embodiments of the present application, all sequencing sequence fragments in the multiple sequence alignment result shown in fig. 5B are aggregated into three sequencing sequence clusters according to their variation types. The sequencing sequence fragments in the first sequencing sequence cluster have no variation; those in the second sequencing sequence cluster have an insertion of base segment CCT; and those in the third sequencing sequence cluster have a deletion of base segment CGCCAG and a base mismatch.
Step 603: Perform union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster.
Because the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type relative to the reference sequence fragments, the overlapping coding regions of any two sequencing sequence fragments in the same sequencing sequence cluster have the same base sequence, and all the sequencing sequence fragments in the sequencing sequence cluster are subjected to union treatment to merge the overlapping coding regions among the sequencing sequence fragments, so that the characteristic sequence of the sequencing sequence cluster is obtained. The union processing procedure is exemplarily described below with reference to the drawings.
Referring to fig. 8, which is a schematic diagram of a union processing procedure provided in the embodiments of the present application, fig. 8 includes two sequencing sequence fragments: the coding region of the first sequencing sequence fragment is (1, 15) and the coding region of the second sequencing sequence fragment is (4, 18). The two sequencing sequence fragments have the same base sequence TCCCCTCCTCCT within their overlapping coding region (4, 15). The overlapping coding regions of the two sequencing sequence fragments are merged, and the parts of the sequencing sequence fragments outside the overlap are used as the head and the tail of the signature sequence, respectively, which gives the signature sequence GACTCCCCTCCTCCTCCT with coding region (1, 18).
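The union processing of fig. 8 can be sketched as follows. This is a minimal illustration, assuming each fragment is given as a 1-based start position plus its bases in a shared coordinate system, that the fragments contain no gaps, and that together they cover a contiguous range. The function name `union_fragments` is hypothetical, and the two fragment strings are reconstructed from the signature sequence and coding regions stated above.

```python
def union_fragments(fragments):
    """Merge fragments given as (start, bases) pairs with 1-based positions.
    Positions covered by several fragments must carry the same base."""
    merged = {}
    for start, bases in fragments:
        for offset, base in enumerate(bases):
            position = start + offset
            # setdefault stores the base the first time a position is seen.
            if merged.setdefault(position, base) != base:
                raise ValueError(f"fragments disagree at position {position}")
    positions = sorted(merged)
    region = (positions[0], positions[-1])
    return region, "".join(merged[p] for p in positions)

first_fragment = (1, "GACTCCCCTCCTCCT")    # coding region (1, 15)
second_fragment = (4, "TCCCCTCCTCCTCCT")   # coding region (4, 18)

region, feature = union_fragments([first_fragment, second_fragment])
print(region, feature)
```

Raising an error on a disagreement reflects the premise that fragments in the same cluster have identical bases wherever they overlap.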
Referring to fig. 9A, which is a schematic diagram of the signature sequences obtained by union processing of the sequencing sequence clusters in fig. 7 according to an embodiment of the present application, all sequencing sequence fragments in the first sequencing sequence cluster, the second sequencing sequence cluster and the third sequencing sequence cluster in fig. 7 are respectively subjected to union processing to obtain the corresponding first signature sequence, second signature sequence and third signature sequence.
Step 604: Perform double sequence alignment of each characteristic sequence with the reference sequence fragment to obtain the variation type of each characteristic sequence.
Since the best alignment result of the two sequences can be obtained by the double sequence alignment, the variation type of the characteristic sequence obtained by the double sequence alignment of the characteristic sequence and the reference sequence is the variation type of the characteristic sequence under the best alignment result. Based on the above, the sequenced sequence fragment corresponding to the characteristic sequence can be corrected according to the variation type of the characteristic sequence in the subsequent steps.
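The text does not name a particular double sequence alignment algorithm. As one possible sketch, the following uses Needleman-Wunsch global alignment with a linear gap penalty; the algorithm choice, the scoring values, the function name `global_align` and the two example sequences are all assumptions made for illustration.

```python
def global_align(ref, seq, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment (linear gap penalty).
    Returns (score, aligned_ref, aligned_seq) with '-' marking gaps."""
    rows, cols = len(ref) + 1, len(seq) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if ref[i - 1] == seq[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_ref, aligned_seq = [], []
    i, j = len(ref), len(seq)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_ref.append(ref[i - 1]); aligned_seq.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1]); aligned_seq.append("-")
            i -= 1
        else:
            aligned_ref.append("-"); aligned_seq.append(seq[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_ref)), "".join(reversed(aligned_seq))

# Hypothetical reference fragment and characteristic sequence (one deleted base).
best_score, aligned_reference, aligned_feature = global_align("ACGTACGT", "ACGACGT")
print(best_score)
print(aligned_reference)
print(aligned_feature)
```

A gap in the aligned characteristic sequence corresponds to a deletion relative to the reference sequence fragment, and a gap in the aligned reference corresponds to an insertion.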
If the sequencing sequence fragments in a sequencing sequence cluster are shifted relative to the reference sequence fragment, the characteristic sequence obtained by union processing of those sequencing sequence fragments has the same shift, and double sequence alignment of the shifted characteristic sequence with the reference sequence fragment corrects the characteristic sequence. That is, if the sequencing sequence fragments are shifted, the variation type of the corresponding characteristic sequence changes after the characteristic sequence is aligned with the reference sequence fragment by double sequence alignment; if the sequencing sequence fragments are not shifted, the variation type of the characteristic sequence remains unchanged after the double sequence alignment. Therefore, in the embodiments of the present application, whether the sequencing sequence fragments corresponding to a characteristic sequence need to be corrected can be determined by comparing the variation type of the characteristic sequence before and after the double sequence alignment.
Referring to fig. 9B, a schematic diagram of a double-sequence alignment result obtained by performing double-sequence alignment on the feature sequence in fig. 9A and the reference sequence fragment in the embodiment of the present application is shown, where the first feature sequence, the second feature sequence, and the third feature sequence shown in fig. 9A and the reference sequence fragment are respectively performed with double-sequence alignment, and an obtained alignment result is shown in fig. 9B. Comparing FIGS. 9A and 9B, it can be seen that after the feature sequences are aligned with the reference sequence fragments in a double sequence, the variation types of the first and second feature sequences are not changed, and the variation type of the third feature sequence is changed. That is, the sequencing sequence fragments corresponding to the first characteristic sequence and the second characteristic sequence have the best alignment effect after multi-sequence alignment, and do not need to be corrected; the sequencing sequence fragment corresponding to the third signature sequence is shifted integrally from the reference sequence fragment, and needs to be corrected further.
Step 605: Correct the multiple sequence alignment result according to the variation type of each characteristic sequence.
In the embodiment of the present application, the variation type of the sequencing sequence fragments corresponding to a characteristic sequence is corrected based on the variation type of the characteristic sequence, that is, the multiple sequence alignment result is corrected. Specifically, when the variation type of the characteristic sequence differs from that of the corresponding sequencing sequence fragments, the variation type of the sequencing sequence fragments is adjusted to the variation type of the characteristic sequence, so that in the corrected multiple sequence alignment result each sequencing sequence fragment has the same variation type as its corresponding characteristic sequence.
For example, in fig. 9B, after the third signature sequence is aligned with the reference sequence fragment in a double sequence, the variation type of the third signature sequence is changed, which results in the variation type of the third signature sequence being different from that of the sequenced sequence fragments of the third sequencing sequence cluster, and therefore, the variation type of the sequenced sequence fragments of the third sequencing sequence cluster needs to be adjusted according to the variation type of the third signature sequence.
Referring to fig. 9C, a schematic diagram of a corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation type of the characteristic sequence in fig. 9B in the embodiment of the present application is shown, wherein the variation type of the sequenced sequence fragment of the third sequenced sequence cluster is adjusted to the variation type of the third characteristic sequence.
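The correction rule of step 605 can be sketched as follows. This is an illustrative sketch only: the function name, the fragment identifiers and the variation type labels (including "shifted" and "realigned") are hypothetical placeholders, since the actual variation types before and after correction are given only in the figures.

```python
def correct_variation_types(fragment_types, fragment_cluster, feature_types):
    """fragment_types:   fragment id -> variation type from the multiple sequence alignment
    fragment_cluster: fragment id -> id of the cluster the fragment belongs to
    feature_types:    cluster id  -> variation type of the cluster's characteristic
                      sequence after double sequence alignment with the reference.
    Returns the corrected fragment types and the ids of the fragments that changed."""
    corrected, changed = {}, []
    for fragment, old_type in fragment_types.items():
        new_type = feature_types[fragment_cluster[fragment]]
        if new_type != old_type:
            changed.append(fragment)
        corrected[fragment] = new_type
    return corrected, changed

# Hypothetical data: only the third cluster's type changed after realignment.
fragment_types = {"r1": "none", "r2": "ins:CCT", "r3": "shifted"}
fragment_cluster = {"r1": 1, "r2": 2, "r3": 3}
feature_types = {1: "none", 2: "ins:CCT", 3: "realigned"}

corrected, changed = correct_variation_types(fragment_types, fragment_cluster, feature_types)
print(corrected, changed)
```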
As can be seen from the above examples, in the present application, the characteristic sequence is first corrected by double sequence alignment of the characteristic sequence and the reference sequence fragment; and then, the sequencing sequence fragment corresponding to the characteristic sequence is corrected according to the corrected characteristic sequence, so that the problem that part of the sequencing sequence fragment deviates relative to the reference sequence fragment in the multi-sequence comparison result is solved, and the accuracy of the genome variation detection result is improved.
Generally, when two sequences are aligned, the larger the difference in length between them, the more likely it is that several alternative alignment results exist, and hence the more likely it is that the alignment is wrong. That is, in step 604, when a signature sequence is aligned with the reference sequence fragment by double sequence alignment, the longer the signature sequence is, the higher the accuracy of the double sequence alignment result.
Referring to fig. 10A, which is a schematic diagram of the multiple sequence alignment state of another group of sequencing sequence fragments and a reference sequence fragment provided in the embodiments of the present application, the sequencing sequence fragments in fig. 10A are aggregated into three sequencing sequence clusters according to their variation types: a fourth sequencing sequence cluster, a fifth sequencing sequence cluster and a sixth sequencing sequence cluster.
Referring to fig. 10B, which is a schematic diagram of the feature sequences obtained by union processing of the sequencing sequence clusters in fig. 10A according to an embodiment of the present application, all sequencing sequence fragments in the fourth sequencing sequence cluster, the fifth sequencing sequence cluster and the sixth sequencing sequence cluster are respectively subjected to union processing to obtain the corresponding fourth feature sequence, fifth feature sequence and sixth feature sequence.
Referring to fig. 10A and 10B, since the sequencing sequence fragments in the fifth sequencing sequence cluster and the sixth sequencing sequence cluster are shorter (relative to the reference sequence fragment), the fifth signature sequence and the sixth signature sequence obtained after the union processing of all the sequencing sequence fragments in the sequencing sequence clusters are also shorter. If the fifth characteristic sequence or the sixth characteristic sequence is directly subjected to double-sequence comparison with the reference sequence fragment, an ideal comparison result is probably not obtained, so that the variation type of the characteristic sequence is inaccurate, and the correction of the sequencing sequence fragment is influenced.
Referring to fig. 11, which is a schematic flow chart of another genomic variation detection method provided in the embodiments of the present application, on the basis of the embodiment shown in fig. 6, the method may further include the following steps after step 603:
Step 1101: Perform double sequence alignment on any two of the characteristic sequences of the sequencing sequence clusters.
In the embodiment of the application, after the characteristic sequences of the sequencing sequence clusters are obtained, two-sequence comparison is performed on any two characteristic sequences in the characteristic sequences of each obtained sequencing sequence cluster, so as to judge whether the sequencing sequence clusters corresponding to the two characteristic sequences can be further merged. For example, for the signature sequences shown in FIG. 10B, the fourth signature sequence and the fifth signature sequence, the fourth signature sequence and the sixth signature sequence, and the fifth signature sequence and the sixth signature sequence are aligned in pairs, respectively.
Step 1102: Judge whether the overlapping regions of the two characteristic sequences match completely and whether the variation position of at least one of the characteristic sequences lies completely within the overlapping region.
If the overlapping regions of two signature sequences do not match completely, the two signature sequences have different variation types within the overlapping region and cannot be merged; a complete match of the overlapping regions is therefore a precondition for merging two signature sequences. Requiring that the variation position of at least one signature sequence relative to the reference sequence fragment lie completely within the overlapping region ensures that the two signature sequences share at least one variation position, with the same variation information, within their overlapping region.
For example, at the first variation position shown in fig. 10B, both the fourth signature sequence and the fifth signature sequence have a deletion of base segment CC relative to the second reference sequence fragment, and the first variation position lies within the overlapping region of the fourth and fifth signature sequences, so the fourth and fifth signature sequences satisfy the above judgment condition. At the second variation position shown in fig. 10B, both the fourth signature sequence and the sixth signature sequence have an insertion of base segment CC relative to the second reference sequence fragment, and the second variation position lies within the overlapping region of the fourth and sixth signature sequences, so the fourth and sixth signature sequences also satisfy the above judgment condition.
When the judgment condition is met, the method proceeds to step 1103 and the sequencing sequence clusters are further merged; otherwise, the method proceeds to step 604 and each signature sequence is aligned with the reference sequence fragment by double sequence alignment.
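The judgment condition of step 1102 can be sketched as follows. This is a simplified illustration that assumes the two signature sequences have already been placed in a shared column coordinate system (with "-" marking a deleted base), rather than being aligned on the fly. The `Feature` type, the helper names and the example data are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Feature(NamedTuple):
    start: int                         # 1-based column of the first base
    bases: str                         # aligned bases, "-" marks a deleted base
    variations: List[Tuple[int, int]]  # inclusive (first, last) columns of each variation

def end(feature: Feature) -> int:
    return feature.start + len(feature.bases) - 1

def can_merge(a: Feature, b: Feature) -> bool:
    """True when the overlapping columns match exactly and at least one
    variation of either sequence lies completely inside the overlap."""
    lo, hi = max(a.start, b.start), min(end(a), end(b))
    if lo > hi:
        return False  # no overlapping region at all
    overlap_a = a.bases[lo - a.start: hi - a.start + 1]
    overlap_b = b.bases[lo - b.start: hi - b.start + 1]
    if overlap_a != overlap_b:
        return False  # different variation types inside the overlap
    return any(lo <= first and last <= hi
               for feature in (a, b)
               for first, last in feature.variations)

# Hypothetical sequences: both carry the same two-base deletion at columns 5-6.
longer = Feature(1, "ACGT--ACGTAC", [(5, 6)])
shorter = Feature(3, "GT--AC", [(5, 6)])
outside = Feature(9, "GTAC", [])        # overlaps, but no variation inside the overlap
conflicting = Feature(3, "GTACAC", [])  # overlap does not match

print(can_merge(longer, shorter), can_merge(longer, outside), can_merge(longer, conflicting))
```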
Step 1103: Merge the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and merge the two characteristic sequences to obtain the characteristic sequence of the merged sequencing sequence cluster.
Because the sequencing sequence cluster and the characteristic sequence have a one-to-one correspondence relationship, after the sequencing sequence clusters are combined, the characteristic sequences of the sequencing sequence clusters also need to be correspondingly combined. Merging sequencing sequence clusters corresponding to the two characteristic sequences means that two sequencing sequence clusters before merging are replaced by the merged sequencing sequence cluster to update the sequencing sequence clusters; the merging of the two feature sequences means that the feature sequences obtained by merging are used for replacing the two feature sequences before merging, so as to update the feature sequences.
After the step 1103 is completed, the process returns to the step 1101, and continues to perform double sequence comparison on the feature sequences to determine whether there is a sequencing sequence cluster that meets the merge condition. The signature sequence in step 1101 includes a signature sequence obtained by union processing, and the sequencing sequence cluster in step 1103 includes a combined sequencing sequence cluster.
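The loop formed by steps 1101 to 1103 can be sketched generically as follows. The function `merge_until_stable` is hypothetical; it takes the judgment condition and the merge operation as parameters, and the usage example substitutes simple integer intervals for signature sequences purely to show the control flow.

```python
def merge_until_stable(items, can_merge, merge):
    """Repeatedly replace any two mergeable items by their merged result
    until no pair satisfies the merge condition."""
    items = list(items)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if can_merge(items[i], items[j]):
                    combined = merge(items[i], items[j])
                    # The merged item replaces the two items it was built from.
                    items = [x for k, x in enumerate(items) if k not in (i, j)]
                    items.append(combined)
                    merged_any = True
                    break
            if merged_any:
                break  # restart the pairwise comparison on the updated list
    return items

# Toy stand-in: intervals merge when they overlap.
def overlaps(a, b):
    return max(a[0], b[0]) <= min(a[1], b[1])

def span(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

stable = merge_until_stable([(1, 5), (4, 9), (20, 25), (8, 12)], overlaps, span)
print(sorted(stable))
```

Restarting the comparison after every merge mirrors the return to step 1101, so that a merged result can itself be merged again.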
Referring to fig. 12A and 12B, fig. 12A is a schematic diagram of a merging process for merging the signature sequences in fig. 10B according to an embodiment of the present disclosure, and fig. 12B is a schematic diagram of a merging process for merging the sequencing sequence clusters in fig. 10A according to an embodiment of the present disclosure. As shown in fig. 12A, the fourth signature sequence and the fifth signature sequence are first subjected to a double-sequence alignment, and since the overlapping region of the fourth signature sequence and the fifth signature sequence is completely matched and the first mutation position (deletion of the base segment CC) completely falls within the overlapping region, the fourth signature sequence and the fifth signature sequence are combined to obtain a seventh signature sequence. Accordingly, as shown in fig. 12B, the fourth sequencing sequence cluster and the fifth sequencing sequence cluster are combined to obtain a seventh sequencing sequence cluster.
Further, the seventh signature sequence and the sixth signature sequence are subjected to double sequence alignment, and since the overlapping region of the seventh signature sequence and the sixth signature sequence is completely matched and the second variation position (insertion of the base segment CC) completely falls within the overlapping region, the seventh signature sequence and the sixth signature sequence are combined to obtain the eighth signature sequence. Correspondingly, the seventh sequencing sequence cluster and the sixth sequencing sequence cluster are combined to obtain an eighth sequencing sequence cluster. Then, in the subsequent step 604, the eighth signature sequence is subjected to a double sequence alignment with the reference sequence fragment only, and the sequence fragments in the eighth sequencing sequence cluster are corrected according to the variation type of the eighth signature sequence. In step 604, each signature sequence is subjected to a double sequence alignment with the reference sequence fragment, where each signature sequence includes both the signature sequence of the sequencing sequence cluster that is not merged because it does not meet the merging condition, and the signature sequence of the merged sequencing sequence cluster obtained by merging the sequencing sequence clusters subsequently.
It can be seen from the above examples that, in the embodiments of the present application, the sequencing sequence clusters that meet the merging conditions are further merged to increase the length of the feature sequence, thereby improving the accuracy of the double sequence alignment result between the feature sequence and the reference sequence fragment.
Corresponding to the genomic variation detection method, the application also provides a genomic variation detection device.
Referring to fig. 13, a schematic structural diagram of a first genomic variation detection apparatus according to an embodiment of the present disclosure is shown.
The first genomic variation detection apparatus 1300 may include: a first double sequence alignment unit 1301, a potential mutation region determining unit 1302, a sequencing sequence fragment extracting unit 1303, a reference sequence fragment extracting unit 1304, a multiple sequence alignment unit 1305 and a mutation detection result determining unit 1306.
The first double-sequence comparison unit 1301 is configured to perform double-sequence comparison on multiple sequencing sequences of a genome with a reference sequence to obtain a double-sequence comparison result, where the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome.
A potential variation region determining unit 1302, configured to determine a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding region where a potential variation occurs in the genome.
A sequencing sequence fragment extracting unit 1303 for extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region.
A reference sequence fragment extracting unit 1304, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region.
A multiple sequence alignment unit 1305, configured to perform multiple sequence alignment on the reference sequence fragment and all sequencing sequence fragments to obtain a multiple sequence alignment result.
A variation detection result determining unit 1306, configured to determine a variation detection result of the genome according to the multiple sequence alignment result.
In one possible implementation manner of the present application, the potential variation region determining unit 1302 includes: a first coding region dividing unit for dividing the genome into a plurality of coding regions according to the base coding order of the genome; a variation type determining subunit, configured to determine variation types of all sequencing sequences according to the double-sequence comparison result; a probability distribution value statistic subunit, configured to sequentially count probability distribution values of sequencing sequences of different variation types in each coding interval; an information entropy calculating subunit, configured to calculate an information entropy of each of the coding sections according to the probability distribution value; a first threshold judgment subunit, configured to sequentially judge whether the information entropy of each coding interval is greater than a first threshold; a first potential variation region determining subunit, configured to determine that a coding section is a potential variation region when the information entropy of the coding section is greater than the first threshold.
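The entropy-based judgment performed by these subunits can be sketched as follows. This sketch assumes the information entropy is the Shannon entropy of the variation type distribution with a base-2 logarithm; the log base, the function names, the labels and the threshold value are assumptions made for illustration.

```python
import math
from collections import Counter

def interval_entropy(variation_types):
    """Shannon entropy (base 2, assumed) of the variation types observed
    among the sequencing sequences covering one coding interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_potential_variation_region(variation_types, first_threshold):
    return interval_entropy(variation_types) > first_threshold

uniform_interval = ["none"] * 20                                   # every read agrees
mixed_interval = ["none"] * 11 + ["ins:CCT"] * 7 + ["del:CCT"] * 8  # three types present

first_threshold = 1.0  # hypothetical value
print(is_potential_variation_region(uniform_interval, first_threshold))
print(is_potential_variation_region(mixed_interval, first_threshold))
```

An interval in which all sequencing sequences share one variation type has zero entropy, while an interval with several competing variation types has higher entropy and is more likely to be flagged.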
In one possible implementation manner of the present application, the potential variation region determining unit 1302 includes: a second coding region dividing unit for dividing the genome into a plurality of coding regions according to the base coding order of the genome; a variation quantity counting subunit, configured to count, in sequence, the number of the sequencing sequences that have undergone variation in each of the coding intervals; a second threshold judgment subunit, configured to judge whether the number of the mutated sequencing sequences in each coding interval is greater than a second threshold; and a second potential variation region determining subunit, configured to determine a coding region as a potential variation region when the number of the sequenced sequences with variation within the coding region is greater than the second threshold.
In a possible implementation manner of the present application, the sequencing sequence fragment extracting unit 1303 is specifically configured to extract an intersection portion of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
In a possible implementation manner of the present application, the sequencing sequence fragment extracting unit 1303 is specifically configured to extract the sequencing sequence as the sequencing sequence fragment when the intersection judging subunit judges that the sequencing sequence and the potential variation region have an intersection.
In a possible implementation manner of the present application, the reference sequence fragment extracting unit 1304 is specifically configured to extract an intersection portion of the reference sequence and the potential variation region as the reference sequence fragment.
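The two extraction variants described above (taking only the intersection portion, or taking the whole sequencing sequence when an intersection exists) can be sketched as follows. This is a simplified sketch that assumes each sequence is placed on the genome without gaps, so that bases map one-to-one to 1-based positions. The function names and example data are hypothetical.

```python
def intersection_fragment(start, bases, region):
    """Return (start, bases) of the part of a sequence inside the region,
    or None when the sequence and the region do not intersect."""
    region_start, region_end = region
    end = start + len(bases) - 1
    lo, hi = max(start, region_start), min(end, region_end)
    if lo > hi:
        return None
    return lo, bases[lo - start: hi - start + 1]

def whole_sequence_fragment(start, bases, region):
    """Return the whole sequence when it intersects the region, else None."""
    if intersection_fragment(start, bases, region) is None:
        return None
    return start, bases

region = (8, 12)  # hypothetical potential variation region
clipped = intersection_fragment(5, "ACGTACGTAC", region)
whole = whole_sequence_fragment(5, "ACGTACGTAC", region)
missing = intersection_fragment(20, "ACGT", region)
print(clipped, whole, missing)
```

The same `intersection_fragment` helper applies to the reference sequence, since the reference sequence fragment is likewise defined as the intersection with the potential variation region.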
In one possible implementation manner of the present application, the mutation detection result determining unit 1306 includes: a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract variation information of all the sequencing sequence fragments at the variation position from the multiple sequence alignment result; the sequencing sequence set aggregation subunit is used for aggregating all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation position in the same sequencing sequence set is the same; a third threshold judgment subunit, configured to sequentially judge whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and a variation detection result determining subunit, configured to determine, when the number of the sequencing sequence fragments in one of the sequencing sequence sets is greater than the third threshold, that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
Referring to fig. 14, a schematic structural diagram of a second genomic variation detection apparatus according to the embodiment of the present disclosure is shown.
The second genomic variation detection apparatus 1400 further includes, in addition to the first genomic variation detection apparatus 1300 shown in fig. 13: a variant type determining unit 1401, a sequencing sequence cluster converging unit 1402, a union processing unit 1403, a second double sequence aligning unit 1404 and a correcting unit 1405.
A variation type determining unit 1401, configured to determine variation types of all sequenced sequence fragments according to the multiple sequence alignment result.
A sequencing sequence cluster converging unit 1402, configured to converge all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments, where the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same.
A union processing unit 1403, configured to respectively union process all sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster.
A second double sequence alignment unit 1404, configured to perform double sequence alignment on each of the feature sequences and the reference sequence fragment to obtain a variation type of each of the feature sequences.
A correcting unit 1405, configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, where the variation type of each sequencing sequence fragment in the corrected multiple sequence alignment result is the same as the variation type of the feature sequence corresponding to that sequencing sequence fragment.
Fig. 15 is a schematic structural diagram of a third genomic variation detection apparatus provided in the embodiments of the present application.
The third genomic variation detection apparatus 1500 further includes, in addition to the second genomic variation detection apparatus 1400 shown in fig. 14: a third double sequence alignment unit 1501, an overlap region judgment unit 1502, and a merge unit 1503.
The third double-sequence alignment unit 1501 is configured to perform double-sequence alignment on any two obtained feature sequences in the feature sequences of each sequencing sequence cluster.
An overlap region determining unit 1502 is configured to determine whether there is an overlap region where two feature sequences completely match, and a variation position of at least one feature sequence is completely within the overlap region.
A merging unit 1503, configured to merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster when there is a complete match between overlapping regions of the two feature sequences and a variation position of at least one feature sequence is completely within the overlapping region, and merge the two feature sequences to obtain a feature sequence of the merged sequencing sequence cluster.
The relationship between the functional units in the genomic variation detection apparatus provided in the embodiments of the present application can be referred to in the steps of the genomic variation detection method, and is not described herein again.
Corresponding to the genomic variation detection method, the application also provides a genomic variation detection terminal.
Referring to fig. 16, which is a schematic structural diagram of a genomic variation detection terminal according to an embodiment of the present disclosure, the genomic variation detection terminal 1600 may include: a processor 1601, a memory 1602, and a communication unit 1603. These components communicate via one or more buses. Those skilled in the art will appreciate that the structure of the terminal shown in the figure does not limit the application: it may be a bus structure or a star structure, and it may include more or fewer components than those shown, combine some components, or arrange the components differently.
The communication unit 1603 is configured to establish a communication channel so that the terminal can communicate with other devices, receiving user data sent by other devices or sending user data to other devices.
The processor 1601 is the control center of the terminal. It connects the various parts of the entire terminal using various interfaces and lines, and executes the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 1602 and calling data stored in the memory. The processor may be composed of integrated circuits (ICs), for example a single packaged IC, or a plurality of packaged ICs with the same or different functions connected together. For example, the processor 1601 may include only a central processing unit (CPU). In the embodiments of the present application, the CPU may have a single arithmetic core or multiple arithmetic cores.
The memory 1602 is used for storing instructions executed by the processor 1601, and the memory 1602 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The executable instructions in the memory 1602, when executed by the processor 1601, enable the genomic variation detection terminal 1600 to perform the steps of:
performing double-sequence comparison between each of a plurality of sequencing sequences of a genome and a reference sequence to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome;
determining a potential variation region of the genome according to the double-sequence comparison result, wherein the potential variation region is a base coding region of the genome in which a potential variation occurs;
extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region;
extracting a reference sequence fragment from the reference sequence according to the potential variation region;
performing multi-sequence comparison on the reference sequence fragment and all sequencing sequence fragments to obtain a multi-sequence comparison result;
and determining the variation detection result of the genome according to the multi-sequence comparison result.
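As a non-limiting illustration of how a variation type might be read off a double-sequence comparison result, the following sketch assumes that the result is available as two gapped strings of equal length, with "-" marking a gap. This representation, the function name `variation_types`, and the three labels are assumptions of the sketch; the application does not prescribe a particular alignment output format.

```python
from typing import Set


def variation_types(ref_aln: str, read_aln: str) -> Set[str]:
    """Collect the kinds of difference between an aligned reference and read."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    kinds: Set[str] = set()
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base != "-":
            kinds.add("insertion")      # read has a base the reference lacks
        elif read_base == "-" and ref_base != "-":
            kinds.add("deletion")       # read lacks a base the reference has
        elif ref_base != read_base:
            kinds.add("substitution")   # both have a base, but they differ
    return kinds
```

An empty result means that the sequencing sequence matches the reference sequence over the aligned span.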
In a specific implementation, the present application further provides a computer storage medium. The computer storage medium may store a program, and the program, when executed, may perform some or all of the steps in the embodiments of the genomic variation detection method provided in the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments, or in some parts of the embodiments, of the present application.
For the same or similar parts among the embodiments in this specification, the embodiments may be referred to one another. In particular, because the apparatus embodiment and the terminal embodiment are substantially similar to the method embodiment, they are described relatively briefly; for relevant details, refer to the description of the method embodiment.
The above-described embodiments of the present application do not limit the scope of the present application.

Claims (16)

1. A method for detecting genomic variation, comprising:
performing double-sequence comparison on a plurality of sequencing sequences of a genome with a reference sequence to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome;
determining a potential variation region of the genome according to the double-sequence comparison result, wherein the potential variation region is a base coding region of the genome in which a potential variation occurs;
extracting sequencing sequence fragments from all sequencing sequences according to the potential variation regions;
extracting reference sequence fragments from the reference sequence according to the potential variation regions;
performing multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result;
determining the variation detection result of the genome according to the multi-sequence comparison result;
after multi-sequence comparison is carried out on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result, the method further comprises the following steps:
determining the variation types of all sequencing sequence fragments according to the multi-sequence comparison result;
according to the variation types of all sequencing sequence fragments, converging all the sequencing sequence fragments into at least one sequencing sequence cluster, wherein the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same;
respectively performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster;
performing double-sequence comparison on each characteristic sequence and the reference sequence fragment to obtain the variation type of each characteristic sequence;
and correcting the multi-sequence comparison result according to the variation type of each characteristic sequence, wherein the variation type of each sequencing sequence fragment in the corrected multi-sequence comparison result is the same as the variation type of the characteristic sequence corresponding to each sequencing sequence fragment.
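As a non-limiting illustration, the clustering, union processing, and correction steps of claim 1 might be sketched as follows. The claim does not define how the union is computed; this sketch assumes a per-position majority vote over the span covered by the fragments of a cluster, with "N" for uncovered positions. The `Fragment` layout, the function names, and the string labels for variation types are likewise assumptions of the sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Fragment = Tuple[int, str]  # (reference start, bases); a hypothetical layout


def cluster_by_type(frag_types: Dict[str, str]) -> Dict[str, List[str]]:
    """Group fragment ids so that each cluster holds a single variation type."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for frag_id, vtype in frag_types.items():
        clusters[vtype].append(frag_id)
    return dict(clusters)


def union_sequence(frags: List[Fragment]) -> Fragment:
    """Build one feature sequence spanning every position the fragments cover."""
    columns: Dict[int, Counter] = defaultdict(Counter)
    for start, seq in frags:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    lo, hi = min(columns), max(columns)
    bases = []
    for pos in range(lo, hi + 1):
        votes = columns.get(pos)
        # Majority base where covered; "N" marks a position no fragment covers.
        bases.append(votes.most_common(1)[0][0] if votes else "N")
    return lo, "".join(bases)


def correct_types(clusters: Dict[str, List[str]],
                  feature_type: Dict[str, str]) -> Dict[str, str]:
    """Give every fragment the variation type of its cluster's feature sequence."""
    return {frag_id: feature_type[cluster_id]
            for cluster_id, members in clusters.items()
            for frag_id in members}
```

In this sketch, `feature_type` would come from the double-sequence comparison of each feature sequence with the reference sequence fragment, and `correct_types` produces the corrected per-fragment variation types.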
2. The method for detecting genomic variation according to claim 1, further comprising, after respectively performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain the characteristic sequence of each sequencing sequence cluster:
performing double-sequence comparison on any two of the characteristic sequences of the sequencing sequence clusters;
judging whether the overlapping regions of the two characteristic sequences completely match and whether the variation position of at least one of the characteristic sequences lies entirely within the overlapping region;
when the overlapping regions of the two characteristic sequences completely match and the variation position of at least one of the characteristic sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and merging the two characteristic sequences to obtain the characteristic sequence of the merged sequencing sequence cluster.
3. The method for detecting genomic variation according to claim 1, wherein determining the potential variation region of the genome according to the double-sequence comparison result comprises:
dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome;
determining the variation types of all sequencing sequences according to the double-sequence comparison result;
sequentially counting the probability distribution values of the sequencing sequences of different variation types in each coding interval;
calculating the information entropy of each coding interval according to the probability distribution value;
sequentially judging whether the information entropy of each coding interval is larger than a first threshold value;
and when the information entropy of one coding interval is larger than the first threshold, judging the coding interval as a potential variation region.
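As a non-limiting illustration of the entropy test in claim 3, the information entropy of a coding interval may be taken as H = -sum_i p_i * log2(p_i), where p_i is the fraction of sequencing sequences in the interval that have variation type i. The base-2 logarithm, the treatment of "no variation" as one of the types, and the names `interval_entropy` and `potential_regions` are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end) of a coding interval; assumed layout


def interval_entropy(types: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the variation types seen in one interval."""
    counts = Counter(types)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no sequencing sequence falls in this interval
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def potential_regions(interval_types: Dict[Interval, List[str]],
                      first_threshold: float) -> List[Interval]:
    """Return the intervals whose entropy is larger than the first threshold."""
    return [interval for interval, types in interval_types.items()
            if interval_entropy(types) > first_threshold]
```

An interval in which all sequencing sequences agree has entropy 0, while an interval in which the sequencing sequences split evenly between two variation types has entropy 1 bit, so a larger entropy indicates stronger disagreement among the sequencing sequences.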
4. The method for detecting genomic variation according to claim 1, wherein determining the potential variation region of the genome according to the double-sequence comparison result comprises:
dividing the genome into a plurality of coding intervals according to the base coding sequence of the genome;
sequentially counting the number of sequencing sequences with variation in each coding interval;
judging whether the number of sequencing sequences with variation in each coding interval is larger than a second threshold value;
and when the number of sequencing sequences with variation in one coding interval is larger than the second threshold value, judging the coding interval as a potential variation region.
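As a non-limiting illustration of the count-based test in claim 4, the following sketch divides the genome into fixed-length coding intervals and counts, for each interval, the sequencing sequences with variation that overlap it. The fixed interval length, the half-open `(start, end, varied)` read layout, and the rule that a read is counted in every interval it overlaps are assumptions of the sketch.

```python
from typing import List, Tuple

Read = Tuple[int, int, bool]  # (start, end, has_variation), end exclusive


def potential_regions_by_count(reads: List[Read], genome_len: int,
                               interval_len: int,
                               second_threshold: int) -> List[Tuple[int, int]]:
    """Return intervals containing more varied reads than the second threshold."""
    n_intervals = -(-genome_len // interval_len)  # ceiling division
    counts = [0] * n_intervals
    for start, end, varied in reads:
        if not varied or end <= start:
            continue
        first = max(start // interval_len, 0)
        last = min((end - 1) // interval_len, n_intervals - 1)
        for k in range(first, last + 1):
            counts[k] += 1
    return [(k * interval_len, min((k + 1) * interval_len, genome_len))
            for k, count in enumerate(counts) if count > second_threshold]
```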
5. The method for detecting genomic variation according to claim 1, wherein extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region comprises:
extracting the intersection part of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
6. The method for detecting genomic variation according to claim 1, wherein extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region comprises:
when a sequencing sequence intersects the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.
7. The method for detecting genomic variations according to claim 1, wherein extracting reference sequence fragments from the reference sequence based on the potential variation regions comprises:
extracting the intersection of the reference sequence and the potential variation region as the reference sequence fragment.
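As a non-limiting illustration, the two fragment extraction variants of claims 5 and 6 and the reference extraction of claim 7 might be sketched as follows. The sketch assumes half-open reference coordinates and sequencing sequences placed on the reference without insertions or deletions, so that an offset within a sequencing sequence maps directly to a reference coordinate; the function names are hypothetical.

```python
from typing import Optional, Tuple

Region = Tuple[int, int]  # (start, end) on the reference, end exclusive


def clip_to_region(start: int, seq: str,
                   region: Region) -> Optional[Tuple[int, str]]:
    """Claim 5 style: keep only the part of the read inside the region."""
    r0, r1 = region
    lo, hi = max(start, r0), min(start + len(seq), r1)
    if lo >= hi:
        return None  # the read does not intersect the region
    return lo, seq[lo - start:hi - start]


def whole_if_intersecting(start: int, seq: str,
                          region: Region) -> Optional[Tuple[int, str]]:
    """Claim 6 style: keep the whole read whenever it intersects the region."""
    r0, r1 = region
    if start < r1 and start + len(seq) > r0:
        return start, seq
    return None


def reference_fragment(reference: str, region: Region) -> str:
    """Claim 7 style: the part of the reference inside the region."""
    r0, r1 = region
    return reference[r0:r1]
```

Clipping keeps the later multi-sequence comparison confined to the potential variation region, whereas keeping whole sequencing sequences preserves flanking bases.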
8. The method for detecting genomic variation according to claim 1, wherein determining the variation detection result of the genome according to the multi-sequence comparison result comprises:
determining variation positions in the potential variation region according to the multi-sequence comparison result;
extracting variation information of all sequencing sequence fragments at the variation positions from the multi-sequence comparison result;
converging all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation positions in the same sequencing sequence set is the same;
sequentially judging whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold value;
and when the number of the sequencing sequence fragments in one sequencing sequence set is greater than the third threshold value, determining that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
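As a non-limiting illustration of claim 8, the following sketch takes the multi-sequence comparison result as rows of equal length (one reference row and one row per sequencing sequence fragment), groups the fragments by the bases they carry at the variation positions, and reports the groups that are larger than the third threshold. The row representation, the function name `call_variants`, and the choice to skip the group that matches the reference at every variation position are assumptions of the sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple


def call_variants(ref_row: str, read_rows: List[str],
                  third_threshold: int) -> List[Dict[int, Tuple[str, str]]]:
    """Return one {column: (reference base, read base)} dict per supported set."""
    # Variation positions: columns where at least one fragment differs.
    var_cols = [i for i, ref_base in enumerate(ref_row)
                if any(row[i] != ref_base for row in read_rows)]
    # Fragments with identical bases at all variation positions form one set.
    groups = Counter(tuple(row[i] for i in var_cols) for row in read_rows)
    ref_key = tuple(ref_row[i] for i in var_cols)
    calls = []
    for key, size in groups.items():
        if size > third_threshold and key != ref_key:
            calls.append({col: (ref_row[col], base)
                          for col, base in zip(var_cols, key)
                          if base != ref_row[col]})
    return calls
```

Sets that do not exceed the third threshold are treated in this sketch as insufficiently supported and contribute nothing to the variation detection result.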
9. A genomic variation detection apparatus comprising:
the first double-sequence comparison unit is used for performing double-sequence comparison on a plurality of sequencing sequences of a genome and a reference sequence respectively to obtain a double-sequence comparison result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected of the genome;
a potential variation region determination unit, configured to determine a potential variation region of the genome according to the double-sequence comparison result, where the potential variation region is a base coding region of the genome in which a potential variation occurs;
a sequencing sequence fragment extraction unit for extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region;
a reference sequence fragment extracting unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region;
the multi-sequence comparison unit is used for carrying out multi-sequence comparison on the reference sequence fragments and all sequencing sequence fragments to obtain a multi-sequence comparison result;
a variation detection result determining unit, configured to determine a variation detection result of the genome according to the multi-sequence comparison result;
wherein the apparatus further comprises:
a variation type determining unit, configured to determine variation types of all sequencing sequence fragments according to the multi-sequence comparison result;
the sequencing sequence cluster converging unit is used for converging all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments, wherein the variation types of the sequencing sequence fragments in the same sequencing sequence cluster are the same;
the union processing unit is used for respectively carrying out union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster;
a second double-sequence comparison unit, configured to perform double-sequence comparison on each feature sequence and the reference sequence fragment to obtain a variation type of each feature sequence;
and the correcting unit is used for correcting the multi-sequence comparison result according to the variation type of each characteristic sequence, wherein the variation type of each sequencing sequence fragment in the corrected multi-sequence comparison result is the same as the variation type of the characteristic sequence corresponding to each sequencing sequence fragment.
10. The genomic variation detection apparatus of claim 9, further comprising:
a third double-sequence comparison unit, configured to perform double-sequence comparison on any two obtained feature sequences in the feature sequences of each sequencing sequence cluster;
an overlap region judging unit, configured to judge whether there is a complete match between the overlap regions of the two feature sequences, and a variation position of at least one feature sequence is completely within the overlap region;
and a merging unit, configured to, when there is a complete match between overlapping regions of two feature sequences and a variation position of at least one feature sequence is completely within the overlapping region, merge sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and merge the two feature sequences to obtain a feature sequence of the merged sequencing sequence cluster.
11. The genomic variation detection apparatus according to claim 9, wherein the potential variation region determining unit comprises:
a first coding interval dividing unit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome;
a variation type determining subunit, configured to determine variation types of all sequencing sequences according to the double-sequence comparison result;
a probability distribution value statistic subunit, configured to sequentially count probability distribution values of sequencing sequences of different variation types in each coding interval;
an information entropy calculating subunit, configured to calculate the information entropy of each coding interval according to the probability distribution values;
a first threshold judgment subunit, configured to sequentially judge whether the information entropy of each coding interval is greater than a first threshold;
and a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of the coding interval is greater than the first threshold.
12. The genomic variation detection apparatus according to claim 9, wherein the potential variation region determining unit comprises:
a second coding interval dividing unit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome;
a variation quantity counting subunit, configured to sequentially count the number of sequencing sequences with variation in each coding interval;
a second threshold judgment subunit, configured to judge whether the number of sequencing sequences with variation in each coding interval is greater than a second threshold;
and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences with variation in the coding interval is greater than the second threshold.
13. The genomic variation detection apparatus according to claim 9,
the sequencing sequence fragment extracting unit is specifically configured to extract an intersection portion of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
14. The genomic variation detection apparatus according to claim 9,
the reference sequence fragment extracting unit is specifically configured to extract an intersection portion of the reference sequence and the potential variation region as the reference sequence fragment.
15. The genomic variation detection apparatus according to claim 9, wherein the variation detection result determining unit comprises:
a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multi-sequence comparison result;
a variation information extraction subunit, configured to extract variation information of all the sequencing sequence fragments at the variation position from the multi-sequence comparison result;
the sequencing sequence set aggregation subunit is used for aggregating all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the variation information of the sequencing sequence fragments at the variation position in the same sequencing sequence set is the same;
a third threshold judgment subunit, configured to sequentially judge whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold;
and a variation detection result determining subunit, configured to determine, when the number of the sequencing sequence fragments in one of the sequencing sequence sets is greater than the third threshold, that the variation information of the sequencing sequence fragments in the sequencing sequence set is the variation detection result of the genome.
16. A genomic variation detection terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-8.
CN201680084673.7A 2016-04-20 2016-04-20 Genome variation detection method, device and terminal Active CN109074429B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/079745 WO2017181368A1 (en) 2016-04-20 2016-04-20 Method, device and terminal for detecting genome variations

Publications (2)

Publication Number Publication Date
CN109074429A CN109074429A (en) 2018-12-21
CN109074429B true CN109074429B (en) 2022-03-29

Family

ID=60116530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680084673.7A Active CN109074429B (en) 2016-04-20 2016-04-20 Genome variation detection method, device and terminal

Country Status (2)

Country Link
CN (1) CN109074429B (en)
WO (1) WO2017181368A1 (en)

Also Published As

Publication number Publication date
CN109074429A (en) 2018-12-21
WO2017181368A1 (en) 2017-10-26

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant