WO2017181368A1 - Method, device and terminal for detecting genome variations - Google Patents

Method, device and terminal for detecting genome variations

Publication number
WO2017181368A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
sequencing
variation
sequences
fragments
Prior art date
Application number
PCT/CN2016/079745
Other languages
French (fr)
Chinese (zh)
Inventor
何俊
张旸
张洪波
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2016/079745 (WO2017181368A1)
Priority to CN201680084673.7A (CN109074429B)
Publication of WO2017181368A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00: Subject matter not provided for in other main groups of this subclass

Definitions

  • The present application relates to the field of bioinformatics technology, and in particular to a method, device and terminal for detecting genomic variation.
  • Genomic variation refers to changes in the base-pair composition or order of the genome, including SNPs (single nucleotide polymorphisms) and indels (short insertions/deletions).
  • As the cost of genome sequencing continues to decline, the genome sequencing data produced by high-throughput sequencers is growing explosively, but obtaining high-quality genomic variation results from that data remains a challenge.
  • Traditional genomic variation detection is usually based on a reference sequence of the genome: each of the multiple sequencing sequences of the genome is subjected to double sequence alignment (pairwise alignment) with the reference sequence to obtain a double sequence alignment result for each sequencing sequence, including detailed information such as matches, mismatches, insertions, and deletions of the sequencing sequence relative to the reference sequence. The variation detection result of the genome is then determined from the alignment results of all sequencing sequences.
  • The reference sequence is the base sequence of the genome when no variation has occurred.
  • The sequencing sequence is the base sequence of the genome under detection.
  • The applicant found that the prior art has at least the following problem: traditional genomic variation detection only performs double sequence alignment between each sequencing sequence and the reference sequence and determines the variation detection result from those double sequence alignment results. Because the alignment of individual sequencing sequences can be inaccurate, one type of variation in the sequencing sequences is easily misaligned as several different types of variation, which makes the genomic variation detection result inaccurate.
  • The present application provides a method, device and terminal for detecting genomic variation, to solve the problem of inaccurate genomic variation detection results in the prior art.
  • An embodiment of the present application provides a method for detecting genomic variation, which comprises: performing double sequence alignment between each of a plurality of sequencing sequences of a genome and a reference sequence to obtain double sequence alignment results, wherein the reference sequence is the base sequence of the genome when no variation has occurred and the sequencing sequence is the base sequence of the genome under detection; determining a potential variation region of the genome according to the double sequence alignment results, wherein the potential variation region is a base coding interval of the genome in which variation potentially occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining the variation detection result of the genome according to the multiple sequence alignment result.
  • With this method, sequencing sequence fragments with the same variation type can be grouped and aligned together, so the alignment of the sequencing sequences is more accurate. This avoids misaligning one type of variation as several different types and thereby improves the accuracy of the genomic variation detection result.
  • The method further includes: determining the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; grouping all sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same cluster have the same variation type; performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each cluster; performing double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
  • With this implementation, each feature sequence is first corrected by double sequence alignment with the reference sequence fragment, and the sequencing sequence fragments corresponding to that feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that, in the multiple sequence alignment, some sequencing sequence fragments are offset relative to the reference sequence fragment, and improves the accuracy of the genomic variation detection result.
  • After union processing is performed on the sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each cluster, the method further comprises: performing double sequence alignment on any two of the obtained feature sequences; determining whether the overlapping regions of the two feature sequences match exactly and whether the variation position of at least one of the feature sequences lies entirely within the overlapping region; and, when the overlapping regions match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged cluster.
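The merge condition above can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the representation of a feature sequence as a start coordinate, a base string, and a set of variation positions is hypothetical, gaps are ignored, and "union processing" is assumed here to mean joining the two sequences over their combined span. The function name `try_merge` is not from the source.

```python
def try_merge(a, b):
    """Merge two feature sequences if the condition in the text holds.

    a, b: dicts with 'start' (int), 'seq' (str) and 'variant_positions'
    (set of int, genome coordinates). This layout is an assumption.
    Returns the merged feature sequence, or None if they must stay apart.
    """
    lo = max(a["start"], b["start"])
    hi = min(a["start"] + len(a["seq"]), b["start"] + len(b["seq"]))
    if lo >= hi:
        return None  # the two feature sequences do not overlap

    # The overlapping regions must match exactly.
    a_part = a["seq"][lo - a["start"]:hi - a["start"]]
    b_part = b["seq"][lo - b["start"]:hi - b["start"]]
    if a_part != b_part:
        return None

    # The variation positions of at least one sequence must lie in the overlap.
    def inside(feature):
        return all(lo <= p < hi for p in feature["variant_positions"])

    if not (inside(a) or inside(b)):
        return None

    # Assumed union: keep the leftmost sequence and append the uncovered tail.
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    first_end = first["start"] + len(first["seq"])
    second_end = second["start"] + len(second["seq"])
    tail = second["seq"][first_end - second["start"]:] if second_end > first_end else ""
    return {
        "start": first["start"],
        "seq": first["seq"] + tail,
        "variant_positions": a["variant_positions"] | b["variant_positions"],
    }
```

When `try_merge` returns a result, the two corresponding sequencing sequence clusters would be merged as well; that bookkeeping is left out of the sketch.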
  • Determining a potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; determining the variation types of all the sequencing sequences according to the double sequence alignment results; counting, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types; calculating the information entropy of each coding interval from the probability distribution values; determining in turn whether the information entropy of each coding interval is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region. With this implementation, the potential variation region of the genome is determined by information entropy.
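The entropy-based selection described above can be sketched in Python as follows. This is an illustrative sketch only: the source does not give the entropy formula, so Shannon entropy in bits is assumed, and the function names, the dictionary layout, and the label strings are hypothetical.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Shannon entropy (bits) of the variation-type distribution in one interval.

    variation_types: one label per sequencing sequence covering the interval,
    e.g. "match", "mismatch", "deletion" (labels are assumptions).
    """
    counts = Counter(variation_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Probability distribution value of each variation type is count / total.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def potential_regions_by_entropy(intervals, first_threshold):
    """Return the coding intervals whose entropy exceeds the first threshold.

    intervals: dict mapping (start, end) -> list of variation-type labels.
    """
    return [
        interval
        for interval, types in sorted(intervals.items())
        if interval_entropy(types) > first_threshold
    ]
```

An interval in which every sequencing sequence shows the same type has entropy 0, while an interval where the sequencing sequences disagree has higher entropy, which is why disagreement is used here as the signal for a potential variation region.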
  • Alternatively, determining a potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; counting, for each coding interval in turn, the number of sequencing sequences in which variation occurs; determining whether the number of sequencing sequences in which variation occurs within each coding interval is greater than a second threshold; and, when the number of sequencing sequences in which variation occurs within a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region.
  • With this implementation, the potential variation region of the genome is determined by the number of sequencing sequences in which variation occurs within the coding interval.
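A minimal Python sketch of the count-based selection follows. The fixed interval size, the half-open coordinate convention, and the input format (one list of variation positions per sequencing sequence) are assumptions made for illustration and are not specified in the source.

```python
def potential_regions_by_count(genome_length, interval_size,
                               read_variant_positions, second_threshold):
    """Return coding intervals in which more than `second_threshold`
    sequencing sequences show at least one variation.

    read_variant_positions: one list of variation positions (genome
    coordinates) per sequencing sequence; this format is hypothetical.
    """
    regions = []
    # Divide the genome into consecutive half-open intervals [start, end).
    for start in range(0, genome_length, interval_size):
        end = min(start + interval_size, genome_length)
        varied = sum(
            1
            for positions in read_variant_positions
            if any(start <= p < end for p in positions)
        )
        if varied > second_threshold:
            regions.append((start, end))
    return regions
```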
  • Extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: extracting the intersection of each sequencing sequence with the potential variation region as the sequencing sequence fragment.
  • Alternatively, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: when a sequencing sequence has an intersection with the potential variation region, extracting the whole sequencing sequence as the sequencing sequence fragment.
  • Extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the intersection of the reference sequence with the potential variation region as the reference sequence fragment.
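The two extraction variants for sequencing sequences, and the extraction of the reference sequence fragment, can be sketched as below. The sketch assumes each sequence is placed on the genome by a start coordinate with no gaps, and uses half-open regions; the function name and the `whole_read` flag are hypothetical.

```python
def extract_fragment(seq, seq_start, region, whole_read=False):
    """Extract the fragment of `seq` associated with a potential variation region.

    seq_start: genome coordinate of the first base of `seq` (assumed ungapped).
    region: (start, end), half-open.
    whole_read=False returns only the intersection with the region;
    whole_read=True returns the entire sequence if any intersection exists.
    Returns None when there is no intersection.
    """
    region_start, region_end = region
    seq_end = seq_start + len(seq)
    lo, hi = max(seq_start, region_start), min(seq_end, region_end)
    if lo >= hi:
        return None  # the sequence does not intersect the region
    if whole_read:
        return seq
    return seq[lo - seq_start:hi - seq_start]
```

The same function can be applied to the reference sequence with `seq_start=0` to obtain the reference sequence fragment as the intersection of the reference sequence with the region.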
  • Determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining a variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; grouping all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same set have the same variation information at the variation position; determining in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
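The grouping and thresholding step above can be sketched in a few lines of Python. The representation of variation information as a hashable tuple such as `("deletion", "TTTG")` is an assumption for illustration, as is the function name.

```python
from collections import Counter


def call_variants(variation_info_per_fragment, third_threshold):
    """Group fragments by their variation information at one variation
    position and keep the groups supported by more than `third_threshold`
    fragments.

    variation_info_per_fragment: one hashable value per sequencing sequence
    fragment, e.g. ("deletion", "TTTG"); this encoding is hypothetical.
    """
    # Each distinct value corresponds to one "sequencing sequence set".
    set_sizes = Counter(variation_info_per_fragment)
    return [info for info, size in set_sizes.items() if size > third_threshold]
```

For example, with five fragments carrying `("deletion", "TTTG")`, one carrying `("mismatch", "G>A")`, and a threshold of 2, only the deletion is kept as the detection result.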
  • An embodiment of the present application further provides a genomic variation detecting apparatus, which comprises: a first double sequence alignment unit, configured to perform double sequence alignment between each of a plurality of sequencing sequences of the genome and a reference sequence to obtain double sequence alignment results, wherein the reference sequence is the base sequence of the genome when no variation has occurred and the sequencing sequence is the base sequence of the genome under detection; a potential variation region determining unit, configured to determine a potential variation region of the genome according to the double sequence alignment results, the potential variation region being a base coding interval of the genome in which variation potentially occurs; a sequencing sequence fragment extraction unit, configured to extract sequencing sequence fragments from all the sequencing sequences according to the potential variation region; a reference sequence fragment extraction unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region; a multiple sequence alignment unit, configured to perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and a variation detection result determining unit, configured to determine the variation detection result of the genome according to the multiple sequence alignment result.
  • The apparatus further includes: a variation type determining unit, configured to determine the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; a sequencing sequence clustering unit, configured to group all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same cluster have the same variation type; a union processing unit, configured to perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each cluster; a second double sequence alignment unit, configured to perform double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and a correction unit, configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
  • The apparatus further includes: a third double sequence alignment unit, configured to perform double sequence alignment on any two of the obtained feature sequences of the sequencing sequence clusters; an overlap region determining unit, configured to determine whether the overlapping regions of the two feature sequences match exactly and whether the variation position of at least one of the feature sequences lies entirely within the overlapping region; and a merging unit, configured to, when the overlapping regions of the two feature sequences match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged cluster.
  • The potential variation region determining unit includes: a first coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome; a variation type determining subunit, configured to determine the variation types of all the sequencing sequences according to the double sequence alignment results; a probability distribution value statistics subunit, configured to count, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types; an information entropy calculation subunit, configured to calculate the information entropy of each coding interval from the probability distribution values; a first threshold determining subunit, configured to determine in turn whether the information entropy of each coding interval is greater than a first threshold; and a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when its information entropy is greater than the first threshold.
  • Alternatively, the potential variation region determining unit includes: a second coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome; a variation quantity statistics subunit, configured to count, for each coding interval in turn, the number of sequencing sequences in which variation occurs; a second threshold determining subunit, configured to determine whether the number of sequencing sequences in which variation occurs within each coding interval is greater than a second threshold; and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences in which variation occurs within that interval is greater than the second threshold.
  • The sequencing sequence fragment extraction unit is specifically configured to extract the intersection of each sequencing sequence with the potential variation region as the sequencing sequence fragment.
  • Alternatively, the sequencing sequence fragment extraction unit is specifically configured to extract the whole sequencing sequence as the sequencing sequence fragment when the intersection determining subunit determines that the sequencing sequence has an intersection with the potential variation region.
  • The reference sequence fragment extraction unit is specifically configured to extract the intersection of the reference sequence with the potential variation region as the reference sequence fragment.
  • The variation detection result determining unit includes: a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; a sequencing sequence set grouping subunit, configured to group all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same set have the same variation information at the variation position; a third threshold determining subunit, configured to determine in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and a variation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
  • An embodiment of the present application further provides a genomic variation detecting terminal, the terminal comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the following steps: performing double sequence alignment between each of a plurality of sequencing sequences of a genome and a reference sequence to obtain double sequence alignment results, wherein the reference sequence is the base sequence of the genome when no variation has occurred and the sequencing sequence is the base sequence of the genome under detection; determining a potential variation region of the genome according to the double sequence alignment results, wherein the potential variation region is a base coding interval of the genome in which variation potentially occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining the variation detection result of the genome according to the multiple sequence alignment result.
  • The steps further include: determining the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; grouping all sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same cluster have the same variation type; performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each cluster; performing double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
  • After the feature sequence of each sequencing sequence cluster is obtained by performing union processing on the sequencing sequence fragments in that cluster, the steps further include: performing double sequence alignment on any two of the obtained feature sequences; determining whether the overlapping regions of the two feature sequences match exactly and whether the variation position of at least one of the feature sequences lies entirely within the overlapping region; and, when the overlapping regions match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged cluster.
  • Determining a potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; determining the variation types of all the sequencing sequences according to the double sequence alignment results; counting, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types; calculating the information entropy of each coding interval from the probability distribution values; determining in turn whether the information entropy of each coding interval is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region.
  • Alternatively, determining a potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; counting, for each coding interval in turn, the number of sequencing sequences in which variation occurs; determining whether the number of sequencing sequences in which variation occurs within each coding interval is greater than a second threshold; and, when the number of sequencing sequences in which variation occurs within a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region.
  • Extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: extracting the intersection of each sequencing sequence with the potential variation region as the sequencing sequence fragment.
  • Alternatively, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: when a sequencing sequence has an intersection with the potential variation region, extracting the whole sequencing sequence as the sequencing sequence fragment.
  • Extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the intersection of the reference sequence with the potential variation region as the reference sequence fragment.
  • Determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining a variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; grouping all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same set have the same variation information at the variation position; determining in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
  • An embodiment of the present application further provides a storage medium that can store a program; when executed, the program may perform some or all of the steps of each embodiment of the genomic variation detection method provided by the present application.
  • The genomic variation detection method, device and terminal provided by the embodiments of the present application perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result, and determine the genomic variation detection result according to that result. Because multiple sequence alignment tends to align sequences with higher similarity first, aligning the reference sequence fragment together with all the sequencing sequence fragments causes sequencing sequence fragments with the same variation type to be grouped and aligned together. The alignment of the sequencing sequence fragments is therefore more accurate, which avoids misaligning one type of variation as several different types and improves the accuracy of the genomic variation detection result.
  • FIG. 1A is a schematic diagram of a double sequence alignment state of a sequencing sequence and a reference sequence according to an embodiment of the present application
  • FIG. 1B is a schematic diagram showing an alignment state after the alignment sequence of FIG. 1A is aligned and corrected according to an embodiment of the present application
  • FIG. 2 is a schematic flow chart of a method for detecting genomic variation according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of coding interval division of a genome according to an embodiment of the present application.
  • FIG. 4A is a schematic diagram of a process of extracting a sequencing sequence fragment and a reference sequence fragment according to an embodiment of the present application
  • FIG. 4B is a schematic diagram of another process of extracting a sequencing sequence fragment and a reference sequence fragment according to an embodiment of the present application
  • FIG. 5A is a schematic diagram showing a multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application
  • FIG. 5B is a schematic diagram showing another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
  • FIG. 6 is a schematic flow chart of another method for detecting genomic variation according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a convergence result of a cluster of sequencing sequences according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a process of a union process according to an embodiment of the present application.
  • FIG. 9A is a schematic diagram of a feature sequence obtained by performing a union process on the cluster of sequencing sequences in FIG. 7 according to an embodiment of the present application;
  • FIG. 9B is a schematic diagram of a double sequence alignment result obtained by performing double sequence alignment on the feature sequence of FIG. 9A and the reference sequence segment according to an embodiment of the present application;
  • FIG. 9C is a schematic diagram of the corrected multi-sequence alignment result obtained by correcting the multi-sequence alignment result according to the variation type of the feature sequence in FIG. 9B according to the embodiment of the present application;
  • FIG. 10A is a schematic diagram showing a multi-sequence alignment state of another sequencing sequence fragment and a reference sequence fragment according to an embodiment of the present application.
  • FIG. 10B is a schematic diagram of a feature sequence obtained by performing a union process on the cluster of sequencing sequences in FIG. 10A according to an embodiment of the present application;
  • FIG. 11 is a schematic flow chart of another method for detecting genomic variation according to an embodiment of the present application.
  • FIG. 12A is a schematic diagram of a merge process of merging the feature sequences in FIG. 10B according to an embodiment of the present application;
  • FIG. 12B is a schematic diagram of a merge process of merging the sequence clusters in FIG. 10A according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a first genomic variation detecting apparatus according to an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a second genomic variation detecting apparatus according to an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a third genomic variation detecting apparatus according to an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a genomic mutation detecting terminal according to an embodiment of the present application.
  • FIG. 1A is a schematic diagram of a dual sequence alignment of a sequencing sequence and a reference sequence according to an embodiment of the present application.
  • FIG. 1B is a schematic diagram of the alignment state after the alignment in FIG. 1A is corrected, according to an embodiment of the present application.
  • In FIG. 1A and FIG. 1B, the two identical base sequences represent the reference sequence, and the dotted lines represent the sequencing sequences. Mismatches and deletions of a sequencing sequence relative to the reference sequence are indicated by the base letters of the sequencing sequence and by dots, respectively.
  • In FIG. 1A, some of the sequencing sequences show both a G->A mismatch (from base G to base A) and an A->G mismatch (from base A to base G), while other sequencing sequences show a deletion of the base segment TTTG. In FIG. 1B, after the alignment is corrected, the sequencing sequences that showed both the G->A and A->G mismatches are confirmed to carry the TTTG deletion.
  • The genomic variation detection method, apparatus and terminal put the reference sequence fragment and all the sequencing sequence fragments together for multiple sequence alignment. Because multiple sequence alignment tends to cluster and align sequences with higher similarity first, sequencing sequence fragments of the same variation type are clustered and aligned together. The alignment of the sequencing sequence fragments is therefore more accurate, which avoids erroneously aligning one variation type as several different variation types and improves the accuracy of the genomic variation detection result.
  • Step 201: Perform double sequence alignment on the multiple sequencing sequences of the genome and the reference sequence to obtain a double sequence alignment result.
  • The reference sequence is the base sequence of the genome when no variation has occurred and represents the correct order of the bases in the genome. A sequencing sequence is a base sequence of the genome to be detected. Therefore, the reference sequence can be used as the benchmark for judging the variation of a sequencing sequence.
  • When the base sequence of a sequencing sequence is consistent with that of the reference sequence, the sequencing sequence has no variation; when the two are inconsistent, the sequencing sequence has a variation.
  • the variation of the sequencing sequence mainly includes base mismatch, insertion and deletion.
  • the sequencing sequence is a short sequence fragment.
  • The greater the number of sequencing sequences, the more raw data is obtained during genomic variation detection, the more data is available when the detection results are statistically analyzed in subsequent steps, and the more accurate the genomic variation detection result is.
  • Through double sequence alignment, each sequencing sequence can be positioned at the corresponding position of the reference sequence, and detailed information about the variation of each sequencing sequence relative to the reference sequence, including match, mismatch, insertion and deletion information, can be obtained.
  • The genome is first divided into multiple coding intervals along the coding order of the genome, and each coding interval is then examined in turn to determine whether it is a potential variation region. This provides an initial screening of genomic variation positions and improves detection efficiency.
  • The genome is divided into contiguous coding intervals of equal length, and the coding intervals are examined in order until the entire genome has been traversed, so that no region to be detected is missed.
  • the length of the coding interval may be adjusted according to actual needs. For example, any length within a range of 50-300 bp (bp represents a base pair) may be selected, which is not limited in this application.
  • FIG. 3 is a schematic diagram of a coding interval division of a genome according to an embodiment of the present application. Since the reference sequence is the base sequence of the genome without variation, the coding interval of a base pair in the reference sequence is its coding interval in the genome. Therefore, the scheme can be explained using the coding intervals of the reference sequence to represent the coding intervals of the genome. As shown in FIG. 3, along the coding order of the genome, the genome is divided into coding intervals of length 100 bp, forming in sequence the first coding interval (1510531, 1510630), the second coding interval (1510631, 1510730), the third coding interval (1510731, 1510830), the fourth coding interval (1510831, 1510930), and so on.
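  • The division into coding intervals described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `split_into_coding_intervals` and the choice to let the last interval be shorter when the range is not a multiple of the interval length are assumptions, not part of the application.

```python
def split_into_coding_intervals(start, end, length=100):
    """Split the inclusive coordinate range [start, end] into consecutive
    coding intervals of `length` bp. The last interval may be shorter
    (an assumption; the application does not discuss the remainder)."""
    intervals = []
    pos = start
    while pos <= end:
        intervals.append((pos, min(pos + length - 1, end)))
        pos += length
    return intervals


# Coordinates taken from the FIG. 3 example (100 bp intervals).
intervals = split_into_coding_intervals(1510531, 1510930, length=100)
for interval in intervals:
    print(interval)
```

  • Each returned tuple is an inclusive (start, end) pair, matching the way the coding intervals are written in the FIG. 3 example.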
  • Each coding interval is examined in turn to determine whether it is a potential variation region, and the potential variation regions of the genome are thereby selected from all the coding intervals. It should be noted that a genome may contain one or more potential variation regions, which is not limited in this application.
  • Information entropy reflects the degree of disorder of the sequences: the larger the information entropy, the more disordered the sequences are and the more likely it is that the sequencing sequences contain variations. Therefore, in one possible implementation of the present application, the potential variation region can be determined by information entropy.
  • As another example, the greater the number of sequencing sequences with variations within a coding interval, the more likely that coding interval is a potential variation region. Thus, in another possible implementation of the present application, the potential variation region can be determined by the number of sequencing sequences with variations within the coding interval.
  • a method for determining a potential variation region by information entropy is specifically:
  • The variation types of all the sequencing sequences are determined based on the double sequence alignment result. Because the double sequence alignment result contains detailed match, mismatch, insertion and deletion information of each sequencing sequence relative to the reference sequence, the variation type of each sequencing sequence can be determined directly from it.
  • Sequencing sequences of the same variation type are sequencing sequences that have identical variation information relative to the reference sequence; a sequencing sequence with no variation is also treated as having a variation type.
  • The probability distribution values of the sequencing sequences of the different variation types are counted according to the variation type information. Specifically, for each variation type in the coding interval, the ratio of the number of sequencing sequences of that variation type to the total number of sequencing sequences is calculated in turn; the resulting probability distribution values are denoted p_i.
  • For example, if the sequencing sequences in the coding interval have a first variation type and a second variation type, the numbers of sequencing sequences of the first and second variation types are counted separately. The number of sequencing sequences of the first variation type divided by the total number of sequencing sequences gives the probability value p_1 of the first variation type, and the number of sequencing sequences of the second variation type divided by the total number of sequencing sequences gives the probability value p_2 of the second variation type. p_1 and p_2 are the probability distribution values of the sequencing sequences of the different variation types within the coding interval.
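  • The probability values p_i and an entropy computed from them can be sketched in Python as follows. The entropy formula is not given in this passage, so the standard Shannon entropy H = -Σ p_i·log2(p_i) is assumed here; the function names, the string labels for variation types and the `entropy_threshold` parameter are likewise hypothetical.

```python
import math
from collections import Counter


def variation_type_probabilities(variation_types):
    """Return p_i for each variation type: count of that type / total count."""
    counts = Counter(variation_types)
    total = len(variation_types)
    return {vtype: n / total for vtype, n in counts.items()}


def information_entropy(probabilities):
    """Shannon entropy in bits (assumed formula, see the lead-in)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def is_potential_variation_region(variation_types, entropy_threshold):
    """Hypothetical decision rule: entropy above a threshold marks the region."""
    probs = variation_type_probabilities(variation_types)
    return information_entropy(probs.values()) > entropy_threshold


# Eight sequencing sequences: six without variation, two with a CCT insertion.
types = ["no variation"] * 6 + ["CCT insertion"] * 2
probs = variation_type_probabilities(types)
entropy = information_entropy(probs.values())
print(probs, entropy)
```

  • With a single variation type the entropy is zero, and it grows as the sequencing sequences spread over more variation types, which is the behaviour the text relies on.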
  • a method for determining a potential variation region by the number of sequencing sequences that vary within the coding interval is specifically:
  • The sequencing sequences with variations within the coding interval are counted, including sequencing sequences with mismatches, insertions or deletions.
  • When the number of sequencing sequences with variations within the coding interval is greater than a second threshold, the coding interval is determined to be a potential variation region.
  • For example, if the second threshold is set to 50, the coding interval is determined to be a potential variation region when the number of sequencing sequences with variations within it is greater than 50; otherwise, the coding interval is not a potential variation region.
  • a person skilled in the art can adjust the size of the second threshold according to actual needs, which is not limited in this application.
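  • The count-based rule above can be sketched in Python as follows. The record layout (a dictionary with `mismatches`, `insertions` and `deletions` counts per sequencing sequence) and the function names are assumptions made for illustration; only the comparison against the second threshold comes from the text.

```python
def has_variation(alignment):
    """A sequencing sequence has a variation if its alignment record
    reports any mismatch, insertion or deletion (hypothetical fields)."""
    return bool(
        alignment["mismatches"] or alignment["insertions"] or alignment["deletions"]
    )


def is_potential_variation_region_by_count(alignments, second_threshold=50):
    """True when the number of sequencing sequences with variations in the
    coding interval is strictly greater than the second threshold."""
    varied = sum(1 for alignment in alignments if has_variation(alignment))
    return varied > second_threshold


clean = {"mismatches": 0, "insertions": 0, "deletions": 0}
varied = {"mismatches": 1, "insertions": 0, "deletions": 0}
print(is_potential_variation_region_by_count([varied] * 51 + [clean] * 20))
print(is_potential_variation_region_by_count([varied] * 50 + [clean] * 20))
```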
  • Step 203: Extract sequencing sequence fragments from all the sequencing sequences according to the potential variation region.
  • The sequencing sequence fragments within the potential variation region need to be extracted from the sequencing sequences for analysis and processing in subsequent steps.
  • The coding interval of the reference sequence that corresponds to the part where a sequencing sequence overlaps the reference sequence is used as the coding interval of that sequencing sequence.
  • The intersection of each sequencing sequence and the potential variation region is extracted as a sequencing sequence fragment.
  • When the coding interval of a sequencing sequence lies completely within the coding interval of the potential variation region, the entire sequencing sequence is used as a sequencing sequence fragment; when the coding interval of the potential variation region partially intersects the coding interval of the sequencing sequence, the intersection of the sequencing sequence and the potential variation region is extracted as a sequencing sequence fragment; when the two coding intervals do not intersect, the sequencing sequence is discarded.
  • FIG. 4A is a schematic diagram of a process of extracting a sequence segment and a reference sequence segment according to an embodiment of the present application.
  • Three different kinds of sequencing sequences are taken as an example to illustrate the extraction of sequencing sequence fragments.
  • The coding interval of the potential variation region is (1510531, 1510630), the coding interval of the first sequencing sequence is (1510541, 1510590), the coding interval of the second sequencing sequence is (1510521, 1510570), and the coding interval of the third sequencing sequence is (1510651, 1510700).
  • For the first sequencing sequence, the coding interval (1510541, 1510590) lies completely within the coding interval (1510531, 1510630) of the potential variation region, so the entire first sequencing sequence is extracted as a sequencing sequence fragment. For the second sequencing sequence, the coding interval (1510521, 1510570) partially intersects the coding interval (1510531, 1510630) of the potential variation region; the coding interval of the intersection is (1510531, 1510570), so the part of the second sequencing sequence within (1510531, 1510570) is extracted. For the third sequencing sequence, the coding interval does not intersect the coding interval of the potential variation region, so the third sequencing sequence is discarded.
  • The extracted sequencing sequence fragments are therefore the entire first sequencing sequence and the part of the second sequencing sequence within the coding interval (1510531, 1510570).
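  • The three cases of the extraction rule can be sketched in Python as follows. The sketch assumes, for simplicity, that each base of a sequencing sequence maps to exactly one reference coordinate (no insertions or deletions inside the read); the function name and the placeholder base strings are hypothetical, and only the coordinates come from the FIG. 4A example.

```python
def extract_fragment(read_seq, read_interval, region):
    """Return ((start, end), bases) for the part of a sequencing sequence that
    intersects the potential variation region, or None when there is no
    intersection. Intervals are inclusive (start, end) pairs."""
    read_start, read_end = read_interval
    region_start, region_end = region
    start = max(read_start, region_start)
    end = min(read_end, region_end)
    if start > end:
        return None  # no intersection: the sequencing sequence is discarded
    return (start, end), read_seq[start - read_start : end - read_start + 1]


region = (1510531, 1510630)
first = extract_fragment("A" * 50, (1510541, 1510590), region)   # fully inside
second = extract_fragment("C" * 50, (1510521, 1510570), region)  # partial overlap
third = extract_fragment("G" * 50, (1510651, 1510700), region)   # no overlap
print(first[0], second[0], third)
```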
  • When the coding interval of a sequencing sequence only partially intersects the coding interval of the potential variation region, the sequencing sequence is truncated during the extraction of the sequencing sequence fragment, and only the intersection of the sequencing sequence and the potential variation region is extracted as a sequencing sequence fragment. Truncating a sequencing sequence destroys its integrity and loses part of its information, which affects the accuracy of the genomic variation detection result.
  • To address this, when the coding interval of the potential variation region and the coding interval of a sequencing sequence partially overlap, the potential variation region is extended based on the coding interval of the sequencing sequence, so that the sequencing sequence is not truncated during the extraction of the sequencing sequence fragment and its integrity is preserved.
  • FIG. 4B is a schematic diagram of a process of extracting another sequence segment and a reference sequence segment provided by an embodiment of the present application.
  • The process of extracting sequencing sequence fragments in FIG. 4B is substantially similar to that of FIG. 4A. The difference is as follows.
  • For the second sequencing sequence, because its coding interval partially intersects the coding interval of the potential variation region, the union (1510521, 1510630) of the coding interval (1510521, 1510570) of the second sequencing sequence and the coding interval (1510531, 1510630) of the potential variation region is used as the coding interval of the expanded potential variation region. The sequencing sequence fragment of the second sequencing sequence is then extracted within the potential variation region, which by this point has been updated to the expanded potential variation region.
  • As a result, the entire second sequencing sequence is extracted as a sequencing sequence fragment. That is, in this implementation, if a sequencing sequence intersects the potential variation region, the entire sequencing sequence is extracted as a sequencing sequence fragment.
  • The foregoing manner of expanding the potential variation region is only one specific implementation shown in the embodiments of the present application. Those skilled in the art may make corresponding adjustments according to actual needs, all of which shall fall within the scope of the present application.
  • For example, a union may be taken over the coding interval of the potential variation region and the coding intervals of all the sequencing sequences within the potential variation region (that is, the sequencing sequences that partially intersect the potential variation region and the sequencing sequences that fall entirely within it), and the result of the union is used as the coding interval of the expanded potential variation region.
  • Because the sequencing sequences are not truncated during the extraction of the sequencing sequence fragments, their integrity is preserved, which improves the accuracy of genomic variation detection.
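  • The expansion of the potential variation region by taking a union of coding intervals can be sketched in Python as follows. The function name is hypothetical, and the intersection test is performed against the original, unexpanded region, which is one reading of the text.

```python
def expand_region(region, read_intervals):
    """Extend the potential variation region so that it covers every
    sequencing sequence that intersects the original region.
    Intervals are inclusive (start, end) pairs."""
    region_start, region_end = region
    new_start, new_end = region_start, region_end
    for read_start, read_end in read_intervals:
        intersects = read_start <= region_end and read_end >= region_start
        if intersects:
            new_start = min(new_start, read_start)
            new_end = max(new_end, read_end)
    return new_start, new_end


reads = [(1510541, 1510590), (1510521, 1510570), (1510651, 1510700)]
expanded = expand_region((1510531, 1510630), reads)
print(expanded)
```

  • With the coordinates of the FIG. 4B example, only the second sequencing sequence extends the region, and the third, which does not intersect it, is ignored.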
  • Step 204: Extract a reference sequence fragment from the reference sequence according to the potential variation region.
  • The reference sequence fragment is extracted from the reference sequence based on the coding interval of the potential variation region.
  • In FIG. 4A, the coding interval of the potential variation region is (1510531, 1510630), and the part of the reference sequence within the coding interval (1510531, 1510630) is extracted as the reference sequence fragment. In FIG. 4B, the coding interval of the expanded potential variation region is (1510521, 1510630), and the part of the reference sequence within the coding interval (1510521, 1510630) is extracted as the reference sequence fragment.
  • Step 205: Perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result.
  • The reference sequence fragment and all the sequencing sequence fragments are put together for multiple sequence alignment. The process of multiple sequence alignment includes:
  • Establishing a distance matrix: calculate the distance between every two sequences (including the distance between the reference sequence fragment and each sequencing sequence fragment, and the distance between any two sequencing sequence fragments), and build a distance matrix from these pairwise distances;
  • Constructing a clustering tree: first cluster the two sequences with the smallest distance in the distance matrix, then update the distance matrix, then cluster the two closest sequences or the two closest classes of sequences in the updated distance matrix, and so on, until all the sequences are clustered together, yielding a clustering tree of the reference sequence fragment and the sequencing sequence fragments;
  • Aligning the sequences: following the clustering hierarchy of the sequencing sequence fragments and the reference sequence fragment in the clustering tree, first align the two innermost sequences, and then progressively align all the sequencing sequence fragments and the reference sequence fragment.
  • The distance between sequences represents their similarity: the smaller the distance, the higher the similarity, and sequences with higher similarity are clustered first. Therefore, when the reference sequence fragment and all the sequencing sequence fragments are put together for multiple sequence alignment, sequencing sequence fragments of the same variation type are clustered and aligned together.
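  • The first two stages described above (distance matrix and clustering tree) can be sketched in Python as follows. The text does not specify the distance measure or how the matrix is updated after a merge, so this sketch assumes edit distance and average linkage; the function names and the short example sequences are hypothetical. The final alignment stage is not implemented here.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, used here as an assumed sequence distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_clustering_tree(seqs):
    """Repeatedly merge the two closest clusters (average linkage, assumed)
    and return the merge order as nested tuples of sequence indices."""
    members = {i: (i,) for i in range(len(seqs))}  # cluster id -> sequence indices
    trees = {i: i for i in range(len(seqs))}       # cluster id -> subtree
    dist = {
        (i, j): edit_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def cluster_distance(a, b):
        pairs = [(min(i, j), max(i, j)) for i in members[a] for j in members[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    next_id = len(seqs)
    while len(members) > 1:
        a, b = min(
            combinations(sorted(members), 2),
            key=lambda pair: cluster_distance(*pair),
        )
        members[next_id] = members.pop(a) + members.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    (root,) = trees.values()
    return root


# Index 0 plays the role of the reference fragment; 2 and 3 carry a CCT insertion.
fragments = ["ACGTACGT", "ACGTACGT", "ACGTCCTACGT", "ACGTCCTACGT"]
tree = build_clustering_tree(fragments)
print(tree)
```

  • In this toy input the unvaried fragment is grouped with the reference and the two fragments with the same insertion are grouped with each other before the two groups are joined, which mirrors the behaviour the text attributes to multiple sequence alignment.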
  • FIG. 5A is a schematic diagram showing a multi-sequence alignment state of a sequencing sequence fragment and a reference sequence fragment according to an embodiment of the present application.
  • As shown in FIG. 5A, the sequencing sequence fragments have three different variation types. After all the sequencing sequence fragments and the first reference sequence fragment are put together for multiple sequence alignment, the sequencing sequence fragments of each of the three variation types are clustered and aligned together.
  • In a diploid or polyploid, sequencing sequence fragments belonging to the same haplotype usually have the same variation type. Therefore, putting the reference sequence fragment and all the sequencing sequence fragments together for multiple sequence alignment also clusters together the sequencing sequence fragments belonging to the same haplotype, which allows the genomic variation of a diploid or polyploid to be detected.
  • Step 206: Determine a variation detection result of the genome according to the multiple sequence alignment result.
  • The variation detection result of the genome can be determined according to the multiple sequence alignment result.
  • In the present application, the mutation position in the potential variation region is first determined according to the multiple sequence alignment result. The variation information of all the sequencing sequence fragments at the mutation position is then extracted from the multiple sequence alignment result. According to the variation information, all the sequencing sequence fragments are grouped into at least one sequencing sequence set, where the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the mutation position. It is then determined whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; when it is, the variation information of the sequencing sequence fragments in that sequencing sequence set is determined to be a variation detection result of the genome.
  • For example, suppose the mutation position in the potential variation region is determined to be 1510581. The variation information of all the sequencing sequence fragments at coding position 1510581 is extracted, and there are three kinds: no variation, insertion of the base segment CCT, and deletion of the base segment CCT. According to the variation information, all the sequencing sequence fragments are grouped into three sequencing sequence sets: a first sequencing sequence set (no variation, 11 sequencing sequence fragments), a second sequencing sequence set (insertion of the base segment CCT, 7 sequencing sequence fragments) and a third sequencing sequence set (deletion of the base segment CCT, 8 sequencing sequence fragments). It is then determined in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than the third threshold.
  • If the third threshold is 6, the number of sequencing sequence fragments in each of the three sequencing sequence sets is greater than the third threshold, so the variation detection result of the genome at coding position 1510581 is: no variation; insertion of the base segment CCT; deletion of the base segment CCT. This can also indicate that the variation results of the three haplotypes of a triploid at 1510581 are: no variation; insertion of the base segment CCT; deletion of the base segment CCT.
  • If the third threshold is 10, only the number of sequencing sequence fragments in the first sequencing sequence set is greater than the third threshold, so the variation detection result of the genome at coding position 1510581 is: no variation.
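  • The grouping and thresholding in this example can be sketched in Python as follows. The function name and the string labels for the variation information are hypothetical; the counts (11, 7, 8) and the thresholds (6 and 10) are taken from the example above.

```python
from collections import Counter


def call_variations(variation_info_at_position, third_threshold):
    """Group sequencing sequence fragments by their variation information at
    the mutation position and keep every group whose size is strictly
    greater than the third threshold."""
    set_sizes = Counter(variation_info_at_position)
    return [info for info, size in set_sizes.items() if size > third_threshold]


# One entry per sequencing sequence fragment at coding position 1510581.
observed = ["no variation"] * 11 + ["CCT insertion"] * 7 + ["CCT deletion"] * 8

result_threshold_6 = call_variations(observed, third_threshold=6)
result_threshold_10 = call_variations(observed, third_threshold=10)
print(result_threshold_6)
print(result_threshold_10)
```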
  • FIG. 5B is a schematic diagram showing a multi-sequence alignment state of another sequencing sequence fragment and a reference sequence fragment provided by an embodiment of the present application. FIG. 5B shows the multiple sequence alignment of the first reference sequence fragment and the sequencing sequence fragments.
  • In FIG. 5B, although the sequencing sequence fragments having the same variation type have been clustered together, some of them have an overall offset relative to the reference sequence fragment.
  • An offset of a sequencing sequence fragment relative to the reference sequence fragment changes the variation type of the sequencing sequence fragment relative to the reference sequence fragment, which in turn affects the accuracy of genomic variation detection. Therefore, after multiple sequence alignment is performed on the sequencing sequence fragments and the reference sequence fragment, the variation types of the sequencing sequence fragments relative to the reference sequence fragment need to be corrected.
  • FIG. 6 is a schematic flowchart of another method for detecting a genomic variation according to an embodiment of the present application.
  • On the basis of the embodiment shown in FIG. 2, the method may further include the following steps after step 205:
  • Step 601: Determine the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result.
  • After multiple sequence alignment, the sequencing sequence fragments having the same variation type are clustered and aligned together, and the variation type of every sequencing sequence fragment relative to the reference sequence fragment can be obtained, as shown in FIG. 5A and FIG. 5B. Because some sequencing sequence fragments in FIG. 5B are offset as a whole relative to the reference sequence fragment, the variation types of the offset sequencing sequence fragments in FIG. 5B need to be corrected in order to obtain the multiple sequence alignment result shown in FIG. 5A.
  • Step 602: Cluster all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments.
  • All the sequencing sequence fragments are classified according to their variation types, and the sequencing sequence fragments having the same variation type are clustered into the same sequencing sequence cluster, which facilitates the correction of their variation types.
  • FIG. 7 is a schematic diagram of a sequencing sequence clustering result according to an embodiment of the present application, in which all the sequencing sequence fragments in the multiple sequence alignment result shown in FIG. 5B are clustered into three sequencing sequence clusters according to their variation types. The sequencing sequence fragments in the first sequencing sequence cluster have no variation; those in the second sequencing sequence cluster have an insertion of the base segment CCT; and those in the third sequencing sequence cluster have a deletion of the base segment CGCCAG and a base mismatch.
  • Step 603: Perform a union process on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster.
  • FIG. 8 is a schematic diagram of a union process according to an embodiment of the present application.
  • FIG. 8 includes two sequencing sequence fragments: the coding interval of the first sequencing sequence fragment is (1, 15), and the coding interval of the second sequencing sequence fragment is (4, 18).
  • The two sequencing sequence fragments contain the same base sequence TCCCCTCCTCCT in their overlapping coding interval (4, 15). The overlapping parts of the two sequencing sequence fragments are merged, and the non-overlapping parts are used as the head and tail of the feature sequence, respectively, yielding the feature sequence GACTCCCCTCCTCCTCCT with coding interval (1, 18).
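  • The union process of FIG. 8 can be sketched in Python as follows. The function name is hypothetical, and the two input fragments are derived here from the stated feature sequence and coding intervals rather than quoted from the figure. The sketch assumes that fragments agree on every position where they overlap and together cover the whole interval, and it raises an error otherwise.

```python
def union_fragments(fragments):
    """Merge fragments given as ((start, end), bases) with inclusive
    coordinates into one feature sequence covering their union."""
    start = min(interval[0] for interval, _ in fragments)
    end = max(interval[1] for interval, _ in fragments)
    merged = [None] * (end - start + 1)
    for (frag_start, _), bases in fragments:
        for offset, base in enumerate(bases):
            idx = frag_start - start + offset
            if merged[idx] is not None and merged[idx] != base:
                raise ValueError(f"fragments disagree at position {start + idx}")
            merged[idx] = base
    if None in merged:
        raise ValueError("fragments do not cover the whole interval")
    return (start, end), "".join(merged)


first_fragment = ((1, 15), "GACTCCCCTCCTCCT")
second_fragment = ((4, 18), "TCCCCTCCTCCTCCT")
feature_interval, feature_sequence = union_fragments([first_fragment, second_fragment])
print(feature_interval, feature_sequence)
```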
  • FIG. 9A is a schematic diagram of the feature sequences obtained by performing a union process on the sequencing sequence clusters in FIG. 7 according to an embodiment of the present application. A union process is performed on all the sequencing sequence fragments in the first, second and third sequencing sequence clusters in FIG. 7, respectively, to obtain the corresponding first, second and third feature sequences.
  • Step 604: Perform double sequence alignment on each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence.
  • The variation type obtained by performing double sequence alignment on a feature sequence and the reference sequence fragment is the variation type of that feature sequence under the optimal alignment result. On this basis, the sequencing sequence fragments corresponding to the feature sequence can be corrected in a subsequent step according to the variation type of the feature sequence.
  • If the sequencing sequence fragments in a cluster are offset relative to the reference sequence fragment, the feature sequence obtained by the union process has the same offset, and performing double sequence alignment on the offset feature sequence and the reference sequence fragment corrects the feature sequence. That is, if the sequencing sequence fragments are offset, the variation type of the corresponding feature sequence changes after double sequence alignment with the reference sequence fragment; if they are not offset, the variation type of the feature sequence is unchanged. Therefore, in the embodiment of the present application, whether the sequencing sequence fragments corresponding to a feature sequence need to be corrected can be determined according to whether the variation type of the feature sequence changes after the double sequence alignment.
  • FIG. 9B is a schematic diagram of the double sequence alignment of the feature sequences of FIG. 9A and the reference sequence fragment according to an embodiment of the present application. The first, second and third feature sequences shown in FIG. 9A are each subjected to double sequence alignment with the reference sequence fragment, and the alignment results are shown in FIG. 9B.
  • Comparing FIG. 9A and FIG. 9B, after the double sequence alignment of the feature sequences and the reference sequence fragment, the variation types of the first and second feature sequences do not change, while the variation type of the third feature sequence changes.
  • This indicates that the sequencing sequence fragments corresponding to the first and second feature sequences already achieved the best alignment after multiple sequence alignment and need no correction, whereas the sequencing sequence fragments corresponding to the third feature sequence are offset as a whole relative to the reference sequence fragment and need further correction.
  • Step 605: Correct the multiple sequence alignment result according to the variation type of each feature sequence.
  • The variation types of the sequencing sequence fragments corresponding to a feature sequence are corrected based on the variation type of the feature sequence; that is, the multiple sequence alignment result is corrected. Specifically, when the variation type of a feature sequence differs from the variation type of its corresponding sequencing sequence fragments, the variation type of those sequencing sequence fragments is adjusted to the variation type of the feature sequence, so that in the corrected multiple sequence alignment result the variation type of each sequencing sequence fragment is the same as the variation type of its corresponding feature sequence.
  • After the double sequence alignment, the variation type of the third feature sequence has changed, so that it differs from the variation type of the sequencing sequence fragments of the third sequencing sequence cluster. Therefore, the variation type of the sequencing sequence fragments of the third sequencing sequence cluster needs to be adjusted according to the variation type of the third feature sequence.
  • FIG. 9C is a schematic diagram of the corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation types of the feature sequences in FIG. 9B according to the embodiment of the present application, in which the variation type of the sequencing sequence fragments of the third sequencing sequence cluster is adjusted to the variation type of the third feature sequence.
  • The feature sequence is first corrected by the double sequence alignment of the feature sequence and the reference sequence fragment, and the sequencing sequence fragments corresponding to the feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that some sequencing sequence fragments are offset from the reference sequence fragment in the multiple sequence alignment result and improves the accuracy of the genomic variation detection result.
  • In double sequence alignment, the greater the difference in length between the two sequences, the more alternative alignment results are possible, and therefore the greater the probability that the double sequence alignment is wrong. That is, in step 604, when a feature sequence is aligned with the reference sequence fragment by double sequence alignment, the longer the feature sequence is, the higher the accuracy of the double sequence alignment result of the feature sequence and the reference sequence fragment.
  • FIG. 10A is a schematic diagram showing a multi-sequence alignment state of another sequencing sequence fragment and a reference sequence fragment provided by an embodiment of the present application.
  • The sequencing sequence fragments are aggregated into three sequencing sequence clusters, namely a fourth sequencing sequence cluster, a fifth sequencing sequence cluster and a sixth sequencing sequence cluster.
  • FIG. 10B is a schematic diagram of the feature sequences obtained by performing union processing on the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application. All the sequencing sequence fragments in the fourth, fifth and sixth sequencing sequence clusters are respectively subjected to union processing, yielding the corresponding fourth, fifth and sixth feature sequences.
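  • The union processing can be pictured with a short sketch. It assumes, purely for illustration, that each sequencing sequence fragment of a cluster is given as a hypothetical (offset, bases) pair in the column coordinates of the multiple sequence alignment, and that fragments of one cluster agree wherever they overlap. The function name `union_fragments` and the sample data are not taken from the application.

```python
def union_fragments(fragments):
    """Merge (offset, bases) fragments of one cluster into a feature sequence.

    Assumption of this sketch: fragments agree wherever they overlap and
    together cover a contiguous span. A ValueError is raised otherwise.
    """
    merged = {}
    for offset, bases in fragments:
        for i, base in enumerate(bases):
            pos = offset + i
            # setdefault keeps the first base seen at a column
            if merged.setdefault(pos, base) != base:
                raise ValueError(f"fragments disagree at position {pos}")
    if not merged:
        raise ValueError("no fragments given")
    start, end = min(merged), max(merged)
    if len(merged) != end - start + 1:
        raise ValueError("fragments leave a gap")
    return start, "".join(merged[p] for p in range(start, end + 1))


# Invented sample cluster: three overlapping fragments
cluster = [(0, "ACGTAC"), (3, "TACGGA"), (6, "GGATT")]
feature = union_fragments(cluster)
print(feature)
```

The resulting feature sequence is at least as long as the longest fragment in the cluster, which is the property the later double sequence alignment relies on.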
  • FIG. 11 is a schematic flowchart of another method for detecting a genomic variation according to an embodiment of the present application.
  • On the basis of the embodiment shown in FIG. 6, the method may further include the following steps after step 603:
  • Step 1101: Perform double sequence alignment on any two of the feature sequences of the obtained sequencing sequence clusters.
  • Any two of the feature sequences of the obtained sequencing sequence clusters are aligned by double sequence alignment in order to determine whether the sequencing sequence clusters corresponding to the two feature sequences can be further merged.
  • For example, double sequence alignment is performed on the fourth and fifth feature sequences, on the fourth and sixth feature sequences, and on the fifth and sixth feature sequences, respectively.
  • Step 1102: Determine whether there are two feature sequences whose overlapping regions match exactly and for which the variation position of at least one of the two feature sequences lies completely within the overlapping region.
  • If the overlapping regions of two feature sequences cannot be completely matched, the two feature sequences have different variation types in their overlapping regions and therefore cannot be merged. Complete matching of the overlapping regions is thus the premise of merging feature sequences. Requiring that the variation position of at least one feature sequence relative to the reference sequence segment falls completely within the overlapping region ensures that the two feature sequences have at least one variation position with the same variation information in their overlapping regions.
  • For example, at the first variation position in FIG. 10B, the fourth feature sequence and the fifth feature sequence both have a deletion of the base segment CC relative to the second reference sequence segment, and the first variation position is located in the overlapping region of the fourth feature sequence and the fifth feature sequence, so the fourth feature sequence and the fifth feature sequence satisfy the above judgment condition. At the second variation position in FIG. 10B, the fourth feature sequence and the sixth feature sequence both have an insertion of the base segment CC relative to the second reference sequence segment, and the second variation position is located in the overlapping region of the fourth feature sequence and the sixth feature sequence, so the fourth feature sequence and the sixth feature sequence also satisfy the above judgment condition.
  • If so, the process proceeds to step 1103 to further merge the sequencing sequence clusters; otherwise, the process proceeds to step 604 to perform double sequence alignment of each feature sequence with the reference sequence segment.
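  • The judgment in step 1102 can be sketched as a small predicate. The representation is an assumption of this sketch, not something the application prescribes: each feature sequence is a hypothetical (offset, gapped_bases) pair in multiple sequence alignment columns, with "-" marking a deleted column, and each variation position is a half-open (start, end) column interval. The name `can_merge` and the sample sequences are invented.

```python
def can_merge(a, b, variations_a, variations_b):
    """Return True if feature sequences a and b meet the merging condition.

    a, b: (offset, gapped_bases) in alignment columns (assumed layout).
    variations_a, variations_b: lists of half-open (start, end) intervals.
    """
    (off_a, seq_a), (off_b, seq_b) = a, b
    lo = max(off_a, off_b)
    hi = min(off_a + len(seq_a), off_b + len(seq_b))
    if lo >= hi:
        return False  # the two feature sequences do not overlap
    # Condition 1: the overlapping regions match exactly
    if seq_a[lo - off_a:hi - off_a] != seq_b[lo - off_b:hi - off_b]:
        return False
    # Condition 2: at least one variation lies completely inside the overlap
    return any(lo <= s and e <= hi for s, e in variations_a + variations_b)


# Invented data: both sequences carry a 2-column deletion at columns 4..5
fourth = (0, "ACGT--TTGA")
fifth = (3, "T--TTGACCA")
print(can_merge(fourth, fifth, [(4, 6)], [(4, 6)]))
```

A pair whose overlap matches but contains no variation position is rejected, because nothing in the overlap confirms that the two clusters carry the same variation.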
  • Step 1103: Merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
  • Merging the sequencing sequence clusters corresponding to the two feature sequences means replacing the two sequencing sequence clusters before merging with the merged sequencing sequence cluster, thereby updating the sequencing sequence clusters. Performing union processing on the two feature sequences means replacing the two feature sequences before the union processing with the feature sequence obtained by the union processing, thereby updating the feature sequences.
  • After step 1103 is completed, the process returns to step 1101 to continue the double sequence alignment of the feature sequences and determine whether there are still sequencing sequence clusters that meet the merging condition. At this point, the feature sequences in step 1101 include the feature sequence obtained by the union processing, and the sequencing sequence clusters in step 1103 include the merged sequencing sequence cluster.
  • FIG. 12A is a schematic diagram of the process of merging the feature sequences in FIG. 10B according to an embodiment of the present application, and FIG. 12B is a schematic diagram of the process of merging the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application. As shown in FIG. 12A, the fourth feature sequence and the fifth feature sequence are first subjected to double sequence alignment. Because their overlapping regions match completely and the first variation position (the deletion of the base segment CC) falls completely within the overlapping region, the fourth feature sequence and the fifth feature sequence are subjected to union processing to obtain a seventh feature sequence. Accordingly, as shown in FIG. 12B, the fourth sequencing sequence cluster and the fifth sequencing sequence cluster are merged to obtain a seventh sequencing sequence cluster.
  • The seventh feature sequence and the sixth feature sequence are then subjected to double sequence alignment. Because their overlapping regions match completely and the second variation position (the insertion of the base segment CC) falls completely within the overlapping region, the seventh feature sequence and the sixth feature sequence are subjected to union processing to obtain an eighth feature sequence. Accordingly, the seventh sequencing sequence cluster and the sixth sequencing sequence cluster are merged to obtain an eighth sequencing sequence cluster. Then, in the subsequent step 604, only the eighth feature sequence is aligned with the reference sequence segment by double sequence alignment, and the sequencing sequence fragments in the eighth sequencing sequence cluster are corrected according to the variation type of the eighth feature sequence.
  • In step 604, each feature sequence is aligned with the reference sequence segment by double sequence alignment, where the feature sequences include both the feature sequences of sequencing sequence clusters that were not merged because they did not meet the merging condition and the feature sequences of the merged sequencing sequence clusters obtained by merging.
  • In this embodiment, sequencing sequence clusters that meet the merging condition are further merged to increase the length of the feature sequences, thereby improving the accuracy of the double sequence alignment between the feature sequences and the reference sequence segment.
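  • The loop formed by steps 1101 to 1103 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the application: each cluster is a hypothetical dict holding a feature sequence as (offset, gapped_bases) in alignment columns, a list of half-open variation intervals, and a list of fragment identifiers. All names and sample data are invented, and the sample only loosely mirrors the fourth, fifth and sixth clusters discussed above.

```python
from itertools import combinations


def mergeable(c1, c2):
    """Step 1102 (sketch): exact overlap match plus a variation inside it."""
    a, b = c1["feature"], c2["feature"]
    lo = max(a[0], b[0])
    hi = min(a[0] + len(a[1]), b[0] + len(b[1]))
    if lo >= hi or a[1][lo - a[0]:hi - a[0]] != b[1][lo - b[0]:hi - b[0]]:
        return False
    return any(lo <= s and e <= hi
               for s, e in c1["variations"] + c2["variations"])


def merge(c1, c2):
    """Step 1103 (sketch): merge two clusters and union their features."""
    cols = {}
    for off, seq in (c1["feature"], c2["feature"]):
        cols.update({off + i: ch for i, ch in enumerate(seq)})
    start, end = min(cols), max(cols)
    return {
        "feature": (start, "".join(cols[p] for p in range(start, end + 1))),
        "variations": sorted(set(c1["variations"] + c2["variations"])),
        "fragments": c1["fragments"] + c2["fragments"],
    }


def merge_clusters(clusters):
    """Repeat steps 1101 to 1103 until no pair meets the merging condition."""
    clusters = list(clusters)
    merged_any = True
    while merged_any:
        merged_any = False
        for i, j in combinations(range(len(clusters)), 2):
            if mergeable(clusters[i], clusters[j]):
                new = merge(clusters[i], clusters[j])
                # Replace the two old clusters with the merged one
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(new)
                merged_any = True
                break  # restart the pairwise comparison (back to step 1101)
    return clusters


fourth = {"feature": (2, "GT--TTGACCAT"), "variations": [(4, 6), (10, 12)],
          "fragments": ["r1", "r2"]}
fifth = {"feature": (0, "ACGT--TT"), "variations": [(4, 6)], "fragments": ["r3"]}
sixth = {"feature": (8, "GACCATGG"), "variations": [(10, 12)],
         "fragments": ["r4", "r5"]}
result = merge_clusters([fourth, fifth, sixth])
print(result[0]["feature"])
```

In this invented example the fifth and sixth clusters do not overlap each other, so they can only end up together after one of them has been merged with the fourth cluster, which is why the comparison restarts after every merge.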
  • the present application also provides a genomic variation detecting device.
  • FIG. 13 is a schematic structural diagram of a first genomic variation detecting apparatus according to an embodiment of the present application.
  • The first genomic variation detecting apparatus 1300 may include: a first double sequence alignment unit 1301, a potential variation region determining unit 1302, a sequencing sequence segment extracting unit 1303, a reference sequence segment extracting unit 1304, a multiple sequence alignment unit 1305, and a mutation detection result determining unit 1306.
  • The first double sequence alignment unit 1301 is configured to perform double sequence alignment between each of multiple sequencing sequences of the genome and the reference sequence, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and the sequencing sequence is the base sequence of the genome to be detected.
  • the potential variation region determining unit 1302 is configured to determine a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval in which a potential variation occurs in the genome.
  • the sequencing sequence fragment extracting unit 1303 is configured to extract a sequencing sequence fragment from all the sequencing sequences according to the potential variation region.
  • the reference sequence segment extracting unit 1304 is configured to extract a reference sequence segment in the reference sequence according to the potential variation region.
  • the multiple sequence alignment unit 1305 is configured to perform multiple sequence alignment on the reference sequence fragment and all the sequenced sequence fragments to obtain a multiple sequence alignment result.
  • the mutation detection result determining unit 1306 is configured to determine a mutation detection result of the genome according to the multiple sequence alignment result.
  • The potential variation region determining unit 1302 includes: a second coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation quantity counting subunit, configured to count in turn the number of sequencing sequences in which variation occurs within each of the coding intervals; a second threshold determining subunit, configured to determine whether the number of sequencing sequences in which variation occurs within each of the coding intervals is greater than a second threshold; and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences in which variation occurs within that coding interval is greater than the second threshold.
  • the sequencing sequence segment extracting unit 1303 is specifically configured to extract an intersection portion of each of the sequencing sequence and the potential variation region as the sequencing sequence segment.
  • The sequencing sequence segment extracting unit 1303 is configured to: when the intersection determining subunit determines that a sequencing sequence and the potential variation region have an intersection, extract that sequencing sequence as the sequencing sequence segment.
  • the reference sequence segment extracting unit 1304 is specifically configured to extract an intersection portion of the reference sequence and the potential variation region as the reference sequence segment.
  • The mutation detection result determining unit 1306 includes: a mutation position determining subunit, configured to determine a mutation position in the potential variation region according to the multiple sequence alignment result; a mutation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the mutation position; a sequencing sequence set aggregating subunit, configured to aggregate, according to the variation information, all of the sequencing sequence fragments into at least one sequencing sequence set, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the mutation position; a third threshold determining subunit, configured to determine in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and a mutation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in one of the sequencing sequence sets is greater than the third threshold, that the variation information of the sequencing sequence fragments in that sequencing sequence set is the mutation detection result of the genome.
  • FIG. 14 is a schematic structural diagram of a second genomic variation detecting apparatus according to an embodiment of the present application.
  • On the basis of the first genomic variation detecting apparatus 1300 shown in FIG. 13, the second genomic variation detecting apparatus 1400 further includes: a mutation type determining unit 1401, a sequencing sequence cluster merging unit 1402, a union processing unit 1403, a second double sequence alignment unit 1404, and a correcting unit 1405.
  • the mutation type determining unit 1401 is configured to determine a variation type of all the sequence segments according to the multiple sequence alignment result.
  • the sequencing sequence cluster converging unit 1402 is configured to aggregate all the sequenced sequence fragments into at least one sequencing sequence cluster according to the variation type of the all sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type.
  • the union processing unit 1403 is configured to perform a union process on each of the sequencing sequence segments in each of the sequencing sequence clusters to obtain a characteristic sequence of each of the sequencing sequence clusters.
  • the second double sequence alignment unit 1404 is configured to perform double sequence alignment on each of the feature sequences and the reference sequence segments to obtain a variation type of each of the feature sequences.
  • The correcting unit 1405 is configured to correct the multiple sequence alignment result according to the variation type of each of the feature sequences, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that sequencing sequence fragment.
  • FIG. 15 is a schematic structural diagram of a third genomic variation detecting apparatus according to an embodiment of the present application.
  • On the basis of the second genomic variation detecting apparatus 1400 shown in FIG. 14, the third genomic variation detecting apparatus 1500 further includes a third double sequence alignment unit 1501, an overlap region determining unit 1502, and a merging unit 1503.
  • the third double sequence alignment unit 1501 is configured to perform double sequence alignment on any two of the feature sequences of each of the obtained sequencing sequence clusters.
  • The overlap region determining unit 1502 is configured to determine whether there are two feature sequences whose overlapping regions match exactly and for which the variation position of at least one of the two feature sequences lies completely within the overlapping region.
  • The merging unit 1503 is configured to: when the overlapping regions of two feature sequences match exactly and the variation position of at least one of the two feature sequences lies completely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
  • the present application also provides a genomic mutation detection terminal.
  • FIG. 16 is a schematic structural diagram of a genomic mutation detecting terminal according to an embodiment of the present application.
  • The genomic variation detecting terminal 1600 may include: a processor 1601, a memory 1602, and a communication unit 1603. These components communicate through one or more buses. It will be understood by those skilled in the art that the structure of the server shown in the figure does not constitute a limitation of the present application: it may be a bus structure or a star structure, and it may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components.
  • The communication unit 1603 is configured to establish a communication channel so that the storage device can communicate with other devices, receiving user data sent by other devices or sending user data to other devices.
  • The processor 1601 is the control center of the storage device. It connects the various parts of the entire electronic device by using various interfaces and lines, and performs the various functions of the electronic device and/or processes data by running or executing software programs and/or modules stored in the memory 1602 and calling data stored in the memory.
  • the processor may be composed of an integrated circuit (IC), for example, may be composed of a single packaged IC, or may be composed of a plurality of packaged ICs that have the same function or different functions.
  • the processor 1601 may include only a Central Processing Unit (CPU).
  • the CPU may be a single operation core, and may also include a multi-operation core.
  • The memory 1602 is configured to store execution instructions of the processor 1601. The memory 1602 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • the genomic mutation detecting terminal 1600 is enabled to perform the following steps:
  • the present application further provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in various embodiments of the calling method provided by the application.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
  • The technology in the embodiments of the present application can be implemented by means of software plus a necessary general hardware platform. Based on such understanding, the technical solution in the embodiments of the present application, in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments, or in portions of the embodiments, of the present application.

Abstract

Provided are a method, a device and a terminal for detecting genome variations, wherein the method for detecting genome variations comprises: respectively performing pairwise sequence alignment between multiple sequencing sequences of a genome and a reference sequence, and obtaining a pairwise sequence alignment result (201); determining a potential variable region of the genome according to the pairwise sequence alignment result (202); extracting a sequencing sequence fragment from all the sequencing sequences according to the potential variable region (203); extracting a reference sequence fragment from the reference sequence according to the potential variable region (204); performing multiple sequence alignment between the reference sequence fragment and all the sequencing sequence fragments, and obtaining a multiple sequence alignment result (205); and determining a variation detection result of the genome according to the multiple sequence alignment result (206). As for performing multiple sequence alignments between the reference sequence fragment and all the sequencing sequence fragments, the sequencing sequence fragments of the same variation type can be aggregated for the alignment, so as to improve the accuracy of the detection result of genome variations.

Description

Genomic variation detection method, device and terminal
Technical field
The present application relates to the field of bioinformatics, and in particular to a genomic variation detection method, device and terminal.
Background
At the molecular level, genomic variation refers to changes in the composition or order of the base pairs in a genome, and mainly includes SNPs (Single Nucleotide Polymorphisms) and indels (short Insertions/Deletions, that is, the insertion or deletion of small fragments). As the cost of genome sequencing continues to fall, the volume of genome sequencing data produced by high-throughput sequencers has grown explosively, but obtaining high-quality genomic variation detection results from that data remains a challenging task.
Traditional genomic variation detection usually takes the reference sequence of the genome as a baseline. Each of multiple sequencing sequences of the genome is aligned with the reference sequence by double sequence alignment, which yields a double sequence alignment result for each sequencing sequence, including detailed match, mismatch, insertion and deletion information relative to the reference sequence. The variation detection result of the genome is then determined from the double sequence alignment results of all the sequencing sequences. Here, the reference sequence is the base sequence of the genome when no variation has occurred, and the sequencing sequence is the base sequence of the genome under test.
However, in the course of implementing the present application, the applicant found that the prior art has at least the following problem: because traditional genomic variation detection only aligns each sequencing sequence with the reference sequence by double sequence alignment and determines the variation detection result of the genome from those results, inaccurate alignment of the sequencing sequences can easily cause a single type of variation in the sequencing sequences to be wrongly aligned as different types of variation, which makes the genomic variation detection result inaccurate.
Summary of the invention
The present application provides a genomic variation detection method, device and terminal to solve the problem in the prior art that genomic variation detection results are inaccurate.
In a first aspect, an embodiment of the present application provides a genomic variation detection method, including: performing double sequence alignment between each of multiple sequencing sequences of a genome and a reference sequence to obtain double sequence alignment results, where the reference sequence is the base sequence of the genome when no variation has occurred and the sequencing sequence is the base sequence of the genome to be detected; determining a potential variation region of the genome according to the double sequence alignment results, where the potential variation region is a base coding interval of the genome in which potential variation occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining a variation detection result of the genome according to the multiple sequence alignment result. With this implementation, sequencing sequence fragments having the same variation type can be aligned together, so the sequencing sequences are aligned more accurately and variations of a single type are not wrongly aligned as different types of variation, which improves the accuracy of the genomic variation detection result.
With reference to the first aspect, in a first possible implementation of the first aspect, after performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain the multiple sequence alignment result, the method further includes: determining the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result; aggregating all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, where the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type; performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster; performing double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, where, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of its corresponding feature sequence. With this implementation, the feature sequence is first corrected by double sequence alignment with the reference sequence fragment, and the sequencing sequence fragments corresponding to the feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that some sequencing sequence fragments are offset relative to the reference sequence fragment in the multiple sequence alignment result, and improves the accuracy of the genomic variation detection result.
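The clustering and correction in this implementation can be illustrated with a minimal sketch. It assumes that variation types are available as hypothetical string labels such as "del@5:CC", and that the corrected variation type of each cluster's feature sequence has already been obtained from a double sequence alignment that is not shown here. The function names, labels and data are invented for illustration.

```python
from collections import defaultdict


def cluster_by_variation_type(fragment_types):
    """Group fragment ids so that each cluster shares one variation type."""
    clusters = defaultdict(list)
    for fragment_id, variation_type in fragment_types.items():
        clusters[variation_type].append(fragment_id)
    return dict(clusters)


def apply_correction(clusters, corrected_types):
    """Give every fragment the corrected type of its cluster's feature sequence."""
    return {fragment_id: corrected_types[cluster_key]
            for cluster_key, members in clusters.items()
            for fragment_id in members}


# Variation types read off a multiple sequence alignment (invented labels).
fragment_types = {"r1": "del@5:CC", "r2": "del@5:CC",
                  "r3": "del@7:CC", "r4": "none"}
clusters = cluster_by_variation_type(fragment_types)

# Hypothetical outcome of re-aligning each feature sequence to the reference
# fragment: the cluster labelled "del@7:CC" turns out to be the same deletion.
corrected_types = {"del@5:CC": "del@5:CC", "del@7:CC": "del@5:CC", "none": "none"}
corrected = apply_correction(clusters, corrected_types)
print(corrected)
```

In this invented example, fragment r3 was offset in the multiple sequence alignment and is brought back in line with r1 and r2 once its cluster's feature sequence has been re-aligned.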
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, after performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each sequencing sequence cluster, the method further includes: performing double sequence alignment on any two of the obtained feature sequences; determining whether there are two feature sequences whose overlapping regions match exactly and for which the variation position of at least one of the two feature sequences lies completely within the overlapping region; and, when such two feature sequences exist, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster. With this implementation, sequencing sequence clusters that meet the merging condition are further merged to increase the length of the feature sequences, which improves the accuracy of the double sequence alignment between the feature sequences and the reference sequence fragment.
With reference to the first aspect, in a third possible implementation of the first aspect, determining the potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into multiple coding intervals according to the base coding order of the genome; determining the variation types of all the sequencing sequences according to the double sequence alignment results; computing in turn the probability distribution of sequencing sequences of different variation types within each coding interval; calculating the information entropy of each coding interval from the probability distribution; determining in turn whether the information entropy of each coding interval is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region. With this implementation, the potential variation region of the genome is determined by information entropy.
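The entropy-based selection can be sketched as follows. This passage does not spell out the entropy formula, so the sketch assumes Shannon entropy in bits, H = -sum(p_i * log2(p_i)), where p_i is the share of sequencing sequences of variation type i within a coding interval. The function names, the variation type labels, the interval boundaries and the threshold value are all illustrative assumptions.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Shannon entropy (bits) of the variation types seen in one interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def potential_regions_by_entropy(types_per_interval, threshold):
    """Return the intervals whose entropy exceeds the (assumed) threshold."""
    return [interval for interval, types in types_per_interval.items()
            if types and interval_entropy(types) > threshold]


# Invented data: one variation type label per sequencing sequence per interval
types_per_interval = {
    (0, 100): ["match"] * 10,
    (100, 200): ["match"] * 5 + ["deletion"] * 5,
    (200, 300): ["match"] * 9 + ["snp"],
}
regions = potential_regions_by_entropy(types_per_interval, threshold=0.5)
print(regions)
```

An interval in which all sequencing sequences agree has zero entropy, while an interval in which the sequencing sequences split between variation types has high entropy, which is what makes entropy usable as an indicator of a potential variation region.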
With reference to the first aspect, in a fourth possible implementation of the first aspect, determining the potential variation region of the genome according to the double sequence alignment results includes: dividing the genome into multiple coding intervals according to the base coding order of the genome; counting in turn the number of sequencing sequences in which variation occurs within each coding interval; determining whether the number of sequencing sequences in which variation occurs within each coding interval is greater than a second threshold; and, when the number of sequencing sequences in which variation occurs within a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region. With this implementation, the potential variation region of the genome is determined by the number of sequencing sequences in which variation occurs within a coding interval.
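A minimal sketch of this count-based selection follows. It assumes fixed-size coding intervals and represents each sequencing sequence only by the reference positions at which it shows a variation; both choices, as well as the function name and the sample numbers, are assumptions made for illustration.

```python
from collections import defaultdict


def potential_regions_by_count(read_variations, interval_size, threshold):
    """Return (start, end) intervals with more than `threshold` variant reads.

    read_variations: one list of variant positions per sequencing sequence.
    """
    counts = defaultdict(int)
    for positions in read_variations:
        # A read is counted once per interval, however many variants it has there
        for idx in {pos // interval_size for pos in positions}:
            counts[idx] += 1
    return sorted((idx * interval_size, (idx + 1) * interval_size)
                  for idx, n in counts.items() if n > threshold)


# Invented data: five reads, the last one without any variation
read_variations = [[105], [110, 120], [130], [250], []]
regions = potential_regions_by_count(read_variations, interval_size=100, threshold=2)
print(regions)
```

Here three reads vary within the interval from 100 to 200, which exceeds the assumed threshold of 2, while the single variant read in the interval from 200 to 300 does not.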
With reference to the first aspect, in a fifth possible implementation of the first aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: extracting the part of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.
With reference to the first aspect, in a sixth possible implementation of the first aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: when a sequencing sequence intersects the potential variation region, extracting that entire sequencing sequence as the sequencing sequence fragment.
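The difference between extracting only the intersecting part and extracting the whole sequencing sequence can be shown with a short Python sketch. It assumes 0-based, half-open coordinates and a sequencing sequence placed on the genome without gaps, so that offsets in the sequence map directly to genome positions; the function name and the whole_read flag are hypothetical.

```python
def extract_fragment(read_start, read_seq, region_start, region_end, whole_read=False):
    """Extract a sequencing sequence fragment for one potential variation region.

    Returns None when the sequencing sequence does not intersect the region.
    With whole_read=False only the intersecting part is returned; with
    whole_read=True the entire sequencing sequence is returned.
    """
    read_end = read_start + len(read_seq)
    lo = max(read_start, region_start)
    hi = min(read_end, region_end)
    if lo >= hi:
        return None  # no intersection with the region
    if whole_read:
        return read_seq
    return read_seq[lo - read_start:hi - read_start]
```

Keeping the whole sequencing sequence preserves context outside the region, while keeping only the intersection gives the later multiple sequence alignment shorter inputs.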
With reference to the first aspect, in a seventh possible implementation of the first aspect, extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the part of the reference sequence that intersects the potential variation region as the reference sequence fragment.
With reference to the first aspect, in an eighth possible implementation of the first aspect, determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining the variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all sequencing sequence fragments at the variation position; grouping all sequencing sequence fragments into at least one sequencing sequence set according to the variation information, where the sequencing sequence fragments in the same set have the same variation information at the variation position; determining, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
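The grouping-and-threshold step can be sketched as follows. The sketch assumes that the variation information of each sequencing sequence fragment at one variation position has already been reduced to a hashable label, and that fragments showing no variation at that position are left out by the caller; the label strings and the function name are hypothetical.

```python
from collections import defaultdict

def call_variations(fragment_variations, third_threshold):
    """Group fragments by variation information and keep well-supported groups.

    fragment_variations: mapping from fragment id to the variation information
    observed at the variation position (an assumed, pre-computed input).
    Returns the variation information of every set whose size exceeds the
    third threshold.
    """
    groups = defaultdict(list)
    for fragment_id, info in fragment_variations.items():
        groups[info].append(fragment_id)
    return [info for info, members in groups.items()
            if len(members) > third_threshold]
```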
In a second aspect, an embodiment of the present application further provides a genomic variation detection apparatus. The apparatus includes: a first double sequence alignment unit, configured to perform double sequence alignment of each of multiple sequencing sequences of a genome with a reference sequence to obtain a double sequence alignment result, where the reference sequence is the base sequence of the genome when no variation has occurred and a sequencing sequence is a base sequence of the genome to be detected; a potential variation region determining unit, configured to determine a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval of the genome in which a potential variation occurs; a sequencing sequence fragment extraction unit, configured to extract sequencing sequence fragments from all sequencing sequences according to the potential variation region; a reference sequence fragment extraction unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region; a multiple sequence alignment unit, configured to perform multiple sequence alignment of the reference sequence fragment and all sequencing sequence fragments to obtain a multiple sequence alignment result; and a variation detection result determining unit, configured to determine the variation detection result of the genome according to the multiple sequence alignment result.
With reference to the second aspect, in a first possible implementation of the second aspect, the apparatus further includes: a variation type determining unit, configured to determine the variation types of all sequencing sequence fragments according to the multiple sequence alignment result; a sequencing sequence cluster grouping unit, configured to group all sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, where the sequencing sequence fragments in the same cluster have the same variation type; a union processing unit, configured to perform union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each cluster; a second double sequence alignment unit, configured to perform double sequence alignment of each feature sequence with the reference sequence fragment to obtain the variation type of each feature sequence; and a correction unit, configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, where, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
With reference to the second aspect, in a second possible implementation of the second aspect, the apparatus further includes: a third double sequence alignment unit, configured to perform double sequence alignment on any two of the obtained feature sequences of the sequencing sequence clusters; an overlap region determining unit, configured to determine whether there are two feature sequences whose overlap region matches completely and in which the variation position of at least one of the feature sequences lies entirely within the overlap region; and a merging unit, configured to, when there are two such feature sequences, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged cluster.
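The merge condition can be made concrete with a small Python sketch. It assumes that each feature sequence is stored with its start position on a shared alignment coordinate axis, that deleted bases are kept as placeholder characters so that positions line up, and that the variation position is a half-open span on the same axis. The Feature type, its fields, and the way the merged variation span is taken are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class Feature(NamedTuple):
    start: int      # alignment coordinate of seq[0]
    seq: str        # feature sequence, deletions kept as '.' placeholders
    var_start: int  # variation span, half-open, same coordinate axis
    var_end: int

def try_merge(a: Feature, b: Feature) -> Optional[Feature]:
    """Return the union of two feature sequences, or None if they must stay apart."""
    lo = max(a.start, b.start)
    hi = min(a.start + len(a.seq), b.start + len(b.seq))
    if lo >= hi:
        return None  # no overlap region
    if a.seq[lo - a.start:hi - a.start] != b.seq[lo - b.start:hi - b.start]:
        return None  # overlap region does not match completely

    def inside(f: Feature) -> bool:
        return lo <= f.var_start and f.var_end <= hi

    if not (inside(a) or inside(b)):
        return None  # neither variation lies entirely within the overlap
    left, right = (a, b) if a.start <= b.start else (b, a)
    left_end = left.start + len(left.seq)
    right_end = right.start + len(right.seq)
    # Union: the left sequence plus whatever the right one adds beyond it.
    merged = left.seq + right.seq[left_end - right.start:] if right_end > left_end else left.seq
    return Feature(left.start, merged,
                   min(a.var_start, b.var_start), max(a.var_end, b.var_end))
```

When try_merge returns a feature sequence, the two corresponding sequencing sequence clusters would be merged as well; that bookkeeping is omitted here.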
With reference to the second aspect, in a third possible implementation of the second aspect, the potential variation region determining unit includes: a first coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation type determining subunit, configured to determine the variation types of all sequencing sequences according to the double sequence alignment result; a probability distribution value statistics subunit, configured to compute, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types within that interval; an information entropy calculation subunit, configured to calculate the information entropy of each coding interval from the probability distribution values; a first threshold determining subunit, configured to determine, for each coding interval in turn, whether its information entropy is greater than a first threshold; and a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of that coding interval is greater than the first threshold.
With reference to the second aspect, in a fourth possible implementation of the second aspect, the potential variation region determining unit includes: a second coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation count statistics subunit, configured to count, for each coding interval in turn, the number of sequencing sequences in which a variation occurs within that interval; a second threshold determining subunit, configured to determine whether the number of varied sequencing sequences in each coding interval is greater than a second threshold; and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of varied sequencing sequences in that coding interval is greater than the second threshold.
With reference to the second aspect, in a fifth possible implementation of the second aspect, the sequencing sequence fragment extraction unit is specifically configured to extract the part of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.
With reference to the second aspect, in a sixth possible implementation of the second aspect, the sequencing sequence fragment extraction unit is specifically configured to, when the intersection determining subunit determines that a sequencing sequence intersects the potential variation region, extract that sequencing sequence as the sequencing sequence fragment.
With reference to the second aspect, in a seventh possible implementation of the second aspect, the reference sequence fragment extraction unit is specifically configured to extract the part of the reference sequence that intersects the potential variation region as the reference sequence fragment.
With reference to the second aspect, in an eighth possible implementation of the second aspect, the variation detection result determining unit includes: a variation position determining subunit, configured to determine the variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all sequencing sequence fragments at the variation position; a sequencing sequence set grouping subunit, configured to group all sequencing sequence fragments into at least one sequencing sequence set according to the variation information, where the sequencing sequence fragments in the same set have the same variation information at the variation position; a third threshold determining subunit, configured to determine, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and a variation detection result determining subunit, configured to, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determine that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
In a third aspect, an embodiment of the present application further provides a genomic variation detection terminal. The terminal includes: a processor; and a memory configured to store instructions executable by the processor; where the processor is configured to perform the following steps: performing double sequence alignment of each of multiple sequencing sequences of a genome with a reference sequence to obtain a double sequence alignment result, where the reference sequence is the base sequence of the genome when no variation has occurred and a sequencing sequence is a base sequence of the genome to be detected; determining a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval of the genome in which a potential variation occurs; extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment of the reference sequence fragment and all sequencing sequence fragments to obtain a multiple sequence alignment result; and determining the variation detection result of the genome according to the multiple sequence alignment result.
With reference to the third aspect, in a first possible implementation of the third aspect, after the multiple sequence alignment of the reference sequence fragment and all sequencing sequence fragments is performed to obtain the multiple sequence alignment result, the steps further include: determining the variation types of all sequencing sequence fragments according to the multiple sequence alignment result; grouping all sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, where the sequencing sequence fragments in the same cluster have the same variation type; performing union processing on all sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each cluster; performing double sequence alignment of each feature sequence with the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, where, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
With reference to the third aspect, in a second possible implementation of the third aspect, after the union processing is performed on all sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each cluster, the steps further include: performing double sequence alignment on any two of the obtained feature sequences of the sequencing sequence clusters; determining whether there are two feature sequences whose overlap region matches completely and in which the variation position of at least one of the feature sequences lies entirely within the overlap region; and, when there are two such feature sequences, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged cluster.
With reference to the third aspect, in a third possible implementation of the third aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into multiple coding intervals according to the base coding order of the genome; determining the variation types of all sequencing sequences according to the double sequence alignment result; computing, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types within that interval; calculating the information entropy of each coding interval from the probability distribution values; determining, for each coding interval in turn, whether its information entropy is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region.
With reference to the third aspect, in a fourth possible implementation of the third aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into multiple coding intervals according to the base coding order of the genome; counting, for each coding interval in turn, the number of sequencing sequences in which a variation occurs within that interval; determining whether the number of varied sequencing sequences in each coding interval is greater than a second threshold; and, when the number of varied sequencing sequences in a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region.
With reference to the third aspect, in a fifth possible implementation of the third aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: extracting the part of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.
With reference to the third aspect, in a sixth possible implementation of the third aspect, extracting sequencing sequence fragments from all sequencing sequences according to the potential variation region includes: when a sequencing sequence intersects the potential variation region, extracting that entire sequencing sequence as the sequencing sequence fragment.
With reference to the third aspect, in a seventh possible implementation of the third aspect, extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the part of the reference sequence that intersects the potential variation region as the reference sequence fragment.
With reference to the third aspect, in an eighth possible implementation of the third aspect, determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining the variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all sequencing sequence fragments at the variation position; grouping all sequencing sequence fragments into at least one sequencing sequence set according to the variation information, where the sequencing sequence fragments in the same set have the same variation information at the variation position; determining, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
In a fourth aspect, an embodiment of the present application further provides a storage medium. The storage medium may store a program, and when executed the program may perform some or all of the steps in the embodiments of the genomic variation detection method provided by the present application.
With the genomic variation detection method, apparatus, and terminal provided by the embodiments of the present application, multiple sequence alignment is performed on the reference sequence fragment and all sequencing sequence fragments to obtain a multiple sequence alignment result, and the variation detection result of the genome is determined from that result. Multiple sequence alignment tends to group and align the most similar sequences first. Aligning the reference sequence fragment together with all sequencing sequence fragments therefore brings fragments with the same variation type together, so the fragments are aligned more accurately. This avoids wrongly aligning variations of one type as variations of different types and improves the accuracy of the genomic variation detection result.
Brief Description of the Drawings
To describe the technical solutions of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1A is a schematic diagram of a double sequence alignment state of sequencing sequences and a reference sequence according to an embodiment of the present application;
FIG. 1B is a schematic diagram of the alignment state after the sequencing sequences in FIG. 1A are aligned and corrected according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a genomic variation detection method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of dividing a genome into coding intervals according to an embodiment of the present application;
FIG. 4A is a schematic diagram of a process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
FIG. 4B is a schematic diagram of another process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
FIG. 5B is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of another genomic variation detection method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a result of grouping sequencing sequence fragments into sequencing sequence clusters according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a union processing procedure according to an embodiment of the present application;
FIG. 9A is a schematic diagram of the feature sequences obtained by performing union processing on the sequencing sequence clusters in FIG. 7 according to an embodiment of the present application;
FIG. 9B is a schematic diagram of the double sequence alignment result obtained by aligning the feature sequences in FIG. 9A with the reference sequence fragment according to an embodiment of the present application;
FIG. 9C is a schematic diagram of the corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation types of the feature sequences in FIG. 9B according to an embodiment of the present application;
FIG. 10A is a schematic diagram of yet another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;
FIG. 10B is a schematic diagram of the feature sequences obtained by performing union processing on the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of yet another genomic variation detection method according to an embodiment of the present application;
FIG. 12A is a schematic diagram of a process of merging the feature sequences in FIG. 10B according to an embodiment of the present application;
FIG. 12B is a schematic diagram of a process of merging the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a first genomic variation detection apparatus according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a second genomic variation detection apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a third genomic variation detection apparatus according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a genomic variation detection terminal according to an embodiment of the present application.
Detailed Description
To help a person skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
FIG. 1A is a schematic diagram of a double sequence alignment state of sequencing sequences and a reference sequence according to an embodiment of the present application, and FIG. 1B is a schematic diagram of the alignment state after the sequencing sequences in FIG. 1A are aligned and corrected. In FIG. 1A and FIG. 1B, the two identical base sequences at the top and bottom represent the reference sequence, and the dashed bars represent sequencing sequences. Mismatches and deletions of a sequencing sequence relative to the reference sequence are marked in the sequencing sequence with base letters and dots, respectively.
Compare FIG. 1A with FIG. 1B. In FIG. 1A, some sequencing sequences contain both G->A (a mismatch in which base G is replaced by base A) and A->G (a mismatch in which base A is replaced by base G), and some sequencing sequences contain a deletion of TTTG (deletion of the base segment TTTG). In FIG. 1B, after the sequencing sequences are aligned and corrected, the sequencing sequences that contained both G->A and A->G have all been corrected into sequencing sequences containing the TTTG deletion. In other words, because the sequencing sequences in FIG. 1A are not aligned with one another, some sequencing sequences that actually carry the TTTG deletion are wrongly aligned as sequencing sequences carrying G->A and A->G; that is, one type of variation is wrongly aligned as a different type of variation. When the variation types of the genome are later counted, this readily makes the genomic variation detection result inaccurate.
To improve the accuracy of genomic variation detection results, the genomic variation detection method, device and terminal provided by the embodiments of this application perform a multiple sequence alignment of the reference sequence fragment together with all sequencing sequence fragments. Because multiple sequence alignment tends to cluster and align the most similar sequences first, aligning the reference sequence fragment and all sequencing sequence fragments together groups the sequencing sequence fragments that share the same variation type and aligns them with one another. The sequencing sequence fragments are therefore aligned more accurately, one type of variation is not misaligned into different types of variation, and the accuracy of the genomic variation detection result improves.
Figure 2 is a schematic flowchart of a genomic variation detection method according to an embodiment of this application. The method includes the following steps:
Step 201: Perform a pairwise alignment of each of the multiple sequencing sequences of the genome against the reference sequence to obtain pairwise alignment results.
In the embodiments of this application, the reference sequence is the base sequence of the genome when no variation has occurred; it represents the correct order of the bases in the genome. A sequencing sequence is a base sequence of the genome that is to be examined. The variation of a sequencing sequence can therefore be judged against the reference sequence: when the base order of the sequencing sequence agrees with that of the reference sequence, the sequencing sequence has not varied; when the base orders disagree, the sequencing sequence has varied. Variations of a sequencing sequence mainly comprise base mismatches, insertions and deletions.
Sequencing sequences are usually short sequence fragments. The more sequencing sequences there are, the more raw data the detection process yields, the more data is available when the detection results are statistically analysed in later steps, and the more accurate the genomic variation detection result becomes. Pairwise alignment of each sequencing sequence against the reference sequence locates each sequencing sequence at its corresponding position on the reference sequence and yields detailed variation information for each sequencing sequence relative to the reference sequence, including matches, mismatches, insertions and deletions.
Step 202: Determine the potential variation regions of the genome according to the pairwise alignment results.
In the field of genome analysis, each base in the genome is assigned a code (a positional number) so that bases can be located. A single code thus denotes one base pair in the genome, and a continuous coding interval denotes a stretch of bases in the genome.
In the embodiments of this application, the genome is first divided into multiple coding intervals following the order of its codes, and each coding interval is then checked in turn to decide whether it is a potential variation region. This gives a preliminary screening of the positions of genomic variation and improves detection efficiency.
In an optional embodiment of this application, the genome is divided into contiguous coding intervals of equal length, and the intervals are checked one by one, in order, until the whole genome has been traversed, so that no region is missed. The length of the coding interval can be adjusted as needed; for example, any length in the range of 50 to 300 bp (bp stands for base pair) may be chosen. This application places no limitation on it.
Figure 3 is a schematic diagram of the division of a genome into coding intervals according to an embodiment of this application. Because the reference sequence is the base sequence of the genome without variation, the coding intervals of the base pairs in the reference sequence are the coding intervals of the genome, so the scheme can be described using the coding intervals of the reference sequence. As shown in Figure 3, the genome is divided, along the order of its codes, into coding intervals 100 bp long, giving in turn the first coding interval (1510531, 1510630), the second coding interval (1510631, 1510730), the third coding interval (1510731, 1510830), the fourth coding interval (1510831, 1510930), and so on.
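As a non-limiting illustration, the division shown in Figure 3 can be sketched in Python as follows. The function name `split_into_intervals` and its parameters are hypothetical and do not appear in this application; intervals are assumed to be inclusive at both ends, as in the example values above.

```python
def split_into_intervals(start, end, length=100):
    """Split the inclusive code range [start, end] into contiguous
    coding intervals of `length` bp (the last one may be shorter)."""
    intervals = []
    pos = start
    while pos <= end:
        # Each interval covers `length` codes, clipped at the range end.
        intervals.append((pos, min(pos + length - 1, end)))
        pos += length
    return intervals


# The four intervals named in the Figure 3 example.
intervals = split_into_intervals(1510531, 1510930, length=100)
for interval in intervals:
    print(interval)
```

With the values of Figure 3 this yields (1510531, 1510630), (1510631, 1510730), (1510731, 1510830) and (1510831, 1510930).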
Once the coding intervals have been defined, each one is checked in turn to decide whether it is a potential variation region, and the potential variation regions of the genome are selected from among all coding intervals. Note that a genome may contain one potential variation region or several; this application places no limitation on the number.
There are several ways to decide whether a coding interval is a potential variation region. For example, information entropy reflects how mixed a set of sequences is: the larger the entropy, the more disordered the sequences and the more likely it is that the sequencing sequences have varied. In one possible implementation of this application, potential variation regions are therefore determined by information entropy. As another example, the more sequencing sequences within a coding interval have varied, the more likely the interval is a potential variation region. In another possible implementation of this application, potential variation regions are therefore determined by the number of varied sequencing sequences within the coding interval.
The method of determining a potential variation region by information entropy is as follows:
First, the variation type of every sequencing sequence is determined from the pairwise alignment results. Because the pairwise alignment result of a sequencing sequence against the reference sequence contains detailed match, mismatch, insertion and deletion information, the variation type of the sequencing sequence can be read directly from that result. In this document, sequencing sequences of the same variation type are sequencing sequences that carry exactly the same variation information relative to the reference sequence; a sequencing sequence without any variation also counts as one variation type.
After the variation types of all sequencing sequences have been determined, the probability distribution over the variation types is computed from that information. Specifically, for each variation type in the coding interval, the number of sequencing sequences of that type is divided by the total number of sequencing sequences, which gives the probability distribution values of the sequencing sequences of the different variation types, denoted p_i.
Suppose two variation types are present in the coding interval, a first variation type and a second variation type. The number of sequencing sequences of each type is counted. Dividing the number of sequencing sequences of the first variation type by the total number of sequencing sequences gives the probability value p_1 of the first variation type; dividing the number of sequencing sequences of the second variation type by the total number of sequencing sequences gives the probability value p_2 of the second variation type. Here p_1 and p_2 are the probability distribution values of the sequencing sequences of the different variation types in the coding interval.
The information entropy of the coding interval is then computed from the probability distribution values. Specifically, the values p_i are substituted into the information entropy formula H(U) = E[-log p_i], that is, H(U) = -Σ p_i log p_i summed over all variation types i, to obtain the information entropy H(U) of the coding interval.
It is then determined whether the information entropy H(U) of the coding interval is greater than a preset first threshold. When H(U) is greater than the first threshold, the coding interval is judged to be a potential variation region.
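The entropy-based screening described above can be sketched as follows. This is only an illustration under stated assumptions: the application does not fix the base of the logarithm (base 2 is assumed here), and the function names, the string labels used for variation types and the threshold value are hypothetical.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Information entropy H(U) = -sum(p_i * log2(p_i)) of one coding interval.

    `variation_types` holds one label per sequencing sequence in the
    interval; sequences without variation share one label of their own.
    The base-2 logarithm is an assumption, not taken from the application.
    """
    counts = Counter(variation_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p_i = (sequences of type i) / (all sequences in the interval)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_potential_region_by_entropy(variation_types, first_threshold):
    """True when the entropy exceeds the preset first threshold."""
    return interval_entropy(variation_types) > first_threshold


# Two variation types in equal proportion: p_1 = p_2 = 0.5.
labels = ["no_variation"] * 5 + ["del_TTTG"] * 5
print(interval_entropy(labels))
print(is_potential_region_by_entropy(labels, first_threshold=0.5))
```

An interval in which every sequencing sequence has the same variation type has entropy 0 and is never selected, whatever the (non-negative) threshold.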
The method of determining a potential variation region by the number of varied sequencing sequences within the coding interval is as follows:
First, the number of sequencing sequences in the coding interval that contain a variation is counted. Any sequencing sequence that does not match the reference sequence perfectly counts as containing a variation, including sequencing sequences with mismatches, insertions or deletions.
Based on this count, it is determined whether the number of varied sequencing sequences is greater than a second threshold. When it is, the coding interval is judged to be a potential variation region.
For example, in one possible implementation of this application, the second threshold is set to 50. When the number of varied sequencing sequences in a coding interval is greater than 50, the coding interval is determined to be a potential variation region; otherwise, it is determined not to be a potential variation region. A person skilled in the art may adjust the second threshold as needed; this application places no limitation on it.
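The count-based criterion can be sketched in the same way. The function name and the boolean-per-sequence input are assumptions made for illustration; only the comparison against the second threshold (50 in the example above) comes from the text.

```python
def is_potential_region_by_count(has_variation, second_threshold=50):
    """True when more than `second_threshold` sequencing sequences in the
    coding interval fail to match the reference sequence perfectly.

    `has_variation` holds one boolean per sequencing sequence: True if the
    sequence has any mismatch, insertion or deletion.
    """
    varied = sum(1 for flag in has_variation if flag)
    return varied > second_threshold


# 51 varied sequences out of 80 exceed the example threshold of 50.
flags = [True] * 51 + [False] * 29
print(is_potential_region_by_count(flags))
```

The comparison is strict, so an interval with exactly 50 varied sequencing sequences is not selected.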
Step 203: Extract sequencing sequence fragments from all sequencing sequences according to the potential variation region.
After a potential variation region has been determined, the sequencing sequence fragments that lie within it must be extracted from the sequencing sequences for analysis and processing in later steps.
In the embodiments of this application, to simplify the description of the extraction process, the coding interval of a sequencing sequence is defined as the interval of codes over which the sequencing sequence overlaps the reference sequence in their pairwise alignment result.
In one possible implementation of this application, the part of each sequencing sequence that intersects the potential variation region is extracted as a sequencing sequence fragment. When the coding interval of a sequencing sequence lies entirely within the coding interval of the potential variation region, the whole sequencing sequence is taken as the sequencing sequence fragment. When the two coding intervals intersect only partially, the part of the sequencing sequence that intersects the potential variation region is extracted as the sequencing sequence fragment. When the two coding intervals do not intersect, the sequencing sequence is discarded.
Figure 4A is a schematic diagram of a process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of this application. Figure 4A uses three different kinds of sequencing sequence to illustrate the extraction of sequencing sequence fragments. The coding interval of the potential variation region is (1510531, 1510630), that of the first sequencing sequence is (1510541, 1510590), that of the second sequencing sequence is (1510521, 1510570), and that of the third sequencing sequence is (1510651, 1510700).
The coding interval (1510541, 1510590) of the first sequencing sequence lies entirely within the coding interval (1510531, 1510630) of the potential variation region, so the whole first sequencing sequence is extracted as a sequencing sequence fragment. The coding interval (1510521, 1510570) of the second sequencing sequence partially intersects the coding interval (1510531, 1510630) of the potential variation region; the intersection is (1510531, 1510570), so the part of the second sequencing sequence with coding interval (1510531, 1510570) is extracted as a sequencing sequence fragment. The coding interval (1510651, 1510700) of the third sequencing sequence does not intersect the coding interval (1510531, 1510630) of the potential variation region, so the third sequencing sequence is discarded. The extracted sequencing sequence fragments are therefore the whole of the first sequencing sequence and the part of the second sequencing sequence with coding interval (1510531, 1510570).
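The three cases of Figure 4A reduce to an interval intersection, which can be sketched as follows. The function name `clip_to_region` is hypothetical, intervals are assumed inclusive at both ends, and the sketch works on coding intervals only; cutting the bases themselves out of an aligned sequencing sequence is not shown.

```python
def clip_to_region(sequence_interval, region):
    """Return the intersection of a sequencing sequence's coding interval
    with the potential variation region, or None if they do not intersect."""
    start = max(sequence_interval[0], region[0])
    end = min(sequence_interval[1], region[1])
    return (start, end) if start <= end else None


region = (1510531, 1510630)
first = clip_to_region((1510541, 1510590), region)   # fully inside: kept whole
second = clip_to_region((1510521, 1510570), region)  # partial overlap: clipped
third = clip_to_region((1510651, 1510700), region)   # no overlap: discarded
print(first, second, third)
```

Under these assumptions the first sequencing sequence is returned unchanged, the second is clipped to (1510531, 1510570), and the third yields `None`.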
As the above embodiment shows, when the coding interval of a sequencing sequence only partially intersects the coding interval of the potential variation region, the sequencing sequence is cut during extraction, and only its intersection with the potential variation region is taken as the sequencing sequence fragment. Cutting a sequencing sequence destroys its integrity and loses part of its information, which in turn affects the accuracy of the genomic variation detection result.
In another possible implementation of this application, it is first determined whether each sequencing sequence intersects the potential variation region; when it does, the whole sequencing sequence is extracted as a sequencing sequence fragment. This is equivalent to extending the potential variation region on the basis of the coding interval of the sequencing sequence whenever the two coding intervals partially intersect. Sequencing sequences are thus not cut during extraction, and their integrity is preserved.
Figure 4B is a schematic diagram of another process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of this application. The extraction in Figure 4B is largely the same as in Figure 4A. The difference concerns the second sequencing sequence: because its coding interval partially intersects the coding interval of the potential variation region, the union (1510521, 1510630) of the coding interval (1510521, 1510570) of the second sequencing sequence and the coding interval (1510531, 1510630) of the potential variation region is taken as the coding interval of the extended potential variation region. The sequencing sequence fragment is then extracted from the second sequencing sequence within the potential variation region, which by now has been updated to the extended potential variation region. Because the coding interval of the second sequencing sequence lies entirely within the extended region, the whole second sequencing sequence is extracted as a sequencing sequence fragment. In other words, in this implementation, if a sequencing sequence intersects the potential variation region, the whole sequencing sequence is extracted as a sequencing sequence fragment.
Note that the above way of extending the potential variation region is only one specific implementation shown in the embodiments of this application. A person skilled in the art may adapt it as needed, and all such adaptations fall within the scope of protection of this application. For example, before the sequencing sequence fragments are extracted, the union of the coding interval of the potential variation region and the coding intervals of all sequencing sequences in the region may be computed first, and the result used as the coding interval of the extended potential variation region. Here, the sequencing sequences in the potential variation region comprise those that partially intersect the region and those that lie entirely within it.
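The whole-sequence extraction of Figure 4B, combined with the union-based extension just described, can be sketched as follows. The function names are hypothetical and intervals are assumed inclusive; the sketch again handles coding intervals only.

```python
def intersects(a, b):
    """True if two inclusive coding intervals share at least one code."""
    return a[0] <= b[1] and b[0] <= a[1]


def extract_whole_sequences(sequence_intervals, region):
    """Keep every sequencing sequence that intersects the region, uncut,
    and extend the region to the union of itself and the kept sequences."""
    kept = [s for s in sequence_intervals if intersects(s, region)]
    extended_start = min([region[0]] + [s[0] for s in kept])
    extended_end = max([region[1]] + [s[1] for s in kept])
    return kept, (extended_start, extended_end)


sequences = [(1510541, 1510590), (1510521, 1510570), (1510651, 1510700)]
kept, extended_region = extract_whole_sequences(sequences, (1510531, 1510630))
print(kept)
print(extended_region)
```

With the values of Figure 4B, the first and second sequencing sequences are kept whole and the region is extended to (1510521, 1510630). The extended region is then the interval from which the reference sequence fragment is taken.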
In this embodiment, because sequencing sequences are not cut during the extraction of sequencing sequence fragments, their integrity is preserved, which improves the accuracy of genomic variation detection.
Step 204: Extract a reference sequence fragment from the reference sequence according to the potential variation region.
In the embodiments of this application, the reference sequence fragment is extracted from the reference sequence on the basis of the coding interval of the potential variation region. For example, in Figure 4A the coding interval of the potential variation region is (1510531, 1510630), so the part of the reference sequence with coding interval (1510531, 1510630) is extracted as the reference sequence fragment. In Figure 4B the coding interval of the extended potential variation region is (1510521, 1510630), so the part of the reference sequence with coding interval (1510521, 1510630) is extracted as the reference sequence fragment.
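Extracting the reference sequence fragment amounts to slicing the reference sequence by code. A minimal sketch follows, assuming the reference is held as a string whose first base has the code `reference_start`; that parameter and the function name are hypothetical, and the toy sequence is not data from this application.

```python
def extract_reference_fragment(reference, reference_start, region):
    """Return the bases of `reference` whose codes fall in the inclusive
    coding interval `region`."""
    lo = region[0] - reference_start
    hi = region[1] - reference_start + 1  # +1 because the interval is inclusive
    return reference[lo:hi]


# Toy example: the first base of the string has code 100.
fragment = extract_reference_fragment("ACGTACGTAC", 100, (102, 105))
print(fragment)
```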
Step 205: Perform a multiple sequence alignment of the reference sequence fragment and all sequencing sequence fragments to obtain a multiple sequence alignment result.
In the embodiments of this application, the reference sequence fragment and all sequencing sequence fragments are aligned together in a multiple sequence alignment. The multiple sequence alignment proceeds as follows:
Build the distance matrix: compute the distance between every pair of sequences (the distance between the reference sequence fragment and each sequencing sequence fragment, and the distance between every two sequencing sequence fragments) and assemble the pairwise distances into a distance matrix;
Construct the clustering tree: first cluster the two sequences with the smallest distance in the distance matrix, then update the distance matrix and cluster the two sequences, or two groups of sequences, with the smallest distance in the updated matrix, and so on until all sequences have been clustered, which gives the clustering tree of the reference sequence fragment and the sequencing sequence fragments;
Align the sequences: following the clustering hierarchy of the sequencing sequences and the reference sequence in the clustering tree, first align the two innermost sequences, then the next level, and so on until all sequencing sequence fragments and the reference sequence fragment have been aligned.
While the clustering tree is being built, the most similar sequences are clustered first according to the distances between sequences (the distance between two sequences represents their similarity: the smaller the distance, the higher the similarity). Aligning the reference sequence fragment and all sequencing sequence fragments together in a multiple sequence alignment therefore not only yields the variation types of the sequencing sequence fragments but also clusters and aligns the sequencing sequences that share the same variation type. One type of variation is thus not misaligned into different types of variation, which improves the accuracy of the genomic variation detection result.
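The first two stages, the distance matrix and the clustering tree, can be sketched as follows. The application does not specify the distance measure or how the distance matrix is updated after a merge; this sketch assumes edit distance and average linkage, and all names and toy sequences are hypothetical. The final progressive alignment stage is not shown.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two base strings (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, 1):
        cur = [i]
        for j, base_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (base_a != base_b)))  # mismatch
        prev = cur
    return prev[-1]


def build_clustering_tree(fragments):
    """fragments: dict mapping a name to a base string.
    Returns the clustering tree as nested tuples of names."""
    # Distance matrix over every pair of sequences.
    dist = {frozenset(pair): edit_distance(fragments[pair[0]], fragments[pair[1]])
            for pair in combinations(fragments, 2)}
    # Each cluster (a name or a nested tuple) maps to its member names.
    clusters = {name: [name] for name in fragments}

    def cluster_distance(x, y):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(m, n) for m in clusters[x] for n in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters, then repeat on the updated set.
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


fragments = {
    "ref": "ACGTTTGACA",
    "s1": "ACGACA",      # carries the TTTG deletion
    "s2": "ACGACA",      # carries the TTTG deletion
    "s3": "ACGTTTGACA",  # no variation
}
tree = build_clustering_tree(fragments)
print(tree)
```

In this toy input the two fragments carrying the same deletion are merged with each other before either is joined to the reference, which is the grouping behaviour the paragraph above relies on.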
Figure 5A is a schematic diagram of the multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of this application. In Figure 5A the sequencing sequence fragments exhibit three different variation types. After all sequencing sequence fragments have been aligned together with the first reference sequence fragment, the sequencing sequence fragments of each of the three variation types are clustered and aligned with one another. In addition, within a single haplotype of a diploid or polyploid genome, the sequencing sequence fragments usually share the same variation type. Aligning the reference sequence fragment and all sequencing sequence fragments together therefore also clusters the sequencing sequence fragments that belong to the same haplotype, which makes genomic variation detection possible for diploid and polyploid genomes.
Step 206: Determine the variation detection result of the genome according to the multiple sequence alignment result.
The multiple sequence alignment result contains detailed variation information for the sequencing sequence fragments, including the variation positions and the mismatch, insertion or deletion information at each variation position. The variation detection result of the genome can therefore be determined from the multiple sequence alignment result.
In the embodiments of this application, the variation positions in the potential variation region are first determined from the multiple sequence alignment result. The variation information of every sequencing sequence fragment at the variation position is then extracted from the multiple sequence alignment result. According to the variation information, all sequencing sequence fragments are grouped into at least one sequencing sequence set, such that the sequencing sequence fragments in the same set have the same variation information at the variation position. Each sequencing sequence set is then checked in turn to determine whether the number of sequencing sequence fragments it contains is greater than a third threshold. When the number of sequencing sequence fragments in a set is greater than the third threshold, the variation information of the sequencing sequence fragments in that set is determined to be a genomic variation detection result.
For example, in Figure 5A the multiple sequence alignment result shows that code 1510581 is a variation position in the potential variation region. The variation information of all sequencing sequence fragments at code 1510581 is extracted, and three kinds are found: no variation, insertion of the base segment CCT, and deletion of the base segment CCT. According to this variation information, all sequencing sequence fragments are grouped into three sequencing sequence sets: the first set (no variation; 11 sequencing sequence fragments), the second set (insertion of the base segment CCT; 7 sequencing sequence fragments) and the third set (deletion of the base segment CCT; 8 sequencing sequence fragments). Each set is then checked in turn to determine whether its number of sequencing sequence fragments is greater than the third threshold.
If the third threshold is 6, the number of sequencing sequence fragments in each of the three sets is greater than the third threshold, so the variation detection result of the genome at code 1510581 is: no variation; insertion of the base segment CCT; deletion of the base segment CCT. This can also indicate that the variation detection results of the three haplotypes of a triploid genome at code 1510581 are, respectively: no variation; insertion of the base segment CCT; deletion of the base segment CCT.
If the third threshold is 10, only the first of the three sets contains more sequencing sequence fragments than the third threshold, so the variation detection result of the genome at code 1510581 is: no variation.
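The grouping and thresholding of step 206 can be sketched as follows, using the fragment counts of the Figure 5A example. The function name and the string labels for the variation information are hypothetical; how the variation information is read out of the multiple sequence alignment result is not shown.

```python
from collections import Counter


def call_variations(variation_at_position, third_threshold):
    """Group sequencing sequence fragments by their variation information at
    one variation position and keep the groups whose size exceeds the
    third threshold."""
    set_sizes = Counter(variation_at_position)
    return [info for info, size in set_sizes.items() if size > third_threshold]


# Variation information at code 1510581 for the 26 fragments of the example.
observed = ["no_variation"] * 11 + ["ins_CCT"] * 7 + ["del_CCT"] * 8

print(call_variations(observed, third_threshold=6))
print(call_variations(observed, third_threshold=10))
```

With a third threshold of 6 all three kinds of variation information are reported; with a third threshold of 10 only the set without variation remains.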
Note that the above values of the third threshold are only examples given in the embodiments of this application. A person skilled in the art may adjust the third threshold as needed, and all such adjustments fall within the scope of protection of this application.
As the above embodiments show, aligning the reference sequence fragment and all sequencing sequence fragments together in a multiple sequence alignment clusters and aligns the sequencing sequence fragments that share the same variation type. The sequencing sequence fragments are aligned more accurately, one type of variation is not misaligned into different types of variation, and the accuracy of the genomic variation detection result improves.
However, because the way the clustering tree is constructed during multiple sequence alignment has some defects, sequencing sequence fragments in the multiple sequence alignment result may be shifted as a whole relative to the reference sequence fragment.
Referring to FIG. 5B, FIG. 5B is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. As shown in FIG. 5B, in the multiple sequence alignment result of the first reference sequence fragment and the sequencing sequence fragments, the sequencing sequence fragments that have the same variation type have been gathered and aligned, but some of them are shifted as a whole relative to the reference sequence fragment. Such a shift changes the variation type of a sequencing sequence fragment relative to the reference sequence fragment and thereby affects the accuracy of genomic variation detection. It is therefore necessary, after multiple sequence alignment is performed on the sequencing sequence fragments and the reference sequence fragment, to correct the variation types of the sequencing sequence fragments relative to the reference sequence fragment.
Referring to FIG. 6, FIG. 6 is a schematic flowchart of another genomic variation detection method according to an embodiment of the present application. On the basis of the embodiment shown in FIG. 2, the method may further include the following steps after step 205:
Step 601: Determine the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result.
In this embodiment of the present application, after multiple sequence alignment is performed on the reference sequence fragment together with all the sequencing sequence fragments, the sequencing sequence fragments that have the same variation type are gathered and aligned, and the variation type of every sequencing sequence fragment relative to the reference sequence fragment can be obtained, as shown in FIG. 5A and FIG. 5B. In FIG. 5B, some sequencing sequence fragments are shifted as a whole relative to the reference sequence fragment. To obtain the multiple sequence alignment result shown in FIG. 5A, the variation types of the shifted sequencing sequence fragments in FIG. 5B therefore need to be corrected.
Step 602: Gather all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments.
In this embodiment of the present application, all the sequencing sequence fragments are classified according to their variation types, and the sequencing sequence fragments that have the same variation type are gathered into the same sequencing sequence cluster, which facilitates correction of the variation types of the sequencing sequence fragments.
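Step 602 amounts to grouping fragments by a key. The sketch below assumes that each variation type can be represented as a hashable value (here a tuple of hypothetical `(kind, column, bases)` entries); the fragment identifiers and the representation are illustrative and are not taken from the application.

```python
from collections import defaultdict

def cluster_by_variation_type(fragments):
    """Group fragment ids so that each cluster holds one variation type."""
    clusters = defaultdict(list)
    for fragment_id, variation_type in fragments:
        clusters[variation_type].append(fragment_id)
    return dict(clusters)

# Hypothetical fragments: () means "no variation".
fragments = [
    ("read1", ()),
    ("read2", (("ins", 20, "CCT"),)),
    ("read3", ()),
    ("read4", (("ins", 20, "CCT"),)),
    ("read5", (("del", 14, "CGCCAG"),)),
]

clusters = cluster_by_variation_type(fragments)
for variation_type, members in clusters.items():
    print(variation_type, members)
```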
Referring to FIG. 7, FIG. 7 is a schematic diagram of a gathering result of sequencing sequence clusters according to an embodiment of the present application. According to the variation types of the sequencing sequence fragments, all the sequencing sequence fragments in the multiple sequence alignment result shown in FIG. 5B are gathered into three sequencing sequence clusters. The sequencing sequence fragments in the first sequencing sequence cluster have no variation; the sequencing sequence fragments in the second sequencing sequence cluster have an insertion of the base segment CCT; and the sequencing sequence fragments in the third sequencing sequence cluster have a deletion of the base segment CGCCAG and a mismatch of a stretch of bases.
Step 603: Perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster.
Because the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type relative to the reference sequence fragment, any two sequencing sequence fragments in the cluster have the same base sequence in their overlapping coding interval. Performing union processing on all the sequencing sequence fragments in a sequencing sequence cluster therefore means merging the overlapping coding intervals of the fragments to obtain the feature sequence of the cluster. The union processing is illustrated below with reference to the accompanying drawings.
Referring to FIG. 8, FIG. 8 is a schematic diagram of a union processing procedure according to an embodiment of the present application. FIG. 8 includes two sequencing sequence fragments: the coding interval of the first sequencing sequence fragment is (1,15), and the coding interval of the second sequencing sequence fragment is (4,18). The two fragments have the same base sequence TCCCCTCCTCCT in their overlapping coding interval (4,15). The overlapping coding intervals of the two fragments are merged, and the unmerged parts of the fragments serve as the head and the tail of the feature sequence, respectively, which yields the feature sequence GACTCCCCTCCTCCTCCT with the coding interval (1,18).
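The union processing of FIG. 8 can be sketched as below. The function name `union_fragments` and the `(start, end, bases)` tuple layout are assumptions made for illustration; the sketch also assumes 1-based inclusive coding intervals and fragments that together cover a contiguous range.

```python
def union_fragments(fragments):
    """Merge fragments given as (start, end, bases) into one feature sequence.

    Fragments of one cluster must agree wherever their intervals overlap.
    """
    merged = {}
    for start, end, bases in fragments:
        if len(bases) != end - start + 1:
            raise ValueError("interval length does not match base count")
        for offset, base in enumerate(bases):
            position = start + offset
            # setdefault keeps the first base seen and lets us check agreement.
            if merged.setdefault(position, base) != base:
                raise ValueError(f"fragments disagree at position {position}")
    low, high = min(merged), max(merged)
    return low, high, "".join(merged[p] for p in range(low, high + 1))

# The two fragments of FIG. 8, reconstructed from the intervals and the
# overlap sequence stated in the text.
first = (1, 15, "GACTCCCCTCCTCCT")
second = (4, 18, "TCCCCTCCTCCTCCT")
print(union_fragments([first, second]))
```

The result is the interval (1, 18) with the feature sequence GACTCCCCTCCTCCTCCT, which matches the example in the text.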
Referring to FIG. 9A, FIG. 9A is a schematic diagram of the feature sequences obtained by performing union processing on the sequencing sequence clusters in FIG. 7 according to an embodiment of the present application. Union processing is performed on all the sequencing sequence fragments in the first, second, and third sequencing sequence clusters in FIG. 7 to obtain the corresponding first, second, and third feature sequences.
Step 604: Perform double sequence alignment on each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence.
Because double sequence alignment yields the optimal alignment of two sequences, the variation type obtained by performing double sequence alignment on a feature sequence and the reference sequence fragment is the variation type of the feature sequence under the optimal alignment. On this basis, the sequencing sequence fragments corresponding to the feature sequence can be corrected in subsequent steps according to the variation type of the feature sequence.
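The application does not name a particular double sequence alignment algorithm. As one common choice, the sketch below uses Needleman-Wunsch global alignment with a simple linear gap penalty; the scoring values and the function name `global_align` are assumptions made for illustration only.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical reference fragment and feature sequence with one deleted base.
best, ref_row, feat_row = global_align("ACGT", "AGT")
print(best)
print(ref_row)
print(feat_row)
```

Gaps in the second row mark deletions relative to the reference, and gaps in the first row mark insertions, so the variation type of the feature sequence can be read off the two aligned rows.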
If the sequencing sequence fragments in a sequencing sequence cluster are shifted relative to the reference sequence fragment, the feature sequence obtained by performing union processing on those sequencing sequence fragments has the same shift, and performing double sequence alignment on the shifted feature sequence and the reference sequence fragment corrects the feature sequence. In other words, if the sequencing sequence fragments are shifted, the variation type of the corresponding feature sequence changes after double sequence alignment with the reference sequence fragment; if the sequencing sequence fragments are not shifted, the variation type of the corresponding feature sequence remains unchanged after double sequence alignment with the reference sequence fragment. Therefore, in this embodiment of the present application, whether the sequencing sequence fragments corresponding to a feature sequence need to be corrected can be determined by comparing the variation types of the feature sequence before and after double sequence alignment.
Referring to FIG. 9B, FIG. 9B is a schematic diagram of the double sequence alignment results obtained by performing double sequence alignment on the feature sequences in FIG. 9A and the reference sequence fragment according to an embodiment of the present application. Double sequence alignment is performed on each of the first, second, and third feature sequences shown in FIG. 9A and the reference sequence fragment, and the results are shown in FIG. 9B. A comparison of FIG. 9A and FIG. 9B shows that, after double sequence alignment with the reference sequence fragment, the variation types of the first and second feature sequences are unchanged, whereas the variation type of the third feature sequence has changed. In other words, the sequencing sequence fragments corresponding to the first and second feature sequences already achieved the best alignment after multiple sequence alignment and need no correction, whereas the sequencing sequence fragments corresponding to the third feature sequence are shifted as a whole relative to the reference sequence fragment and need further correction.
Step 605: Correct the multiple sequence alignment result according to the variation type of each feature sequence.
In this embodiment of the present application, the variation type of a feature sequence is used as the benchmark for correcting the variation types of the sequencing sequence fragments corresponding to that feature sequence, which corrects the multiple sequence alignment result. Specifically, when the variation type of a feature sequence differs from the variation type of its corresponding sequencing sequence fragments, the variation type of the sequencing sequence fragments is adjusted to the variation type of the feature sequence, so that in the corrected multiple sequence alignment result every sequencing sequence fragment has the same variation type as its corresponding feature sequence.
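The correction rule of step 605 can be sketched as follows. The data layout (dictionaries keyed by fragment and cluster names) and the variation type labels are hypothetical placeholders, not values read from the figures.

```python
def correct_alignment(fragment_types, cluster_of, feature_types):
    """Return corrected fragment types and the list of fragments that changed."""
    corrected, changed = {}, []
    for fragment, msa_type in fragment_types.items():
        feature_type = feature_types[cluster_of[fragment]]
        if feature_type != msa_type:
            changed.append(fragment)  # shifted in the multiple sequence alignment
        corrected[fragment] = feature_type
    return corrected, changed

# Hypothetical labels: cluster 3 was shifted in the multiple sequence alignment.
fragment_types = {"read1": "none", "read2": "ins CCT", "read3": "shifted type"}
cluster_of = {"read1": "cluster1", "read2": "cluster2", "read3": "cluster3"}
feature_types = {"cluster1": "none", "cluster2": "ins CCT",
                 "cluster3": "realigned type"}

corrected, changed = correct_alignment(fragment_types, cluster_of, feature_types)
print(corrected)
print(changed)
```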
For example, in FIG. 9B, the variation type of the third feature sequence changes after double sequence alignment with the reference sequence fragment, so that it differs from the variation type of the sequencing sequence fragments in the third sequencing sequence cluster. The variation type of the sequencing sequence fragments in the third sequencing sequence cluster therefore needs to be adjusted according to the variation type of the third feature sequence.
Referring to FIG. 9C, FIG. 9C is a schematic diagram of the corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation types of the feature sequences in FIG. 9B according to an embodiment of the present application. In FIG. 9C, the variation type of the sequencing sequence fragments in the third sequencing sequence cluster has been adjusted to the variation type of the third feature sequence.
As can be seen from the foregoing embodiment, in this embodiment of the present application the feature sequence is first corrected by double sequence alignment of the feature sequence and the reference sequence fragment, and the sequencing sequence fragments corresponding to the feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that some sequencing sequence fragments in the multiple sequence alignment result are shifted relative to the reference sequence fragment and improves the accuracy of the genomic variation detection result.
Generally, in double sequence alignment, the greater the difference in length between the two sequences, the more likely it is that multiple alignment results exist, that is, the more likely the double sequence alignment result is wrong. Accordingly, when double sequence alignment is performed on a feature sequence and the reference sequence fragment in step 604, the longer the feature sequence, the more accurate the double sequence alignment result.
Referring to FIG. 10A, FIG. 10A is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. In FIG. 10A, the sequencing sequence fragments are gathered into three sequencing sequence clusters according to their variation types: a fourth sequencing sequence cluster, a fifth sequencing sequence cluster, and a sixth sequencing sequence cluster.
Referring to FIG. 10B, FIG. 10B is a schematic diagram of the feature sequences obtained by performing union processing on the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application. Union processing is performed on all the sequencing sequence fragments in the fourth, fifth, and sixth sequencing sequence clusters to obtain the corresponding fourth, fifth, and sixth feature sequences.
A comparison of FIG. 10A and FIG. 10B shows that, because the sequencing sequence fragments in the fifth and sixth sequencing sequence clusters are short relative to the reference sequence fragment, the fifth and sixth feature sequences obtained by union processing are also short. If the fifth or the sixth feature sequence is aligned directly with the reference sequence fragment by double sequence alignment, an ideal alignment result may well not be obtained, which makes the variation type of the feature sequence inaccurate and in turn affects the correction of the sequencing sequence fragments.
Referring to FIG. 11, FIG. 11 is a schematic flowchart of another genomic variation detection method according to an embodiment of the present application. On the basis of the embodiment shown in FIG. 6, the method may further include the following steps after step 603:
Step 1101: Perform double sequence alignment on any two feature sequences among the obtained feature sequences of the sequencing sequence clusters.
In this embodiment of the present application, after the feature sequences of the sequencing sequence clusters are obtained, double sequence alignment is performed on any two of the feature sequences to determine whether the sequencing sequence clusters corresponding to the two feature sequences can be further merged. For example, for the feature sequences shown in FIG. 10B, double sequence alignment is performed on the fourth and fifth feature sequences, on the fourth and sixth feature sequences, and on the fifth and sixth feature sequences.
Step 1102: Determine whether there are two feature sequences whose overlapping region matches completely and for which the variation position of at least one of the two feature sequences lies entirely within the overlapping region.
If the overlapping region of two feature sequences does not match completely, the two feature sequences have different variation types within the overlapping region and cannot be merged. A complete match of the overlapping region is therefore the basic precondition for merging two feature sequences. Requiring that the variation position of at least one feature sequence relative to the reference sequence fragment lie entirely within the overlapping region ensures that the two feature sequences share at least one variation position with the same variation information within the overlapping region.
For example, at the first variation position shown in FIG. 10B, the fourth and fifth feature sequences both have a deletion of the base segment CC relative to the second reference sequence fragment, and the first variation position lies within the overlapping region of the fourth and fifth feature sequences, so the fourth and fifth feature sequences satisfy the above condition. At the second variation position shown in FIG. 10B, the fourth and sixth feature sequences both have an insertion of the base segment CC relative to the second reference sequence fragment, and the second variation position lies within the overlapping region of the fourth and sixth feature sequences, so the fourth and sixth feature sequences also satisfy the above condition.
If the above condition is satisfied, the method proceeds to step 1103 to further merge the sequencing sequence clusters; otherwise, the method proceeds to step 604, in which double sequence alignment is performed on each feature sequence and the reference sequence fragment.
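The condition of step 1102 can be sketched as below. The representation is an assumption made for illustration: each feature sequence is stored as a row in a shared column coordinate system (with `-` marking deleted bases), together with the column intervals of its variation positions. The sequences themselves are hypothetical and are not those of FIG. 10B.

```python
def overlap(a, b):
    """Return the shared column interval (lo, hi) of two rows, or None."""
    lo = max(a["start"], b["start"])
    hi = min(a["start"] + len(a["row"]) - 1, b["start"] + len(b["row"]) - 1)
    return (lo, hi) if lo <= hi else None

def can_merge(a, b):
    """True if the overlap matches exactly and at least one variation
    position of either feature sequence lies entirely inside it."""
    span = overlap(a, b)
    if span is None:
        return False
    lo, hi = span
    seg_a = a["row"][lo - a["start"]: hi - a["start"] + 1]
    seg_b = b["row"][lo - b["start"]: hi - b["start"] + 1]
    if seg_a != seg_b:
        return False
    return any(lo <= s and e <= hi
               for feature in (a, b)
               for s, e in feature["variations"])

# Hypothetical rows: both carry a two-base deletion in columns 5-6.
long_feature = {"start": 1, "row": "ACGT--TTGA", "variations": [(5, 6)]}
short_feature = {"start": 4, "row": "T--TTGACC", "variations": [(5, 6)]}
other_feature = {"start": 4, "row": "TCCTTGA", "variations": []}

print(can_merge(long_feature, short_feature))
print(can_merge(long_feature, other_feature))
```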
Step 1103: Merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
Because sequencing sequence clusters and feature sequences are in one-to-one correspondence, the feature sequences must be merged whenever the corresponding sequencing sequence clusters are merged. Merging the sequencing sequence clusters corresponding to two feature sequences means replacing the two original clusters with the merged cluster, which updates the set of sequencing sequence clusters. Performing union processing on two feature sequences means replacing the two original feature sequences with the feature sequence obtained by the union processing, which updates the set of feature sequences.
After step 1103 is completed, the method returns to step 1101 and continues to perform double sequence alignment on the feature sequences to determine whether any sequencing sequence clusters still satisfy the merging condition. The feature sequences in step 1101 include the feature sequences obtained by union processing, and the sequencing sequence clusters in step 1103 include the merged sequencing sequence clusters.
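The loop formed by steps 1101 to 1103 can be sketched generically. In this sketch the merge condition and the merge operation are passed in as functions, and the demonstration uses plain sets as stand-ins for clusters; the names `merge_until_stable`, `can_merge`, and `merge` are hypothetical.

```python
from itertools import combinations

def merge_until_stable(clusters, can_merge, merge):
    """Repeatedly merge any qualifying pair until no pair qualifies."""
    clusters = list(clusters)
    merged_any = True
    while merged_any:
        merged_any = False
        for i, j in combinations(range(len(clusters)), 2):
            if can_merge(clusters[i], clusters[j]):
                combined = merge(clusters[i], clusters[j])
                # Replace the two original clusters with the merged one,
                # then restart the pairwise scan (back to step 1101).
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(combined)
                merged_any = True
                break
    return clusters

# Toy stand-in: sets merge when they share an element.
result = merge_until_stable(
    [{1, 2}, {2, 3}, {5}, {3, 4}],
    can_merge=lambda a, b: bool(a & b),
    merge=lambda a, b: a | b,
)
print(result)
```

Restarting the scan after every merge mirrors the return from step 1103 to step 1101, so merged clusters are themselves candidates for further merging.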
Referring to FIG. 12A and FIG. 12B, FIG. 12A is a schematic diagram of the process of merging the feature sequences in FIG. 10B according to an embodiment of the present application, and FIG. 12B is a schematic diagram of the process of merging the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application. As shown in FIG. 12A, double sequence alignment is first performed on the fourth and fifth feature sequences. Because the overlapping region of the fourth and fifth feature sequences matches completely and the first variation position (the deletion of the base segment CC) lies entirely within the overlapping region, the fourth and fifth feature sequences are merged to obtain a seventh feature sequence. Correspondingly, as shown in FIG. 12B, the fourth and fifth sequencing sequence clusters are merged to obtain a seventh sequencing sequence cluster.
Further, double sequence alignment is performed on the seventh and sixth feature sequences. Because the overlapping region of the seventh and sixth feature sequences matches completely and the second variation position (the insertion of the base segment CC) lies entirely within the overlapping region, the seventh and sixth feature sequences are merged to obtain an eighth feature sequence. Correspondingly, the seventh and sixth sequencing sequence clusters are merged to obtain an eighth sequencing sequence cluster. In the subsequent step 604, only the eighth feature sequence is aligned with the reference sequence fragment by double sequence alignment, and the sequencing sequence fragments in the eighth sequencing sequence cluster are corrected according to the variation type of the eighth feature sequence. In step 604, double sequence alignment is performed on each feature sequence and the reference sequence fragment; here, "each feature sequence" includes both the feature sequences of sequencing sequence clusters that were not merged because they did not satisfy the merging condition and the feature sequences of merged sequencing sequence clusters obtained by subsequently merging sequencing sequence clusters.
As can be seen from the foregoing embodiment, in this embodiment of the present application, further merging the sequencing sequence clusters that satisfy the merging condition increases the length of the feature sequences and thereby improves the accuracy of the double sequence alignment result of the feature sequences and the reference sequence fragment.
Corresponding to the genomic variation detection method of the present application, the present application further provides a genomic variation detection apparatus.
Referring to FIG. 13, FIG. 13 is a schematic structural diagram of a first genomic variation detection apparatus according to an embodiment of the present application.
The first genomic variation detection apparatus 1300 may include: a first double sequence alignment unit 1301, a potential variation region determining unit 1302, a sequencing sequence fragment extracting unit 1303, a reference sequence fragment extracting unit 1304, a multiple sequence alignment unit 1305, and a variation detection result determining unit 1306.
The first double sequence alignment unit 1301 is configured to perform double sequence alignment on each of multiple sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, where the reference sequence is the base sequence of the genome when no variation has occurred, and the sequencing sequences are the base sequences of the genome to be detected.
The potential variation region determining unit 1302 is configured to determine a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval of the genome in which a potential variation occurs.
The sequencing sequence fragment extracting unit 1303 is configured to extract sequencing sequence fragments from all the sequencing sequences according to the potential variation region.
The reference sequence fragment extracting unit 1304 is configured to extract a reference sequence fragment from the reference sequence according to the potential variation region.
The multiple sequence alignment unit 1305 is configured to perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result.
The variation detection result determining unit 1306 is configured to determine a variation detection result of the genome according to the multiple sequence alignment result.
In a possible implementation of the present application, the potential variation region determining unit 1302 includes: a first coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation type determining subunit, configured to determine the variation types of all the sequencing sequences according to the double sequence alignment result; a probability distribution value statistics subunit, configured to count, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types within the coding interval; an information entropy calculating subunit, configured to calculate the information entropy of each coding interval according to the probability distribution values; a first threshold judging subunit, configured to judge, for each coding interval in turn, whether the information entropy of the coding interval is greater than a first threshold; and a first potential variation region deciding subunit, configured to decide that a coding interval is a potential variation region when the information entropy of the coding interval is greater than the first threshold.
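The entropy-based screening described above can be sketched as follows. The sketch assumes Shannon entropy with a base-2 logarithm over the empirical distribution of variation types in each coding interval; the logarithm base, the interval names, the labels, and the threshold value are assumptions made for illustration.

```python
import math
from collections import Counter

def interval_entropy(variation_types):
    """Shannon entropy (in bits) of the variation types seen in one interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def potential_variation_regions(intervals, first_threshold):
    """Return the intervals whose entropy exceeds the first threshold."""
    return [name for name, types in intervals.items()
            if interval_entropy(types) > first_threshold]

# Hypothetical per-interval variation types of the overlapping sequencing sequences.
intervals = {
    "interval A": ["none"] * 8,                # uniform: entropy 0
    "interval B": ["none"] * 4 + ["ins"] * 4,  # two equally likely types: 1 bit
}

print(interval_entropy(intervals["interval B"]))
print(potential_variation_regions(intervals, first_threshold=0.5))
```

An interval in which every sequencing sequence shows the same variation type has entropy 0, so only intervals with mixed evidence pass the threshold.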
In a possible implementation of the present application, the potential variation region determining unit 1302 includes: a second coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation quantity statistics subunit, configured to count, for each coding interval in turn, the number of sequencing sequences in which a variation occurs within the coding interval; a second threshold judging subunit, configured to judge whether the number of sequencing sequences in which a variation occurs within each coding interval is greater than a second threshold; and a second potential variation region deciding subunit, configured to decide that a coding interval is a potential variation region when the number of sequencing sequences in which a variation occurs within the coding interval is greater than the second threshold.
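The count-based alternative is simpler still. In the sketch below, each coding interval maps to a list of booleans saying whether each overlapping sequencing sequence shows a variation there; the interval names, the data, and the threshold are hypothetical.

```python
def regions_by_variant_count(variant_flags, second_threshold):
    """Return intervals in which more than second_threshold sequences vary."""
    return [name for name, flags in variant_flags.items()
            if sum(flags) > second_threshold]

# Hypothetical flags: True means the sequencing sequence varies in the interval.
variant_flags = {
    "interval A": [False, False, True, False],
    "interval B": [True, True, False, True, True],
}

print(regions_by_variant_count(variant_flags, second_threshold=3))
```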
In a possible implementation of the present application, the sequencing sequence fragment extracting unit 1303 is specifically configured to extract the part of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.
In a possible implementation of the present application, the sequencing sequence fragment extracting unit 1303 is specifically configured to extract a sequencing sequence in its entirety as the sequencing sequence fragment when the intersection judging subunit determines that the sequencing sequence intersects the potential variation region.
In a possible implementation of the present application, the reference sequence fragment extracting unit 1304 is specifically configured to extract the part of the reference sequence that intersects the potential variation region as the reference sequence fragment.
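The two extraction modes described above (the intersecting part only, or the whole sequencing sequence when it intersects the region) can be sketched as follows. The sketch assumes 1-based inclusive coding intervals and sequences without insertions or deletions, so that a base's coordinate is simply its start plus its offset; the function names are hypothetical.

```python
def extract_intersection(start, bases, region):
    """Return (start, bases) of the part of a sequence inside the region, or None."""
    end = start + len(bases) - 1
    lo, hi = max(start, region[0]), min(end, region[1])
    if lo > hi:
        return None  # no intersection with the potential variation region
    return lo, bases[lo - start: hi - start + 1]

def extract_whole(start, bases, region):
    """Return the whole sequence if it intersects the region, else None."""
    end = start + len(bases) - 1
    if max(start, region[0]) > min(end, region[1]):
        return None
    return start, bases

region = (8, 20)  # hypothetical potential variation region
print(extract_intersection(5, "ACGTACGT", region))
print(extract_whole(5, "ACGTACGT", region))
print(extract_intersection(1, "ACGT", region))
```

The same `extract_intersection` logic applies to the reference sequence, which always spans the region and therefore yields exactly the region's bases.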
In a possible implementation of the present application, the variation detection result determining unit 1306 includes: a variation position determining subunit, configured to determine variation positions in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation positions; a sequencing sequence set aggregation subunit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, where the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation positions; a third threshold determining subunit, configured to determine, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and a variation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
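As a rough illustration of the grouping and thresholding just described, the Python sketch below works on a toy multiple sequence alignment given as equal-length rows. The row encoding (`.` for positions a fragment does not cover), the tuple form of the variation information, and the choice to skip fragments that match the reference are assumptions of this example, not details taken from the application.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

VariationInfo = Tuple[Tuple[int, str, str], ...]  # (column, ref base, observed base)


def call_variations(ref_row: str, fragment_rows: List[str],
                    third_threshold: int) -> Dict[VariationInfo, List[int]]:
    """Group fragments by their variation information and keep the groups
    whose size exceeds the third threshold."""
    # Variation positions: columns where some fragment differs from the reference.
    columns = [
        i for i in range(len(ref_row))
        if any(row[i] not in (".", ref_row[i]) for row in fragment_rows)
    ]
    groups: Dict[VariationInfo, List[int]] = defaultdict(list)
    for index, row in enumerate(fragment_rows):
        info = tuple(
            (i, ref_row[i], row[i])
            for i in columns
            if row[i] not in (".", ref_row[i])
        )
        groups[info].append(index)
    # An empty tuple means "no variation seen"; it is not reported as a result.
    return {
        info: members
        for info, members in groups.items()
        if info and len(members) > third_threshold
    }


ref = "ACGTAC"
fragments = ["ACCTAC", "ACCTAC", "ACCTA.", "ACGTAC", "ACGAAC"]
print(call_variations(ref, fragments, third_threshold=2))
```

Here three fragments support a G to C change at column 2 and pass the threshold, while the single fragment with a change at column 3 is treated as insufficiently supported.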
Refer to FIG. 14, which is a schematic structural diagram of a second genomic variation detection apparatus according to an embodiment of the present application.
On the basis of the first genomic variation detection apparatus 1300 shown in FIG. 13, the second genomic variation detection apparatus 1400 further includes: a variation type determining unit 1401, a sequencing sequence cluster aggregation unit 1402, a union processing unit 1403, a second double sequence alignment unit 1404, and a correction unit 1405.
The variation type determining unit 1401 is configured to determine the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result.
The sequencing sequence cluster aggregation unit 1402 is configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, where the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type.
The union processing unit 1403 is configured to perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster, to obtain a feature sequence of each sequencing sequence cluster.
The second double sequence alignment unit 1404 is configured to perform double sequence alignment between each feature sequence and the reference sequence fragment, to obtain the variation type of each feature sequence.
The correction unit 1405 is configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, where, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.
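The clustering and union steps performed by units 1402 and 1403 can be sketched in Python as follows; the re-alignment and correction steps of units 1404 and 1405 are not covered. The sketch assumes that fragments are positioned on the reference without gaps, that the variation type is available as a hashable label, and that uncovered positions are filled with `N`; all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Fragment = Tuple[int, str]  # (start on the reference, bases)


def union_fragments(fragments: List[Fragment]) -> Fragment:
    """Merge fragments into one feature sequence spanning all of them.

    Where fragments overlap, the first base written is kept; fragments in
    one cluster are assumed to agree on their overlapping bases.
    """
    start = min(s for s, _ in fragments)
    end = max(s + len(bases) for s, bases in fragments)
    merged = ["N"] * (end - start)
    for s, bases in fragments:
        for offset, base in enumerate(bases):
            if merged[s - start + offset] == "N":
                merged[s - start + offset] = base
    return start, "".join(merged)


def cluster_and_union(fragments: List[Fragment],
                      variation_types: List[Hashable]) -> Dict[Hashable, Fragment]:
    """Group fragments by variation type and build each cluster's feature sequence."""
    clusters: Dict[Hashable, List[Fragment]] = defaultdict(list)
    for fragment, variation_type in zip(fragments, variation_types):
        clusters[variation_type].append(fragment)
    return {vt: union_fragments(members) for vt, members in clusters.items()}


fragments = [(0, "ACCT"), (2, "CTAC"), (1, "CGTA")]
types = ["SNP@2:G>C", "SNP@2:G>C", "none"]
print(cluster_and_union(fragments, types))
```

Because a feature sequence is longer than any single fragment in its cluster, aligning it against the reference sequence fragment can place a variation more reliably than aligning each short fragment on its own, which is presumably the motivation for the correction step.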
Refer to FIG. 15, which is a schematic structural diagram of a third genomic variation detection apparatus according to an embodiment of the present application.
On the basis of the second genomic variation detection apparatus 1400 shown in FIG. 14, the third genomic variation detection apparatus 1500 further includes: a third double sequence alignment unit 1501, an overlapping region determining unit 1502, and a merging unit 1503.
The third double sequence alignment unit 1501 is configured to perform double sequence alignment on any two of the feature sequences obtained for the sequencing sequence clusters.
The overlapping region determining unit 1502 is configured to determine whether there are two feature sequences whose overlapping regions match exactly, with the variation position of at least one of the two feature sequences lying entirely within the overlapping region.
The merging unit 1503 is configured to, when there are two feature sequences whose overlapping regions match exactly and the variation position of at least one of them lies entirely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
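A minimal Python sketch of the merge test follows. It replaces the double sequence alignment of unit 1501 with a direct comparison in shared reference coordinates, which only holds under the simplifying assumption that feature sequences contain no insertions or deletions; the dictionary layout and the function name `try_merge` are likewise assumptions of this example.

```python
from typing import Dict, Optional


def try_merge(a: Dict, b: Dict) -> Optional[Dict]:
    """Merge two feature sequences if their overlap matches exactly and the
    variation positions of at least one lie entirely inside the overlap.

    Each feature sequence is {"start": int, "seq": str, "variants": [int, ...]}
    with variants given in reference coordinates.
    """
    lo = max(a["start"], b["start"])
    hi = min(a["start"] + len(a["seq"]), b["start"] + len(b["seq"]))
    if lo >= hi:
        return None  # no overlapping region
    if a["seq"][lo - a["start"]:hi - a["start"]] != b["seq"][lo - b["start"]:hi - b["start"]]:
        return None  # overlapping regions do not match exactly

    def variants_inside(feature: Dict) -> bool:
        return bool(feature["variants"]) and all(lo <= p < hi for p in feature["variants"])

    if not (variants_inside(a) or variants_inside(b)):
        return None

    # Union: the earlier sequence plus whatever the later one adds beyond its end.
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    first_end = first["start"] + len(first["seq"])
    tail = second["seq"][max(0, first_end - second["start"]):]
    return {
        "start": first["start"],
        "seq": first["seq"] + tail,
        "variants": sorted(set(a["variants"]) | set(b["variants"])),
    }


a = {"start": 0, "seq": "ACCTAC", "variants": [2]}
b = {"start": 2, "seq": "CTACGG", "variants": [2]}
c = {"start": 2, "seq": "GTACGG", "variants": []}
print(try_merge(a, b))
print(try_merge(a, c))
```

In the toy data, `a` and `b` agree on the overlap [2, 6) and both carry their variation inside it, so their clusters would be merged; `c` disagrees at position 2 and is kept separate.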
For the relationships among the functional units of the genomic variation detection apparatus provided in the embodiments of the present application, refer to the steps of the foregoing genomic variation detection method; details are not repeated here.
Corresponding to the genomic variation detection method of the present application, the present application further provides a genomic variation detection terminal.
Refer to FIG. 16, which is a schematic structural diagram of a genomic variation detection terminal according to an embodiment of the present application. The genomic variation detection terminal 1600 may include a processor 1601, a memory 1602, and a communication unit 1603. These components communicate over one or more buses. A person skilled in the art will understand that the structure shown in the figure does not limit the present application: it may be a bus structure or a star structure, and it may include more or fewer components than shown, combine certain components, or arrange the components differently.
The communication unit 1603 is configured to establish a communication channel so that the terminal can communicate with other devices, receiving user data sent by other devices or sending user data to other devices.
The processor 1601 is the control center of the terminal. It connects the parts of the entire electronic device through various interfaces and lines, and performs the functions of the electronic device and/or processes data by running or executing software programs and/or modules stored in the memory 1602 and by invoking data stored in the memory. The processor may consist of integrated circuits (ICs): for example, a single packaged IC, or multiple connected packaged ICs with the same or different functions. For example, the processor 1601 may include only a central processing unit (CPU). In the embodiments of the present application, the CPU may have a single computing core or multiple computing cores.
The memory 1602 is configured to store instructions executable by the processor 1601. The memory 1602 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
When the instructions in the memory 1602 are executed by the processor 1601, the genomic variation detection terminal 1600 is enabled to perform the following steps:
performing double sequence alignment between each of multiple sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, where the reference sequence is the base sequence of the genome without variation and the sequencing sequences are the base sequences of the genome to be detected; determining a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval of the genome in which a potential variation occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining a variation detection result of the genome according to the multiple sequence alignment result.
In a specific implementation, the present application further provides a computer storage medium. The computer storage medium may store a program which, when executed, may perform some or all of the steps of the embodiments of the genomic variation detection method provided in the present application. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
A person skilled in the art will clearly understand that the techniques in the embodiments of the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
For parts that are the same or similar among the embodiments in this specification, the embodiments may be referred to one another. In particular, the apparatus embodiments and the terminal embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments.
The embodiments of the present application described above do not limit the protection scope of the present application.

Claims (19)

  1. A genomic variation detection method, comprising:
    performing double sequence alignment between each of multiple sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome without variation, and the sequencing sequences are the base sequences of the genome to be detected;
    determining a potential variation region of the genome according to the double sequence alignment result, wherein the potential variation region is a base coding interval of the genome in which a potential variation occurs;
    extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region;
    extracting a reference sequence fragment from the reference sequence according to the potential variation region;
    performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and
    determining a variation detection result of the genome according to the multiple sequence alignment result.
  2. The genomic variation detection method according to claim 1, wherein after performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain the multiple sequence alignment result, the method further comprises:
    determining the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result;
    aggregating all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type;
    performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster;
    performing double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and
    correcting the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that sequencing sequence fragment.
  3. The genomic variation detection method according to claim 2, wherein after performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each sequencing sequence cluster, the method further comprises:
    performing double sequence alignment on any two of the feature sequences obtained for the sequencing sequence clusters;
    determining whether there are two feature sequences whose overlapping regions match exactly, with the variation position of at least one of the two feature sequences lying entirely within the overlapping region; and
    when there are two feature sequences whose overlapping regions match exactly and the variation position of at least one of them lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
  4. The genomic variation detection method according to claim 1, wherein determining the potential variation region of the genome according to the double sequence alignment result comprises:
    dividing the genome into multiple coding intervals according to the base coding order of the genome;
    determining the variation types of all the sequencing sequences according to the double sequence alignment result;
    computing, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types in that coding interval;
    calculating the information entropy of each coding interval according to the probability distribution values;
    determining, for each coding interval in turn, whether its information entropy is greater than a first threshold; and
    when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region.
  5. The genomic variation detection method according to claim 1, wherein determining the potential variation region of the genome according to the double sequence alignment result comprises:
    dividing the genome into multiple coding intervals according to the base coding order of the genome;
    counting, for each coding interval in turn, the number of sequencing sequences in which a variation occurs;
    determining whether the number of sequencing sequences with a variation in each coding interval is greater than a second threshold; and
    when the number of sequencing sequences with a variation in a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region.
  6. The genomic variation detection method according to claim 1, wherein extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region comprises:
    extracting the intersection of each sequencing sequence with the potential variation region as a sequencing sequence fragment.
  7. The genomic variation detection method according to claim 1, wherein extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region comprises:
    when a sequencing sequence intersects the potential variation region, extracting that sequencing sequence as a sequencing sequence fragment.
  8. The genomic variation detection method according to claim 1, wherein extracting the reference sequence fragment from the reference sequence according to the potential variation region comprises:
    extracting the intersection of the reference sequence with the potential variation region as the reference sequence fragment.
  9. The genomic variation detection method according to claim 1, wherein determining the variation detection result of the genome according to the multiple sequence alignment result comprises:
    determining variation positions in the potential variation region according to the multiple sequence alignment result;
    extracting, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation positions;
    aggregating all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation positions;
    determining, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and
    when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
  10. A genomic variation detection apparatus, comprising:
    a first double sequence alignment unit, configured to perform double sequence alignment between each of multiple sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome without variation, and the sequencing sequences are the base sequences of the genome to be detected;
    a potential variation region determining unit, configured to determine a potential variation region of the genome according to the double sequence alignment result, wherein the potential variation region is a base coding interval of the genome in which a potential variation occurs;
    a sequencing sequence fragment extracting unit, configured to extract sequencing sequence fragments from all the sequencing sequences according to the potential variation region;
    a reference sequence fragment extracting unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region;
    a multiple sequence alignment unit, configured to perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and
    a variation detection result determining unit, configured to determine a variation detection result of the genome according to the multiple sequence alignment result.
  11. The genomic variation detection apparatus according to claim 10, further comprising:
    a variation type determining unit, configured to determine the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result;
    a sequencing sequence cluster aggregation unit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type;
    a union processing unit, configured to perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster;
    a second double sequence alignment unit, configured to perform double sequence alignment between each feature sequence and the reference sequence fragment to obtain the variation type of each feature sequence; and
    a correction unit, configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that sequencing sequence fragment.
  12. The genomic variation detection apparatus according to claim 11, further comprising:
    a third double sequence alignment unit, configured to perform double sequence alignment on any two of the feature sequences obtained for the sequencing sequence clusters;
    an overlapping region determining unit, configured to determine whether there are two feature sequences whose overlapping regions match exactly, with the variation position of at least one of the two feature sequences lying entirely within the overlapping region; and
    a merging unit, configured to, when there are two feature sequences whose overlapping regions match exactly and the variation position of at least one of them lies entirely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.
  13. The genomic variation detection apparatus according to claim 10, wherein the potential variation region determining unit comprises:
    a first coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome;
    a variation type determining subunit, configured to determine the variation types of all the sequencing sequences according to the double sequence alignment result;
    a probability distribution value statistics subunit, configured to compute, for each coding interval in turn, the probability distribution values of the sequencing sequences of different variation types in that coding interval;
    an information entropy calculation subunit, configured to calculate the information entropy of each coding interval according to the probability distribution values;
    a first threshold determining subunit, configured to determine, for each coding interval in turn, whether its information entropy is greater than a first threshold; and
    a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of that coding interval is greater than the first threshold.
  14. The genomic variation detection apparatus according to claim 10, wherein the potential variation region determining unit comprises:
    a second coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome;
    a variation quantity statistics subunit, configured to count, for each coding interval in turn, the number of sequencing sequences in which a variation occurs;
    a second threshold determining subunit, configured to determine whether the number of sequencing sequences with a variation in each coding interval is greater than a second threshold; and
    a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences with a variation in that coding interval is greater than the second threshold.
  15. The genomic variation detection apparatus according to claim 10, wherein
    the sequencing sequence fragment extracting unit is specifically configured to extract the intersection of each sequencing sequence with the potential variation region as a sequencing sequence fragment.
  16. The genomic variation detection apparatus according to claim 10, wherein
    the sequencing sequence fragment extracting unit is specifically configured to extract a sequencing sequence as a sequencing sequence fragment when the intersection determining subunit determines that the sequencing sequence intersects the potential variation region.
  17. The genomic variation detection apparatus according to claim 10, wherein
    the reference sequence fragment extracting unit is specifically configured to extract the intersection of the reference sequence with the potential variation region as the reference sequence fragment.
  18. The genomic variation detection apparatus according to claim 10, wherein the variation detection result determining unit comprises:
    a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result;
    a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position;
    a sequencing sequence set aggregation subunit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in a same sequencing sequence set have the same variation information at the variation position;
    a third threshold determining subunit, configured to determine, for each sequencing sequence set in turn, whether the number of sequencing sequence fragments in the set is greater than a third threshold; and
    a variation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, that the variation information of the sequencing sequence fragments in that set is the variation detection result of the genome.
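
The subunits recited above can be read as a short pipeline: find the variation positions, read off each fragment's variation information, group fragments with identical information, and keep the groups whose size exceeds the third threshold. The sketch below is one possible reading, not the patent's implementation. It assumes the multiple sequence alignment result is given as equal-length rows (one reference row plus one row per fragment), treats any column where a fragment differs from the reference as a variation position, and drops groups that match the reference everywhere. The function name `detect_variants` and the output format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A call is (alignment column, reference character, observed character).
Call = Tuple[int, str, str]


def detect_variants(ref_row: str, frag_rows: List[str],
                    third_threshold: int) -> List[Tuple[List[Call], int]]:
    """Group aligned fragments by their variation information and threshold them.

    Assumes every row has the same length as ref_row (hypothetical input format).
    """
    # Variation positions: columns where at least one fragment differs from the reference.
    variant_cols = [i for i, ref_char in enumerate(ref_row)
                    if any(row[i] != ref_char for row in frag_rows)]

    # Variation information of a fragment: its characters at the variation positions.
    groups: Dict[Tuple[str, ...], int] = defaultdict(int)
    for row in frag_rows:
        groups[tuple(row[i] for i in variant_cols)] += 1

    results = []
    for info, support in groups.items():
        # Keep only sets supported by more than third_threshold fragments.
        if support > third_threshold:
            calls = [(col, ref_row[col], obs)
                     for col, obs in zip(variant_cols, info) if obs != ref_row[col]]
            if calls:  # assumption: a set identical to the reference reports nothing
                results.append((calls, support))
    return results
```

With reference row `ACGTA`, fragments `ACGTA`, `ACTTA` (three copies) and `AGGTA`, and a third threshold of 2, only the three-fragment set passes, so the singleton `AGGTA` is treated as unsupported and is not reported.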
  19. A genomic variation detection terminal, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 9.
PCT/CN2016/079745 2016-04-20 2016-04-20 Method, device and terminal for detecting genome variations WO2017181368A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/079745 WO2017181368A1 (en) 2016-04-20 2016-04-20 Method, device and terminal for detecting genome variations
CN201680084673.7A CN109074429B (en) 2016-04-20 2016-04-20 Genome variation detection method, device and terminal

Publications (1)

Publication Number Publication Date
WO2017181368A1 (en) 2017-10-26

Family

ID=60116530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/079745 WO2017181368A1 (en) 2016-04-20 2016-04-20 Method, device and terminal for detecting genome variations

Country Status (2)

Country Link
CN (1) CN109074429B (en)
WO (1) WO2017181368A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789823A (en) * 2024-02-27 2024-03-29 中国人民解放军军事科学院军事医学研究院 Identification method, device, storage medium and equipment of pathogen genome co-evolution mutation cluster

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040241735A1 (en) * 1994-06-17 2004-12-02 Perlin Mark W. Method and system for genotyping
US20050181410A1 (en) * 2004-02-13 2005-08-18 Shaffer Lisa G. Methods and apparatuses for achieving precision genetic diagnoses
CN101914628A (en) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region
CN103617256A (en) * 2013-11-29 2014-03-05 北京诺禾致源生物信息科技有限公司 Method and device for processing file needing mutation detection
CN105349617A (en) * 2014-08-19 2016-02-24 复旦大学 High-throughput RNA sequencing data quality control method and high-throughput RNA sequencing data quality control apparatus
CN105404793A (en) * 2015-12-07 2016-03-16 浙江大学 Method for rapidly discovering phenotype related gene based on probabilistic framework and resequencing technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103080333B (en) * 2010-09-14 2015-06-24 深圳华大基因科技服务有限公司 Methods and systems for detecting genomic structure variations
US9411937B2 (en) * 2011-04-15 2016-08-09 Verinata Health, Inc. Detecting and classifying copy number variation
WO2013040583A2 (en) * 2011-09-16 2013-03-21 Complete Genomics, Inc Determining variants in a genome of a heterogeneous sample
EP4148739A1 (en) * 2012-01-20 2023-03-15 Sequenom, Inc. Diagnostic processes that factor experimental conditions

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110079589A (en) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 A kind of accurate method for obtaining structure variation within the scope of full-length genome
CN110592208A (en) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110592208B (en) * 2019-10-08 2022-05-03 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN111445950A (en) * 2020-03-19 2020-07-24 西安交通大学 High-fault-tolerance genome complex structure variation detection method based on filtering strategy
CN115910197A (en) * 2021-12-29 2023-04-04 上海智峪生物科技有限公司 Gene sequence processing method, gene sequence processing device, storage medium and electronic equipment
CN115910197B (en) * 2021-12-29 2024-03-22 上海智峪生物科技有限公司 Gene sequence processing method, device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109074429A (en) 2018-12-21
CN109074429B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
WO2017181368A1 (en) Method, device and terminal for detecting genome variations
Zou et al. Nonparametric maximum likelihood approach to multiple change-point problems
Kuo et al. Illuminating the dark side of the human transcriptome with long read transcript sequencing
CN109022553B (en) Genetic chip for Tumor mutations cutting load testing and preparation method thereof and device
Walker et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement
CN107423578B (en) Device for detecting somatic cell mutation
Lopes et al. A combined functional annotation score for non-synonymous variants
Rumble et al. SHRiMP: accurate mapping of short color-space reads
WO2017143585A1 (en) Method and apparatus for assembling separated long fragment sequences
CN108121897B (en) Genome variation detection method and detection device
US9059850B2 (en) Data alignment over multiple physical lanes
CN110268072B (en) Method and system for determining paralogous genes
WO2018218787A1 (en) Third-generation sequencing sequence correction method based on local graph
CN110310702B (en) Method, device and storage medium for repairing genome sequencing assembly result
CN115631789A (en) Pangenome-based group joint variation detection method
Shukla et al. hg19KIndel: ethnicity normalized human reference genome
CN117079720B (en) Processing method and device for high-throughput sequencing data
US20150142328A1 (en) Calculation method for interchromosomal translocation position
CN116469465A (en) Method for reducing single base substitution sequencing error rate in high-throughput sequencing, low-frequency mutation detection method and electronic device
CN112397148A (en) Sequence comparison method, sequence correction method and device thereof
WO2021184178A1 (en) Labeling method and apparatus
CN109504751A (en) A kind of the deletion mutation identification and colony count method of tumour complexity clonal structure
Jiang et al. Long-read based novel sequence insertion detection with rCANID
Van der Borght et al. QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles
CN102263791A (en) Method and system for checking resource files

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16898955; Country of ref document: EP; Kind code of ref document: A1)
122 EP: PCT application non-entry in European phase (Ref document number: 16898955; Country of ref document: EP; Kind code of ref document: A1)