WO2017181368A1 - Method, device and terminal for detecting genome variations

Info

Publication number: WO2017181368A1
Application number: PCT/CN2016/079745
Authority: WO
Inventors: 何俊; 张旸; 张洪波
Original assignee: 华为技术有限公司
Priority date: 2016-04-20
Filing date: 2016-04-20
Publication date: 2017-10-26
Also published as: CN109074429A; CN109074429B

Abstract

Provided are a method, a device and a terminal for detecting genome variations, wherein the method for detecting genome variations comprises: respectively performing pairwise sequence alignment between multiple sequencing sequences of a genome and a reference sequence, and obtaining a pairwise sequence alignment result (201); determining a potential variable region of the genome according to the pairwise sequence alignment result (202); extracting a sequencing sequence fragment from all the sequencing sequences according to the potential variable region (203); extracting a reference sequence fragment from the reference sequence according to the potential variable region (204); performing multiple sequence alignment between the reference sequence fragment and all the sequencing sequence fragments, and obtaining a multiple sequence alignment result (205); and determining a variation detection result of the genome according to the multiple sequence alignment result (206). As for performing multiple sequence alignments between the reference sequence fragment and all the sequencing sequence fragments, the sequencing sequence fragments of the same variation type can be aggregated for the alignment, so as to improve the accuracy of the detection result of genome variations.

Description

Genomic variation detection method, device and terminal

Technical field

The present application relates to the field of bioinformatics technology, and in particular, to a method, device and terminal for detecting genomic variation.

Background

At the molecular level, genomic variation refers to changes in the base-pair composition or order of the genome, including SNPs (single nucleotide polymorphisms) and indels (short insertions/deletions). As the cost of genome sequencing continues to decline, the volume of genome sequencing data produced by high-throughput sequencers is growing explosively, but obtaining high-quality genomic variation results from that data remains challenging work.

Traditional genomic variation detection is usually based on a reference sequence of the genome: each of the multiple sequencing sequences of the genome is aligned with the reference sequence by double sequence alignment (pairwise alignment), giving a double sequence alignment result for each sequencing sequence and the reference sequence. This result includes detailed information such as matches, mismatches, insertions, and deletions of the sequencing sequence relative to the reference sequence. The variation detection result of the genome is then determined from the alignment results of all the sequencing sequences with the reference sequence. Here, the reference sequence is the base sequence of the genome when no variation has occurred, and a sequencing sequence is a base sequence of the genome to be detected.

However, in the process of implementing the present application, the applicant found at least the following problem in the prior art: traditional genomic variation detection only performs double sequence alignment of each sequencing sequence with the reference sequence and determines the variation detection result from those double sequence alignment results. Because the alignment of individual sequencing sequences can be inaccurate, a single type of variation in the sequencing sequences is easily misaligned as several different types of variation, which makes the genomic variation detection result inaccurate.

Summary of the invention

The present application provides a method, device and terminal for detecting genomic variation to solve the problem of inaccurate detection results of genomic variation in the prior art.

In a first aspect, an embodiment of the present application provides a method for detecting genomic variation, which comprises: performing double sequence alignment of each of a plurality of sequencing sequences of a genome with a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and a sequencing sequence is a base sequence of the genome to be detected; determining a potential variation region of the genome according to the double sequence alignment result, the potential variation region being a base coding interval of the genome in which a variation potentially occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining the variation detection result of the genome according to the multiple sequence alignment result. With this implementation, sequencing sequence fragments with the same variation type can be clustered and aligned together, so the alignment of the sequencing sequences is more accurate. This avoids misaligning one type of variation as several different types of variation and thereby improves the accuracy of the genomic variation detection result.

In combination with the first aspect, in a first possible implementation of the first aspect, after performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain the multiple sequence alignment result, the method further includes: determining the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; clustering all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type; performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster; performing double sequence alignment of each feature sequence with the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment. In this implementation, the feature sequence is first corrected by double sequence alignment of the feature sequence with the reference sequence fragment, and the sequencing sequence fragments corresponding to the feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that some sequencing sequence fragments are offset relative to the reference sequence fragment in the multiple sequence alignment result, and improves the accuracy of the genomic variation detection result.
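The clustering and correction steps of this implementation can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the representation of a variation type as a `(kind, position, bases)` tuple, and the `feature_types` mapping (standing in for the result of realigning each cluster's feature sequence against the reference sequence fragment) are all assumptions made for illustration.

```python
from collections import defaultdict

def cluster_by_variation_type(fragment_types):
    """Group fragment ids so that each cluster holds one variation type.

    fragment_types: dict mapping a fragment id to its variation type as read
    from the multiple sequence alignment (hypothetical tuple representation).
    """
    clusters = defaultdict(list)
    for fragment_id, variation_type in fragment_types.items():
        clusters[variation_type].append(fragment_id)
    return dict(clusters)

def correct_fragment_types(clusters, feature_types):
    """Give every fragment the variation type of its cluster's feature sequence.

    feature_types: dict mapping each cluster key to the variation type obtained
    by realigning that cluster's feature sequence (assumed to be computed elsewhere).
    """
    corrected = {}
    for cluster_key, fragment_ids in clusters.items():
        for fragment_id in fragment_ids:
            corrected[fragment_id] = feature_types[cluster_key]
    return corrected

# Illustrative data: r3 was aligned with its deletion shifted by two bases.
fragment_types = {
    "r1": ("del", 5, "AT"),
    "r2": ("del", 5, "AT"),
    "r3": ("del", 7, "AT"),
}
clusters = cluster_by_variation_type(fragment_types)
# Assumed outcome of realigning both feature sequences to the reference fragment.
feature_types = {
    ("del", 5, "AT"): ("del", 5, "AT"),
    ("del", 7, "AT"): ("del", 5, "AT"),
}
corrected = correct_fragment_types(clusters, feature_types)
```

In this example the offset fragment `r3` ends up with the same variation type as `r1` and `r2`, which is the effect the correction step is described as achieving.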

In conjunction with the first possible implementation of the first aspect, in a second possible implementation of the first aspect, after union processing is performed on all the sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each sequencing sequence cluster, the method further comprises: performing double sequence alignment on any two of the obtained feature sequences; determining whether the overlapping regions of the two feature sequences match exactly, with the variation position of at least one of the feature sequences lying entirely within the overlapping region; and, when the overlapping regions of the two feature sequences match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster. With this implementation, further merging the sequencing sequence clusters that satisfy the merge condition increases the length of the feature sequence, thereby improving the accuracy of the double sequence alignment of the feature sequence with the reference sequence fragment.
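The merge condition can be sketched as follows. This is a simplified illustration under stated assumptions rather than the claimed procedure: each feature sequence is assumed to be already placed on a shared coordinate axis (so the overlap is found from start offsets instead of by a real double sequence alignment), the variation position is assumed to be a half-open span `(start, end)` on that axis, and the function name and dictionary keys are hypothetical.

```python
def try_merge(a, b):
    """Merge two feature sequences if their overlap matches exactly.

    a, b: dicts with keys 'start' (offset on a shared axis), 'seq' (bases)
    and 'var' (half-open span of the variation on the same axis).
    Returns (start, merged_sequence), or None when the condition fails.
    """
    a_end = a["start"] + len(a["seq"])
    b_end = b["start"] + len(b["seq"])
    lo, hi = max(a["start"], b["start"]), min(a_end, b_end)
    if lo >= hi:
        return None  # no overlapping region at all
    overlap_a = a["seq"][lo - a["start"]: hi - a["start"]]
    overlap_b = b["seq"][lo - b["start"]: hi - b["start"]]
    if overlap_a != overlap_b:
        return None  # overlapping regions do not match exactly

    def inside(span):
        return lo <= span[0] and span[1] <= hi

    if not (inside(a["var"]) or inside(b["var"])):
        return None  # neither variation lies entirely within the overlap
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    first_end = first["start"] + len(first["seq"])
    second_end = second["start"] + len(second["seq"])
    merged = first["seq"]
    if second_end > first_end:
        # Union: append only the part of the second sequence past the overlap.
        merged += second["seq"][first_end - second["start"]:]
    return first["start"], merged

a = {"start": 0, "seq": "ACGTTG", "var": (3, 5)}
b = {"start": 2, "seq": "GTTGCA", "var": (3, 5)}
merged = try_merge(a, b)
mismatch = try_merge(a, {"start": 2, "seq": "GATGCA", "var": (3, 5)})
```

Here the two sequences share the overlap `GTTG`, so the merged feature sequence is longer than either input, while the pair with a differing overlap is left unmerged.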

In combination with the first aspect, in a third possible implementation of the first aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; determining the variation types of all the sequencing sequences according to the double sequence alignment result; counting, for each coding interval in turn, the probability distribution values of the sequencing sequences of the different variation types within that coding interval; calculating the information entropy of each coding interval according to the probability distribution values; determining in turn whether the information entropy of each coding interval is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region. With this implementation, the potential variation region of the genome is determined by information entropy.
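A minimal sketch of the entropy test follows. The text here does not give the entropy formula, so the sketch assumes the standard Shannon entropy in bits, H = -Σ p·log2(p), over the observed variation types in an interval; treating "match" as one of the types, the function names, and the threshold value are likewise assumptions for illustration.

```python
import math
from collections import Counter

def interval_entropy(variation_types):
    """Shannon entropy (bits) of the variation types seen in one coding interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def potential_regions_by_entropy(types_per_interval, first_threshold):
    """Return the intervals whose entropy is greater than the first threshold."""
    return [
        interval
        for interval, types in types_per_interval.items()
        if interval_entropy(types) > first_threshold
    ]

# Intervals are hypothetical half-open (start, end) coordinates on the reference.
types_per_interval = {
    (0, 10): ["match"] * 10,                   # all reads agree
    (10, 20): ["match"] * 5 + ["del"] * 5,     # reads split between two types
}
regions = potential_regions_by_entropy(types_per_interval, first_threshold=0.5)
```

An interval in which all reads agree has zero entropy, while one in which the reads are split evenly between two types has an entropy of one bit, so only the second interval is flagged at this threshold.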

In combination with the first aspect, in a fourth possible implementation of the first aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; counting in turn the number of sequencing sequences in which a variation occurs within each coding interval; determining whether the number of sequencing sequences in which a variation occurs within each coding interval is greater than a second threshold; and, when the number of sequencing sequences in which a variation occurs within a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region. With this implementation, the potential variation region of the genome is determined by the number of sequencing sequences in which a variation occurs within the coding interval.
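The count-based test can be sketched in the same style. The data layout is an assumption: each sequencing sequence is represented only by the list of reference positions at which its double sequence alignment shows a variation, intervals are half-open `(start, end)` pairs, and the function names are hypothetical.

```python
def count_variant_reads(intervals, read_variant_positions):
    """Count, per interval, the reads with at least one variation inside it."""
    counts = {interval: 0 for interval in intervals}
    for positions in read_variant_positions:
        for start, end in intervals:
            if any(start <= p < end for p in positions):
                counts[(start, end)] += 1
    return counts

def potential_regions_by_count(counts, second_threshold):
    """Return the intervals whose variant-read count exceeds the second threshold."""
    return [interval for interval, n in counts.items() if n > second_threshold]

intervals = [(0, 10), (10, 20)]
# One list of variation positions per sequencing sequence (illustrative values).
read_variant_positions = [[3], [4, 12], [], [5]]
counts = count_variant_reads(intervals, read_variant_positions)
regions = potential_regions_by_count(counts, second_threshold=2)
```

A read with several variations in the same interval is counted once, since the criterion is the number of sequencing sequences rather than the number of variations.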

In combination with the first aspect, in a fifth possible implementation of the first aspect, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: extracting the portion of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.

In combination with the first aspect, in a sixth possible implementation of the first aspect, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: when a sequencing sequence intersects the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.

In combination with the first aspect, in a seventh possible implementation of the first aspect, extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the portion of the reference sequence that intersects the potential variation region as the reference sequence fragment.
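The two fragment extraction options and the reference fragment extraction can be sketched as below. The sketch assumes half-open coordinates on the reference and, for simplicity, reads aligned without insertions or deletions, so that read offsets map one-to-one onto reference positions; a real implementation would have to follow the alignment to map coordinates. The function names and the `clip` flag are hypothetical.

```python
def extract_read_fragment(read_start, read_seq, region_start, region_end, clip=True):
    """Extract a sequencing sequence fragment for a potential variation region.

    clip=True  returns only the part of the read inside the region.
    clip=False returns the whole read whenever it intersects the region.
    Returns None when the read does not intersect the region.
    """
    read_end = read_start + len(read_seq)
    lo, hi = max(read_start, region_start), min(read_end, region_end)
    if lo >= hi:
        return None
    if clip:
        return read_seq[lo - read_start: hi - read_start]
    return read_seq

def extract_reference_fragment(reference, region_start, region_end):
    """Extract the part of the reference sequence inside the region."""
    return reference[region_start:region_end]

reference = "ACGTACGTAC"
region = (3, 7)
ref_fragment = extract_reference_fragment(reference, *region)
clipped = extract_read_fragment(1, "CGTAC", *region, clip=True)
whole = extract_read_fragment(1, "CGTAC", *region, clip=False)
outside = extract_read_fragment(8, "AC", *region)
```

Clipping keeps the fragments the same width as the reference fragment, whereas keeping whole reads preserves the flanking bases; which is preferable is not stated in this section.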

In combination with the first aspect, in an eighth possible implementation of the first aspect, determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining a variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; grouping all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position; determining in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that sequencing sequence set is the variation detection result of the genome.
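The grouping and thresholding at a single variation position can be sketched as follows. The encoding of variation information as strings such as `"A>G"`, the function name, and the threshold value are assumptions for illustration; the input is assumed to hold one entry per sequencing sequence fragment covering the position.

```python
from collections import Counter

def call_variation(fragment_variation_info, third_threshold):
    """Return the variation information supported by more fragments than the threshold.

    fragment_variation_info: one item per fragment, giving its variation
    information at the variation position (hypothetical string encoding).
    """
    set_sizes = Counter(fragment_variation_info)  # one set per distinct value
    return [info for info, size in set_sizes.items() if size > third_threshold]

# Three fragments support A>G; a single fragment supports A>T.
observed = ["A>G", "A>G", "A>T", "A>G"]
calls = call_variation(observed, third_threshold=2)
```

With this threshold only `A>G` is reported, and the singly supported `A>T` is dropped, which is consistent with the threshold acting as a filter on weakly supported variation information.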

In a second aspect, an embodiment of the present application further provides a genomic variation detecting apparatus, which comprises: a first double sequence alignment unit, configured to perform double sequence alignment of each of a plurality of sequencing sequences of a genome with a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and a sequencing sequence is a base sequence of the genome to be detected; a potential variation region determining unit, configured to determine a potential variation region of the genome according to the double sequence alignment result, the potential variation region being a base coding interval of the genome in which a variation potentially occurs; a sequencing sequence fragment extraction unit, configured to extract sequencing sequence fragments from all the sequencing sequences according to the potential variation region; a reference sequence fragment extraction unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region; a multiple sequence alignment unit, configured to perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and a variation detection result determining unit, configured to determine the variation detection result of the genome according to the multiple sequence alignment result.

With reference to the second aspect, in a first possible implementation of the second aspect, the apparatus further includes: a variation type determining unit, configured to determine the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; a sequencing sequence clustering unit, configured to cluster all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type; a union processing unit, configured to perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster; a second double sequence alignment unit, configured to perform double sequence alignment of each feature sequence with the reference sequence fragment to obtain the variation type of each feature sequence; and a correction unit, configured to correct the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.

With reference to the second aspect, in a second possible implementation of the second aspect, the apparatus further includes: a third double sequence alignment unit, configured to perform double sequence alignment on any two of the obtained feature sequences of the sequencing sequence clusters; an overlapping region determining unit, configured to determine whether the overlapping regions of the two feature sequences match exactly, with the variation position of at least one of the feature sequences lying entirely within the overlapping region; and a merging unit, configured to, when the overlapping regions of the two feature sequences match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.

With reference to the second aspect, in a third possible implementation of the second aspect, the potential variation region determining unit includes: a first coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome; a variation type determining subunit, configured to determine the variation types of all the sequencing sequences according to the double sequence alignment result; a probability distribution value statistics subunit, configured to count, for each coding interval in turn, the probability distribution values of the sequencing sequences of the different variation types within that coding interval; an information entropy calculation subunit, configured to calculate the information entropy of each coding interval according to the probability distribution values; a first threshold determining subunit, configured to determine in turn whether the information entropy of each coding interval is greater than a first threshold; and a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of that coding interval is greater than the first threshold.

With reference to the second aspect, in a fourth possible implementation of the second aspect, the potential variation region determining unit includes: a second coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to the base coding order of the genome; a variation count statistics subunit, configured to count in turn the number of sequencing sequences in which a variation occurs within each coding interval; a second threshold determining subunit, configured to determine whether the number of sequencing sequences in which a variation occurs within each coding interval is greater than a second threshold; and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences in which a variation occurs within that coding interval is greater than the second threshold.

With reference to the second aspect, in a fifth possible implementation of the second aspect, the sequencing sequence fragment extraction unit is specifically configured to extract the portion of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.

With reference to the second aspect, in a sixth possible implementation of the second aspect, the sequencing sequence fragment extraction unit is specifically configured to: when the intersection determining subunit determines that a sequencing sequence intersects the potential variation region, extract that sequencing sequence as the sequencing sequence fragment.

With reference to the second aspect, in a seventh possible implementation of the second aspect, the reference sequence fragment extraction unit is specifically configured to extract the portion of the reference sequence that intersects the potential variation region as the reference sequence fragment.

With reference to the second aspect, in an eighth possible implementation of the second aspect, the variation detection result determining unit includes: a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; a sequencing sequence set grouping subunit, configured to group all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position; a third threshold determining subunit, configured to determine in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and a variation detection result determining subunit, configured to, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determine that the variation information of the sequencing sequence fragments in that sequencing sequence set is the variation detection result of the genome.

In a third aspect, an embodiment of the present application further provides a genomic variation detecting terminal, the terminal comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the following steps: performing double sequence alignment of each of a plurality of sequencing sequences of a genome with a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and a sequencing sequence is a base sequence of the genome to be detected; determining a potential variation region of the genome according to the double sequence alignment result, the potential variation region being a base coding interval of the genome in which a variation potentially occurs; extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region; extracting a reference sequence fragment from the reference sequence according to the potential variation region; performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result; and determining the variation detection result of the genome according to the multiple sequence alignment result.

With reference to the third aspect, in a first possible implementation of the third aspect, after performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain the multiple sequence alignment result, the steps further include: determining the variation type of every sequencing sequence fragment according to the multiple sequence alignment result; clustering all the sequencing sequence fragments into at least one sequencing sequence cluster according to their variation types, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type; performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a feature sequence of each sequencing sequence cluster; performing double sequence alignment of each feature sequence with the reference sequence fragment to obtain the variation type of each feature sequence; and correcting the multiple sequence alignment result according to the variation type of each feature sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.

With reference to the third aspect, in a second possible implementation of the third aspect, after union processing is performed on all the sequencing sequence fragments in each sequencing sequence cluster to obtain the feature sequence of each sequencing sequence cluster, the steps further include: performing double sequence alignment on any two of the obtained feature sequences; determining whether the overlapping regions of the two feature sequences match exactly, with the variation position of at least one of the feature sequences lying entirely within the overlapping region; and, when the overlapping regions of the two feature sequences match exactly and the variation position of at least one of the feature sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.

With reference to the third aspect, in a third possible implementation of the third aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; determining the variation types of all the sequencing sequences according to the double sequence alignment result; counting, for each coding interval in turn, the probability distribution values of the sequencing sequences of the different variation types within that coding interval; calculating the information entropy of each coding interval according to the probability distribution values; determining in turn whether the information entropy of each coding interval is greater than a first threshold; and, when the information entropy of a coding interval is greater than the first threshold, determining that the coding interval is a potential variation region.

With reference to the third aspect, in a fourth possible implementation of the third aspect, determining the potential variation region of the genome according to the double sequence alignment result includes: dividing the genome into a plurality of coding intervals according to the base coding order of the genome; counting in turn the number of sequencing sequences in which a variation occurs within each coding interval; determining whether the number of sequencing sequences in which a variation occurs within each coding interval is greater than a second threshold; and, when the number of sequencing sequences in which a variation occurs within a coding interval is greater than the second threshold, determining that the coding interval is a potential variation region.

With reference to the third aspect, in a fifth possible implementation of the third aspect, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: extracting the portion of each sequencing sequence that intersects the potential variation region as the sequencing sequence fragment.

With reference to the third aspect, in a sixth possible implementation of the third aspect, extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region includes: when a sequencing sequence intersects the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.

With reference to the third aspect, in a seventh possible implementation of the third aspect, extracting the reference sequence fragment from the reference sequence according to the potential variation region includes: extracting the portion of the reference sequence that intersects the potential variation region as the reference sequence fragment.

With reference to the third aspect, in an eighth possible implementation of the third aspect, determining the variation detection result of the genome according to the multiple sequence alignment result includes: determining a variation position in the potential variation region according to the multiple sequence alignment result; extracting, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; grouping all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position; determining in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold; and, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining that the variation information of the sequencing sequence fragments in that sequencing sequence set is the variation detection result of the genome.

In a fourth aspect, an embodiment of the present application further provides a storage medium, wherein the storage medium may store a program which, when executed, may perform some or all of the steps in each embodiment of the genomic variation detection method provided by the present application.

The genomic variation detection method, device and terminal provided by the embodiments of the present application perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result, and determine the variation detection result of the genome according to the multiple sequence alignment result. Because multiple sequence alignment tends to align the more similar sequences first, placing the reference sequence fragment and all the sequencing sequence fragments together for multiple sequence alignment allows sequencing sequence fragments with the same variation type to be clustered and aligned together. The alignment of the sequencing sequence fragments is therefore more accurate, which avoids misaligning one type of variation as several different types of variation and thereby improves the accuracy of the genomic variation detection result.

Brief description of the drawings

In order to illustrate the technical solutions of the present application more clearly, the drawings used in the embodiments are briefly described below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1A is a schematic diagram of a double sequence alignment state of sequencing sequences and a reference sequence according to an embodiment of the present application;

FIG. 1B is a schematic diagram of the alignment state after the sequencing sequences of FIG. 1A are aligned and corrected according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for detecting genomic variation according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the coding interval division of a genome according to an embodiment of the present application;

FIG. 4A is a schematic diagram of a process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;

FIG. 4B is a schematic diagram of another process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;

FIG. 5A is a schematic diagram of a multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;

FIG. 5B is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of another method for detecting genomic variation according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a clustering result of sequencing sequence clusters according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a union process according to an embodiment of the present application;

FIG. 9A is a schematic diagram of the feature sequences obtained by performing the union process on the sequencing sequence clusters in FIG. 7 according to an embodiment of the present application;

FIG. 9B is a schematic diagram of the double sequence alignment result obtained by performing double sequence alignment on the feature sequences of FIG. 9A and the reference sequence fragment according to an embodiment of the present application;

FIG. 9C is a schematic diagram of the corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation types of the feature sequences in FIG. 9B according to an embodiment of the present application;

FIG. 10A is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application;

FIG. 10B is a schematic diagram of the feature sequences obtained by performing the union process on the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application;

FIG. 11 is a schematic flowchart of another method for detecting genomic variation according to an embodiment of the present application;

FIG. 12A is a schematic diagram of a process of merging the feature sequences in FIG. 10B according to an embodiment of the present application;

FIG. 12B is a schematic diagram of a process of merging the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application;

FIG. 13 is a schematic structural diagram of a first genomic variation detection apparatus according to an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a second genomic variation detection apparatus according to an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a third genomic variation detection apparatus according to an embodiment of the present application;

FIG. 16 is a schematic structural diagram of a genomic variation detection terminal according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

FIG. 1A is a schematic diagram of a double sequence alignment state of sequencing sequences and a reference sequence according to an embodiment of the present application, and FIG. 1B is a schematic diagram of the alignment state after the sequencing sequences of FIG. 1A are aligned and corrected. In FIG. 1A and FIG. 1B, the two identical base sequences represent the reference sequence and the dashed lines represent the sequencing sequences; mismatches and deletions of a sequencing sequence relative to the reference sequence are indicated by the base letters of the sequencing sequence and by dots, respectively.

Comparing FIG. 1A with FIG. 1B: in FIG. 1A, some of the sequencing sequences show both a G->A mismatch (base G mismatched to base A) and an A->G mismatch (base A mismatched to base G), while other sequencing sequences show a TTTG deletion (deletion of the base segment TTTG). In FIG. 1B, after the sequencing sequences are aligned with one another, the sequencing sequences that appeared to carry both G->A and A->G are confirmed to carry the TTTG deletion. That is, in FIG. 1A, because the sequencing sequences are not aligned with one another, some sequencing sequences carrying the TTTG deletion are erroneously aligned as sequencing sequences carrying G->A and A->G. One variation type is thus erroneously aligned as different variation types, which easily leads to an inaccurate genomic variation detection result when the variation types are subsequently counted.

To improve the accuracy of the genomic variation detection result, the genomic variation detection method, apparatus and terminal provided by the embodiments of the present application perform multiple sequence alignment on the reference sequence fragment together with all the sequencing sequence fragments. Because multiple sequence alignment tends to align sequences of higher similarity first, sequencing sequence fragments of the same variation type are brought together and aligned with one another. The sequencing sequence fragments are therefore aligned more accurately, which avoids erroneously aligning one variation type as different variation types and improves the accuracy of the genomic variation detection result.

FIG. 2 is a schematic flowchart of a method for detecting genomic variation according to an embodiment of the present application. The method includes the following steps:

Step 201: Perform double sequence alignment between each of multiple sequencing sequences of the genome and the reference sequence to obtain a double sequence alignment result.

In the embodiments of the present application, the reference sequence is the base sequence of the genome when no variation has occurred and represents the correct order of the bases in the genome, while a sequencing sequence is a base sequence of the genome to be detected. The reference sequence can therefore be used as the benchmark for judging the variation of a sequencing sequence: when the base order of a sequencing sequence is consistent with that of the reference sequence, the sequencing sequence has no variation; when it is inconsistent, the sequencing sequence has a variation. The variations of a sequencing sequence mainly include base mismatches, insertions and deletions.

A sequencing sequence is usually a short sequence fragment. The more sequencing sequences there are, the more raw data is obtained for genomic variation detection, the more data is available for statistical analysis in subsequent steps, and the more accurate the genomic variation detection result is. By performing double sequence alignment between each sequencing sequence and the reference sequence, each sequencing sequence can be positioned at the corresponding position of the reference sequence, and detailed information about the variation of each sequencing sequence relative to the reference sequence, including match, mismatch, insertion and deletion information, can be obtained.

Step 202: Determine a potential variation region of the genome according to the double sequence alignment result.

In the field of genome detection, in order to locate bases in a genome, a code is assigned to each base in the genome. A single code represents one base pair in the genome, and a continuous coding interval represents a base segment in the genome.

In the embodiments of the present application, the genome is first divided into multiple coding intervals in coding order, and it is then determined for each coding interval in turn whether the interval is a potential variation region. This performs an initial screening of the genomic variation positions and thereby improves detection efficiency.

In an optional embodiment of the present application, the genome is divided into continuous, equal-length coding intervals, and it is determined for each coding interval in turn, in coding order, whether the interval is a potential variation region, until the entire genome has been traversed, so that no region is missed. The length of the coding interval may be adjusted according to actual needs. For example, any length in the range of 50-300 bp (bp denotes base pairs) may be selected, which is not limited in this application.

FIG. 3 is a schematic diagram of the coding interval division of a genome according to an embodiment of the present application. Because the reference sequence is the base sequence of the genome when no variation has occurred, the codes of the base pairs in the reference sequence are the codes of the genome, so the coding intervals of the reference sequence can be used to represent the coding intervals of the genome. As shown in FIG. 3, the genome is divided along its coding order into coding intervals of length 100 bp, forming in turn the first coding interval (1510531, 1510630), the second coding interval (1510631, 1510730), the third coding interval (1510731, 1510830), the fourth coding interval (1510831, 1510930), and so on.
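The equal-length division described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `coding_intervals` and the choice to represent an interval as an inclusive `(start, end)` tuple are assumptions made here, not part of the application.

```python
def coding_intervals(start, end, length=100):
    """Yield consecutive, inclusive (start, end) coding intervals covering [start, end].

    `length` is the interval length in bp; the last interval is shortened
    if the covered range is not a multiple of `length`.
    """
    pos = start
    while pos <= end:
        yield (pos, min(pos + length - 1, end))
        pos += length


# Reproduces the four 100 bp coding intervals listed for FIG. 3.
intervals = list(coding_intervals(1510531, 1510930, length=100))
print(intervals)
```

Each interval produced this way would then be tested in turn to decide whether it is a potential variation region.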

After the coding interval is divided, it is determined in turn whether each coding interval is a potential variation region, and the potential variation region of the genome is selected in all coding intervals. It should be noted that, in a genome, the number of potentially mutated regions may be one or more than one, and this application does not limit this.

There are various methods for judging whether a coding interval is a potential variation region. For example, information entropy reflects the degree of disorder of the sequences: the larger the information entropy, the more disordered the sequences and the more likely the sequencing sequences are to carry variations. Therefore, in one possible implementation of the present application, the potential variation region is determined by information entropy. As another example, the more sequencing sequences with variations there are in a coding interval, the more likely the coding interval is a potential variation region. Therefore, in another possible implementation of the present application, the potential variation region is determined by the number of sequencing sequences with variations in the coding interval.

The method for determining a potential variation region by information entropy is as follows.

First, the variation types of all the sequencing sequences are determined according to the double sequence alignment result. Because the double sequence alignment result contains detailed match, mismatch, insertion and deletion information of each sequencing sequence relative to the reference sequence, the variation type of a sequencing sequence can be determined directly from it. Here, sequencing sequences of the same variation type are sequencing sequences that carry exactly the same variation information relative to the reference sequence; having no variation also counts as one variation type.

After the variation types of all the sequencing sequences are determined, the probability distribution values of the sequencing sequences of the different variation types are calculated from the variation type information. Specifically, for each variation type in the coding interval, the ratio of the number of sequencing sequences of that variation type to the total number of sequencing sequences is calculated in turn, giving the probability distribution values of the sequencing sequences of the different variation types, denoted p_i.

For example, if there are two variation types in the coding interval, a first variation type and a second variation type, the numbers of sequencing sequences corresponding to the first variation type and the second variation type are counted separately. The number of sequencing sequences corresponding to the first variation type is divided by the total number of sequencing sequences to obtain the probability value p_1 of the first variation type, and the number of sequencing sequences corresponding to the second variation type is divided by the total number of sequencing sequences to obtain the probability value p_2 of the second variation type. Here p_1 and p_2 are the probability distribution values of the sequencing sequences of the different variation types within the coding interval.

The information entropy of the coding interval is then calculated from the probability distribution values. Specifically, the probability distribution values p_i are substituted into the information entropy formula H(U) = E[-log p_i] = -Σ_i p_i log p_i to obtain the information entropy H(U) of the coding interval.

It is determined whether the information entropy H(U) of the coding interval is greater than a preset first threshold, and when the information entropy H(U) is greater than the first threshold, the coding interval is determined as a potential variation region.
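The entropy test can be illustrated with a minimal Python sketch. Several things here are assumptions rather than statements of the application: the variation type of each sequencing sequence is represented by an arbitrary hashable label, the logarithm is taken to base 2 (the text does not state the base), and the threshold value 0.5 is purely hypothetical.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Information entropy H(U) = -sum(p_i * log2(p_i)) over the variation types.

    `variation_types` holds one label per sequencing sequence in the coding
    interval; base-2 logarithm is an assumption of this sketch.
    """
    counts = Counter(variation_types)
    total = len(variation_types)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_potential_region_by_entropy(variation_types, first_threshold):
    """Return True when the entropy of the interval exceeds the first threshold."""
    return interval_entropy(variation_types) > first_threshold


# Two variation types, each carried by half of the sequencing sequences:
# p_1 = p_2 = 0.5, so H(U) is 1 bit.
labels = ["no variation"] * 5 + ["TTTG deletion"] * 5
print(interval_entropy(labels))
print(is_potential_region_by_entropy(labels, first_threshold=0.5))  # hypothetical threshold
```

An interval in which every sequencing sequence has the same variation type has zero entropy and is never selected by this test, whatever the non-negative threshold.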

The method for determining a potential variation region by the number of sequencing sequences with variations in the coding interval is as follows.

First, the number of sequencing sequences with variations in the coding interval is counted. Any sequencing sequence that does not match the reference sequence perfectly is counted as a sequencing sequence with a variation, including sequencing sequences with mismatches, insertions or deletions.

Then, according to the count, it is determined whether the number of sequencing sequences with variations is greater than a second threshold; when it is, the coding interval is determined to be a potential variation region.

For example, in a possible implementation of the present application, if the second threshold is set to 50, the coding interval is determined to be a potential variation region when the number of sequencing sequences with variations in the coding interval is greater than 50; otherwise, the coding interval is not a potential variation region. A person skilled in the art can adjust the second threshold according to actual needs, which is not limited in this application.
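The counting test is simple enough to sketch directly. In the sketch below, the `AlignmentSummary` record and its fields are hypothetical stand-ins for whatever the double sequence alignment result actually provides; only the rule "not a perfect match counts as a variation" and the example threshold of 50 come from the text.

```python
from typing import NamedTuple


class AlignmentSummary(NamedTuple):
    """Hypothetical per-sequence summary of a double sequence alignment."""
    mismatches: int
    insertions: int
    deletions: int


def has_variation(summary):
    """A sequencing sequence has a variation unless it matches the reference perfectly."""
    return (summary.mismatches + summary.insertions + summary.deletions) > 0


def is_potential_region_by_count(summaries, second_threshold=50):
    """Return True when more than `second_threshold` sequencing sequences vary."""
    variant_count = sum(1 for s in summaries if has_variation(s))
    return variant_count > second_threshold


perfect = AlignmentSummary(0, 0, 0)
with_deletion = AlignmentSummary(0, 0, 4)
print(is_potential_region_by_count([with_deletion] * 51 + [perfect] * 10))
print(is_potential_region_by_count([with_deletion] * 50 + [perfect] * 10))
```

Note that the comparison is strict, so exactly 50 varying sequencing sequences do not qualify under a threshold of 50.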

Step 203: Extract the sequencing sequence fragments from all the sequencing sequences according to the potential variation region.

After the potential variation region is determined, the sequencing sequence fragments in the potential variation region need to be extracted from the sequencing sequences for analysis and processing in subsequent steps.

In the embodiments of the present application, to simplify the description of the extraction of sequencing sequence fragments, the coding interval of the part of the reference sequence that a sequencing sequence overlaps in the double sequence alignment result is taken as the coding interval of that sequencing sequence.

In a possible implementation of the present application, the intersection of each sequencing sequence with the potential variation region is extracted as a sequencing sequence fragment. For example, when the coding interval of a sequencing sequence lies completely within the coding interval of the potential variation region, the whole sequencing sequence is used as a sequencing sequence fragment; when the coding interval of the sequencing sequence partially intersects the coding interval of the potential variation region, the intersecting part of the sequencing sequence is extracted as a sequencing sequence fragment; when the two coding intervals do not intersect, the sequencing sequence is discarded.

FIG. 4A is a schematic diagram of a process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. FIG. 4A uses three different sequencing sequences as examples to illustrate the extraction of sequencing sequence fragments. The coding interval of the potential variation region is (1510531, 1510630), the coding interval of the first sequencing sequence is (1510541, 1510590), the coding interval of the second sequencing sequence is (1510521, 1510570), and the coding interval of the third sequencing sequence is (1510651, 1510700).

For the first sequencing sequence, the coding interval (1510541, 1510590) lies completely within the coding interval (1510531, 1510630) of the potential variation region, so the whole first sequencing sequence is extracted as a sequencing sequence fragment. For the second sequencing sequence, the coding interval (1510521, 1510570) partially intersects the coding interval (1510531, 1510630) of the potential variation region; the coding interval of the intersection is (1510531, 1510570), so the part of the second sequencing sequence in the coding interval (1510531, 1510570) is extracted as a sequencing sequence fragment. For the third sequencing sequence, the coding interval (1510651, 1510700) does not intersect the coding interval (1510531, 1510630) of the potential variation region, so the third sequencing sequence is discarded. The extracted sequencing sequence fragments are therefore the first sequencing sequence and the part of the second sequencing sequence in the coding interval (1510531, 1510570).
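The three cases of FIG. 4A reduce to an interval intersection, which the following Python sketch illustrates. The function name `intersect` and the dictionary layout are illustrative assumptions. The third coding interval is taken here as (1510651, 1510700), on the assumption that the third sequencing sequence is 50 bp long like the other two; the outcome for that sequence is the same for any end code, because its start already lies beyond the region.

```python
def intersect(read_interval, region):
    """Return the inclusive intersection of two coding intervals, or None if disjoint."""
    start = max(read_interval[0], region[0])
    end = min(read_interval[1], region[1])
    return (start, end) if start <= end else None


region = (1510531, 1510630)  # potential variation region of FIG. 4A
reads = {
    "first": (1510541, 1510590),   # completely inside the region
    "second": (1510521, 1510570),  # partially intersects the region
    "third": (1510651, 1510700),   # assumed end code; no intersection
}

# Coding interval of the fragment extracted from each sequencing sequence;
# None means the sequencing sequence is discarded.
fragments = {name: intersect(interval, region) for name, interval in reads.items()}
print(fragments)
```

The second sequencing sequence keeps only the codes 1510531 to 1510570, which is the truncation that the alternative implementation described next is designed to avoid.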

As the above embodiment shows, when the coding interval of a sequencing sequence partially intersects the coding interval of the potential variation region, the sequencing sequence is truncated during extraction, and only the part that intersects the potential variation region is extracted as a sequencing sequence fragment. Truncating a sequencing sequence destroys its integrity and loses part of its information, which affects the accuracy of the genomic variation detection result.

In another possible implementation of the present application, it is first determined whether each sequencing sequence intersects the potential variation region; when a sequencing sequence intersects the potential variation region, the whole sequencing sequence is extracted as a sequencing sequence fragment. This is equivalent to extending the potential variation region on the basis of the coding interval of the sequencing sequence when the two coding intervals partially overlap, which prevents the sequencing sequence from being truncated during extraction and preserves its integrity.

FIG. 4B is a schematic diagram of another process of extracting sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. The extraction process in FIG. 4B is largely the same as that in FIG. 4A. The difference lies in the second sequencing sequence: because its coding interval partially intersects the coding interval of the potential variation region, the union (1510521, 1510630) of the coding interval (1510521, 1510570) of the second sequencing sequence and the coding interval (1510531, 1510630) of the potential variation region is used as the coding interval of the expanded potential variation region, and the sequencing sequence fragment of the second sequencing sequence is then extracted within the expanded potential variation region. Because the coding interval of the second sequencing sequence now falls completely within the coding interval of the expanded potential variation region, the entire second sequencing sequence is extracted as a sequencing sequence fragment. In other words, in this implementation, whenever a sequencing sequence intersects the potential variation region, the entire sequencing sequence is extracted as a sequencing sequence fragment.

It should be noted that the above manner of expanding the potential variation region is only one specific implementation shown in the embodiments of the present application. A person skilled in the art may adjust it according to actual needs, and such adjustments shall all fall within the protection scope of the present application. For example, before the sequencing sequence fragments are extracted, the union of the coding interval of the potential variation region and the coding intervals of all the sequencing sequences in the potential variation region (including sequencing sequences that partially intersect the potential variation region and sequencing sequences that fall entirely within it) may be taken, and the result used as the coding interval of the expanded potential variation region.
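The expansion of the potential variation region can be sketched as follows, using the union over all intersecting sequencing sequences described in the preceding paragraph. The function names are illustrative assumptions, and the expansion is performed in a single pass against the original region, which is one reading of the text rather than a confirmed detail. The third coding interval again uses the assumed end code 1510700.

```python
def overlaps(a, b):
    """True when two inclusive coding intervals share at least one code."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def expand_region(region, read_intervals):
    """Union of the region with every sequencing sequence interval that intersects it.

    Returns the expanded region and the intervals of the sequencing sequences
    that are kept whole as sequencing sequence fragments.
    """
    kept = [iv for iv in read_intervals if overlaps(iv, region)]
    start = min([region[0]] + [iv[0] for iv in kept])
    end = max([region[1]] + [iv[1] for iv in kept])
    return (start, end), kept


region = (1510531, 1510630)
read_intervals = [(1510541, 1510590), (1510521, 1510570), (1510651, 1510700)]

expanded, kept = expand_region(region, read_intervals)
print(expanded)  # expanded potential variation region, as in FIG. 4B
print(kept)      # first and second sequencing sequences, extracted whole
```

Because every kept interval lies inside the expanded region by construction, no kept sequencing sequence is truncated.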

In the embodiment of the present application, since the sequencing sequence is not interrupted during the extraction process of the sequencing sequence fragment, the integrity of the sequencing sequence can be ensured, thereby improving the accuracy of the genomic variation detection.

Step 204: Extract a reference sequence segment from the reference sequence according to the potential variation region.

In the embodiments of the present application, the reference sequence fragment is extracted from the reference sequence according to the coding interval of the potential variation region. For example, in FIG. 4A, the coding interval of the potential variation region is (1510531, 1510630), and the part of the reference sequence in the coding interval (1510531, 1510630) is extracted as the reference sequence fragment; in FIG. 4B, the coding interval of the expanded potential variation region is (1510521, 1510630), and the part of the reference sequence in the coding interval (1510521, 1510630) is extracted as the reference sequence fragment.
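Extracting the reference sequence fragment amounts to converting an inclusive coding interval into a string slice. The sketch below assumes that the reference sequence is held as a Python string together with the code of its first base; the function name, the toy sequence and the toy codes are all hypothetical.

```python
def extract_reference_fragment(reference, reference_start, region):
    """Return the bases of `reference` that fall in the inclusive coding interval `region`.

    `reference_start` is the code of the first base of `reference`.
    """
    lo = region[0] - reference_start
    hi = region[1] - reference_start + 1  # +1 because the interval is inclusive
    return reference[lo:hi]


# Toy example: a 10-base reference whose first base has code 101.
toy_reference = "ACGTACGTAC"
fragment = extract_reference_fragment(toy_reference, 101, (103, 106))
print(fragment)
```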

Step 205: Perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result.

In the embodiment of the present application, the reference sequence fragment and all the sequenced sequence fragments are put together for multiple sequence alignment, and the process of multiple sequence alignment includes:

Establishing a distance matrix: the distance between every two sequences is calculated (including the distance between the reference sequence fragment and each sequencing sequence fragment, and the distance between any two sequencing sequence fragments), and a distance matrix of the pairwise distances is established;

Constructing a clustering tree: the two sequences with the smallest distance in the distance matrix are first clustered together and the distance matrix is updated; the two closest sequences or clusters of sequences in the updated distance matrix are then clustered together, and so on, until all the sequences are clustered together, giving a clustering tree of the reference sequence fragment and the sequencing sequence fragments;

Aligning the sequences: following the clustering hierarchy of the sequencing sequence fragments and the reference sequence fragment in the clustering tree, the two innermost sequences are aligned first, and alignment then proceeds outward through the hierarchy until all the sequencing sequence fragments and the reference sequence fragment are aligned.

In constructing the clustering tree, sequences of higher similarity are clustered together first according to the distances between the sequences (the distance between two sequences represents their similarity: the smaller the distance, the higher the similarity). Therefore, when the reference sequence fragment and all the sequencing sequence fragments are aligned together by multiple sequence alignment, sequencing sequence fragments of the same variation type are clustered together while their variation types are obtained, which avoids erroneously aligning one variation type as different variation types and thereby improves the accuracy of the genomic variation detection result.
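The first two stages, the distance matrix and the clustering tree, can be sketched in Python as below. The application does not specify the distance measure or how the matrix is updated after a merge, so this sketch assumes Levenshtein edit distance and average linkage between clusters; both are swapped-in choices for illustration. The final alignment stage is omitted, and the toy sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def build_clustering_tree(sequences):
    """Merge the closest clusters repeatedly; return nested tuples of sequence indices.

    Requires at least one sequence. Cluster distance is the average pairwise
    edit distance (an assumption of this sketch).
    """
    n = len(sequences)
    dist = {(i, j): edit_distance(sequences[i], sequences[j])
            for i, j in combinations(range(n), 2)}
    members = {i: (i,) for i in range(n)}  # cluster id -> sequence indices
    trees = {i: i for i in range(n)}       # cluster id -> subtree

    def cluster_distance(a, b):
        pairs = [(min(i, j), max(i, j)) for i in members[a] for j in members[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    next_id = n
    while len(members) > 1:
        a, b = min(combinations(sorted(members), 2),
                   key=lambda pair: cluster_distance(*pair))
        members[next_id] = members.pop(a) + members.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return next(iter(trees.values()))


# Index 0 is a toy reference fragment; 1 and 2 carry a TTTG deletion; 3 has no variation.
toy_fragments = ["ACGTTTGA", "ACGA", "ACGA", "ACGTTTGA"]
tree = build_clustering_tree(toy_fragments)
print(tree)
```

In this toy input the two fragments with the deletion end up in one subtree and the unvaried fragment joins the reference, which is the grouping behaviour the paragraph above relies on.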

FIG. 5A is a schematic diagram of a multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. In FIG. 5A, the sequencing sequence fragments have three different variation types. After all the sequencing sequence fragments and the reference sequence fragment are aligned together by multiple sequence alignment, the sequencing sequence fragments of each of the three variation types are clustered and aligned together. In addition, in a diploid or polyploid, the sequencing sequence fragments of the same haplotype usually have the same variation type. Aligning the reference sequence fragment and all the sequencing sequence fragments together therefore also clusters the sequencing sequence fragments that belong to the same haplotype, which makes it possible to detect the genomic variation of a diploid or polyploid.

Step 206: Determine a mutation detection result of the genome according to the multi-sequence alignment result.

Because the multiple sequence alignment result contains detailed variation information of the sequencing sequence fragments, including the variation positions and the mismatch, insertion or deletion information at those positions, the variation detection result of the genome can be determined according to the multiple sequence alignment result.

In the embodiments of the present application, a variation position in the potential variation region is first determined according to the multiple sequence alignment result. The variation information of all the sequencing sequence fragments at the variation position is then extracted from the multiple sequence alignment result, and all the sequencing sequence fragments are grouped into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position. It is then determined for each sequencing sequence set in turn whether the number of sequencing sequence fragments in the set is greater than a third threshold; when it is, the variation information of the sequencing sequence fragments in that set is determined to be a variation detection result of the genome.

For example, in FIG. 5A, the variation position in the potential variation region is determined to be 1510581 according to the multiple sequence alignment result. The variation information of all the sequencing sequence fragments at code 1510581 is extracted, and there are three kinds: no variation, insertion of the base segment CCT, and deletion of the base segment CCT. According to the variation information, all the sequencing sequence fragments are grouped into three sequencing sequence sets: a first sequencing sequence set (variation information: no variation; number of sequencing sequence fragments: 11), a second sequencing sequence set (variation information: insertion of the base segment CCT; number of sequencing sequence fragments: 7) and a third sequencing sequence set (variation information: deletion of the base segment CCT; number of sequencing sequence fragments: 8). It is then determined in turn whether the number of sequencing sequence fragments in each sequencing sequence set is greater than the third threshold.

If the third threshold is 6, the number of sequencing sequence fragments in each of the three sequencing sequence sets is greater than the third threshold, so the variation detection results of the genome at code 1510581 are: no variation; insertion of the base segment CCT; and deletion of the base segment CCT. This can also indicate that the variation results of the three haplotypes of a triploid at code 1510581 are, respectively: no variation; insertion of the base segment CCT; and deletion of the base segment CCT.

If the third threshold is 10, then only the number of sequencing sequence fragments in the first sequencing sequence set is greater than the third threshold in the above three sequencing sequence sets, so that the mutation detection result of the genome at code 1510581 is: there is no variation.
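The grouping and thresholding of Step 206 can be reproduced with the counts given for FIG. 5A. In this Python sketch the variation information of each sequencing sequence fragment at the variation position is represented by a plain string label, and the function name `call_variations` is hypothetical.

```python
from collections import Counter


def call_variations(variation_info_at_position, third_threshold):
    """Group fragments by variation information and keep the sufficiently large sets.

    `variation_info_at_position` holds one label per sequencing sequence
    fragment; a set is reported only if its size exceeds `third_threshold`.
    """
    set_sizes = Counter(variation_info_at_position)
    return [info for info, size in set_sizes.items() if size > third_threshold]


# Set sizes at code 1510581 as described for FIG. 5A: 11, 7 and 8 fragments.
observations = (["no variation"] * 11
                + ["CCT insertion"] * 7
                + ["CCT deletion"] * 8)

print(call_variations(observations, third_threshold=6))
print(call_variations(observations, third_threshold=10))
```

With a threshold of 6 all three sets are reported, and with a threshold of 10 only the set without variation remains, matching the two cases described in the text.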

It should be noted that the values of the third threshold above are only examples in the embodiments of the present application. A person skilled in the art may adjust the third threshold according to actual needs, and such adjustments shall all fall within the protection scope of the present application.

As the above embodiment shows, by performing multiple sequence alignment on the reference sequence fragment together with all the sequencing sequence fragments, sequencing sequence fragments of the same variation type are clustered and aligned together, so the sequencing sequence fragments are aligned accurately. This avoids erroneously aligning one variation type as different variation types and thereby improves the accuracy of the genomic variation detection result.

However, due to some defects in the construction of the clustering tree in the multi-sequence alignment process, there is a possibility that the sequencing sequence fragment has an overall offset with respect to the reference sequence fragment in the multiple sequence alignment result.

FIG. 5B is a schematic diagram of another multiple sequence alignment state of sequencing sequence fragments and a reference sequence fragment according to an embodiment of the present application. As shown in FIG. 5B, in the multiple sequence alignment result of the reference sequence fragment and the sequencing sequence fragments, although the sequencing sequence fragments of the same variation type have been clustered together, some of them are offset as a whole relative to the reference sequence fragment. Such an offset changes the variation type of the sequencing sequence fragments relative to the reference sequence fragment, which in turn affects the accuracy of the genomic variation detection. Therefore, after multiple sequence alignment is performed on the sequencing sequence fragments and the reference sequence fragment, the variation types of the sequencing sequence fragments relative to the reference sequence fragment need to be corrected.

FIG. 6 is a schematic flowchart of another method for detecting a genomic variation according to an embodiment of the present application. The method may further include the following steps after the step 205 on the basis of the embodiment shown in FIG. 2:

Step 601: Determine the variation types of all the sequencing sequence fragments according to the multiple sequence alignment result.

In the embodiment of the present application, after the reference sequence fragment and all the sequencing sequence fragments are put together for multiple sequence alignment, the sequencing sequence fragments having the same variation type are aligned together, and the variation type of every sequencing sequence fragment relative to the reference sequence fragment can be obtained, as shown in FIG. 5A and FIG. 5B. Since some of the sequencing sequence fragments in FIG. 5B have an overall offset with respect to the reference sequence fragment, the variation types of the offset sequencing sequence fragments in FIG. 5B need to be corrected in order to obtain the multiple sequence alignment result shown in FIG. 5A.

Step 602: Aggregate all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of all the sequencing sequence fragments.

In the embodiment of the present application, all the sequencing sequence fragments are classified according to their variation types, and the sequencing sequence fragments having the same variation type are aggregated into the same sequencing sequence cluster, so as to facilitate the correction of the variation types of the sequencing sequence fragments.

FIG. 7 is a schematic diagram of sequencing sequence clusters obtained by aggregation according to an embodiment of the present application. All the sequencing sequence fragments in the multiple sequence alignment result shown in FIG. 5B are aggregated into three sequencing sequence clusters according to their variation types. The sequencing sequence fragments in the first sequencing sequence cluster have no variation; those in the second sequencing sequence cluster have an insertion of the base segment CCT; and those in the third sequencing sequence cluster have a deletion of the base segment CGCCAG and a base mismatch.
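The aggregation in step 602 can be sketched as a simple grouping by variation type. This is only an illustrative sketch: the function name `cluster_by_variation_type`, the fragment identifiers, and the `(position, kind, bases)` encoding of a variation type (including the positions) are assumptions made for the example and are not specified by the present application.

```python
from collections import defaultdict


def cluster_by_variation_type(fragments):
    """Group fragment ids by variation type (hypothetical helper).

    `fragments` is a list of (fragment_id, variation_type) pairs, where
    variation_type is any hashable description of the variation.
    """
    clusters = defaultdict(list)
    for fragment_id, variation_type in fragments:
        clusters[variation_type].append(fragment_id)
    return dict(clusters)


# Hypothetical fragments loosely modelled on the three clusters of FIG. 7;
# the positions 10, 7 and 15 are invented for illustration.
NO_VARIATION = ()
INSERTION = ((10, "ins", "CCT"),)
DELETION_AND_MISMATCH = ((7, "del", "CGCCAG"), (15, "mismatch", "A"))

example_fragments = [
    ("read1", NO_VARIATION),
    ("read2", INSERTION),
    ("read3", NO_VARIATION),
    ("read4", INSERTION),
    ("read5", DELETION_AND_MISMATCH),
]
example_clusters = cluster_by_variation_type(example_fragments)
```

Using the full variation description as the grouping key means that two fragments fall into the same cluster only when every recorded variation is identical.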

Step 603: Perform a union process on all the sequenced fragments in each of the sequencing sequence clusters to obtain a characteristic sequence of each of the sequencing sequence clusters.

Since the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type relative to the reference sequence fragment, the overlapping coding intervals of any two sequencing sequence fragments in that cluster have the same base sequence. The overlapping coding intervals of all the sequencing sequence fragments in the cluster can therefore be combined to obtain the feature sequence of the cluster. The union process is exemplified below with reference to the accompanying drawings.

FIG. 8 is a schematic diagram of a union process according to an embodiment of the present application. FIG. 8 includes two sequencing sequence fragments, where the coding interval of the first sequencing sequence fragment is (1, 15) and the coding interval of the second sequencing sequence fragment is (4, 18). The overlapping coding interval (4, 15) of the two sequencing sequence fragments contains the same base sequence TCCCCTCCTCCT. The overlapping coding intervals of the two sequencing sequence fragments are combined, and the non-overlapping parts of the two fragments are used as the head and the tail of the feature sequence, respectively, so that the feature sequence GACTCCCCTCCTCCTCCT with the coding interval (1, 18) is obtained.

FIG. 9A is a schematic diagram of the feature sequences obtained by performing the union process on the sequencing sequence clusters in FIG. 7 according to an embodiment of the present application. All the sequencing sequence fragments in the first sequencing sequence cluster, the second sequencing sequence cluster, and the third sequencing sequence cluster in FIG. 7 are respectively subjected to the union process to obtain the corresponding first feature sequence, second feature sequence, and third feature sequence.
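The union process of FIG. 8 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `union_fragments` is hypothetical, coding intervals are treated as 1-based and inclusive, and the two fragment strings are reconstructed from the feature sequence and intervals given in the text rather than copied from the figure.

```python
def union_fragments(fragments):
    """Merge fragments of one cluster into a single feature sequence.

    `fragments` is a list of (start, end, bases) tuples with 1-based,
    inclusive coding intervals. Overlapping positions must carry the
    same base; otherwise the fragments do not belong to one cluster.
    """
    start = min(s for s, _, _ in fragments)
    end = max(e for _, e, _ in fragments)
    merged = [None] * (end - start + 1)
    for s, e, bases in fragments:
        if len(bases) != e - s + 1:
            raise ValueError("interval length does not match base count")
        for offset, base in enumerate(bases):
            idx = s - start + offset
            if merged[idx] is not None and merged[idx] != base:
                raise ValueError(f"conflicting bases at position {s + offset}")
            merged[idx] = base
    if None in merged:
        raise ValueError("fragments do not cover a contiguous interval")
    return start, end, "".join(merged)


# The two fragments of FIG. 8, reconstructed from the stated intervals.
first = (1, 15, "GACTCCCCTCCTCCT")
second = (4, 18, "TCCCCTCCTCCTCCT")
feature = union_fragments([first, second])
```

Raising an error on a conflicting base makes the precondition explicit: the union is only defined when the overlapping coding intervals agree.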

Step 604: Perform double sequence alignment on each of the feature sequences and the reference sequence segment to obtain a variation type of each of the feature sequences.

Since double sequence alignment can obtain the optimal alignment result of two sequences, the variation type of a feature sequence obtained by performing double sequence alignment between the feature sequence and the reference sequence fragment is the variation type of the feature sequence under the optimal alignment result. Based on this, in the subsequent step, the sequencing sequence fragments corresponding to the feature sequence can be corrected according to the variation type of the feature sequence.

If the sequencing sequence fragments in a sequencing sequence cluster are offset relative to the reference sequence fragment, the feature sequence obtained by performing the union process on those sequencing sequence fragments has the same offset, and performing double sequence alignment between the offset feature sequence and the reference sequence fragment corrects the feature sequence. That is to say, if the sequencing sequence fragments are offset, the variation type of the corresponding feature sequence changes after the feature sequence is double-sequence aligned with the reference sequence fragment; if the sequencing sequence fragments are not offset, the variation type of the corresponding feature sequence is unchanged after the alignment. Therefore, in the embodiment of the present application, whether the sequencing sequence fragments corresponding to a feature sequence need to be corrected can be determined according to whether the variation type of the feature sequence changes after the double sequence alignment.

FIG. 9B is a schematic diagram of the double sequence alignment of the feature sequences of FIG. 9A and the reference sequence fragment according to an embodiment of the present application. The first feature sequence, the second feature sequence, and the third feature sequence shown in FIG. 9A are respectively subjected to double sequence alignment with the reference sequence fragment, and the obtained alignment result is shown in FIG. 9B. Comparing FIG. 9A and FIG. 9B, after the double sequence alignment of the feature sequences and the reference sequence fragment, the variation types of the first feature sequence and the second feature sequence do not change, whereas the variation type of the third feature sequence changes. That is to say, the sequencing sequence fragments corresponding to the first feature sequence and the second feature sequence have already achieved the best alignment after the multiple sequence alignment, and no correction is needed; the sequencing sequence fragments corresponding to the third feature sequence have an overall offset relative to the reference sequence fragment, and further correction is required.
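The present application does not name a particular double (pairwise) sequence alignment algorithm. As one common choice, the sketch below uses Needleman-Wunsch global alignment with a linear gap penalty; the function name `global_align` and the scoring values (match 1, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_ref, aligned_query); '-' marks a gap.
    """
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two bases
                score[i - 1][j] + gap,      # base deleted from the query
                score[i][j - 1] + gap,      # base inserted in the query
            )
    # Trace back from the bottom-right cell to recover one optimal path.
    aligned_ref, aligned_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_ref.append(ref[i - 1])
                aligned_query.append(query[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_query.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_query.append(query[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_ref)), "".join(reversed(aligned_query))


# Toy sequences (not taken from the figures): the query lacks one T.
best_score, ref_row, query_row = global_align("ACGTACGT", "ACGACGT")
```

The gap columns in the returned rows give the variation type of the feature sequence under the optimal alignment, which is then compared with the variation type recorded in the multiple sequence alignment result.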

Step 605: Correct the multi-sequence alignment result according to the variation type of each of the feature sequences.

In the embodiment of the present application, the variation type of the sequencing sequence fragments corresponding to a feature sequence is corrected based on the variation type of the feature sequence; that is, the multiple sequence alignment result is corrected. Specifically, when the variation type of a feature sequence differs from the variation type of the corresponding sequencing sequence fragments, the variation type of those sequencing sequence fragments is adjusted to the variation type of the feature sequence, so that in the corrected multiple sequence alignment result the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that fragment.

For example, in FIG. 9B, after the double sequence alignment of the third feature sequence and the reference sequence fragment, the variation type of the third feature sequence changes, so that the variation type of the third feature sequence differs from the variation type of the sequencing sequence fragments in the third sequencing sequence cluster. Therefore, the variation type of the sequencing sequence fragments in the third sequencing sequence cluster needs to be adjusted according to the variation type of the third feature sequence.

FIG. 9C is a schematic diagram of the corrected multiple sequence alignment result obtained by correcting the multiple sequence alignment result according to the variation types of the feature sequences in FIG. 9B according to the embodiment of the present application, in which the variation type of the sequencing sequence fragments in the third sequencing sequence cluster is adjusted to the variation type of the third feature sequence.

As can be seen from the above embodiment, in the embodiment of the present application, the feature sequence is first corrected by the double sequence alignment of the feature sequence and the reference sequence fragment, and the sequencing sequence fragments corresponding to the feature sequence are then corrected according to the corrected feature sequence. This overcomes the problem that some sequencing sequence fragments are offset from the reference sequence fragment in the multiple sequence alignment result, and improves the accuracy of the genomic variation detection result.

In general, in double sequence alignment, the greater the difference in length between the two sequences, the more likely it is that several alignment results are possible, and thus the more likely it is that the double sequence alignment is wrong. Accordingly, in step 604 above, the longer the feature sequence is, the more accurate the result of its double sequence alignment with the reference sequence fragment is.
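The correction in step 605 can be sketched as a lookup that overwrites a fragment's variation type whenever it disagrees with the variation type of its cluster's feature sequence. The function name `correct_variation_types`, the dictionary layout, and the labels such as `"type_C_corrected"` are all hypothetical placeholders; the figures' actual variation types are not reproduced here.

```python
def correct_variation_types(fragment_types, fragment_cluster, feature_types):
    """Return (corrected_types, changed_fragment_ids).

    fragment_types:   fragment id -> variation type from the multiple
                      sequence alignment result
    fragment_cluster: fragment id -> cluster id
    feature_types:    cluster id -> variation type of the cluster's feature
                      sequence after double sequence alignment
    """
    corrected = {}
    changed = []
    for fragment_id, old_type in fragment_types.items():
        new_type = feature_types[fragment_cluster[fragment_id]]
        if new_type != old_type:
            changed.append(fragment_id)
        corrected[fragment_id] = new_type
    return corrected, changed


# Placeholder labels: only the third cluster's feature sequence changed type.
fragment_types = {"r1": "type_A", "r2": "type_B", "r3": "type_C", "r4": "type_C"}
fragment_cluster = {"r1": 1, "r2": 2, "r3": 3, "r4": 3}
feature_types = {1: "type_A", 2: "type_B", 3: "type_C_corrected"}
corrected, changed = correct_variation_types(
    fragment_types, fragment_cluster, feature_types
)
```

Returning the list of changed fragments alongside the corrected mapping makes it easy to see which clusters were actually offset.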

FIG. 10A is a schematic diagram showing a multiple sequence alignment state of another sequencing sequence fragment and a reference sequence fragment provided by an embodiment of the present application. In FIG. 10A, the sequencing sequence fragments are aggregated, according to their variation types, into three sequencing sequence clusters, namely a fourth sequencing sequence cluster, a fifth sequencing sequence cluster, and a sixth sequencing sequence cluster.

FIG. 10B is a schematic diagram of the feature sequences obtained by performing the union process on the sequencing sequence clusters in FIG. 10A in the embodiment of the present application. All the sequencing sequence fragments in the fourth sequencing sequence cluster, the fifth sequencing sequence cluster, and the sixth sequencing sequence cluster are respectively subjected to the union process to obtain the corresponding fourth, fifth, and sixth feature sequences.

As can be seen from FIG. 10A and FIG. 10B, since the sequencing sequence fragments in the fifth sequencing sequence cluster and the sixth sequencing sequence cluster are short relative to the reference sequence fragment, the fifth and sixth feature sequences obtained by performing the union process on all the sequencing sequence fragments in those clusters are also short. If the fifth feature sequence or the sixth feature sequence is directly aligned with the reference sequence fragment, an ideal alignment result is unlikely to be obtained, and the resulting variation type of the feature sequence is inaccurate, which affects the correction of the sequencing sequence fragments.

FIG. 11 is a schematic flowchart of another method for detecting a genomic variation according to an embodiment of the present application. The method may further include the following steps after the step 603, based on the embodiment shown in FIG. 6 :

Step 1101: Perform double sequence alignment on any two of the characteristic sequences of each of the obtained sequencing sequence clusters.

In the embodiment of the present application, after the feature sequences of the sequencing sequence clusters are obtained, any two of the obtained feature sequences are subjected to double sequence alignment to determine whether the sequencing sequence clusters corresponding to the two feature sequences can be further merged. For example, for the feature sequences shown in FIG. 10B, double sequence alignment is performed on the fourth feature sequence and the fifth feature sequence, on the fourth feature sequence and the sixth feature sequence, and on the fifth feature sequence and the sixth feature sequence, respectively.

Step 1102: Determine whether the overlapping regions of the two feature sequences match completely and whether the variation position of at least one of the feature sequences falls completely within the overlapping region.

If the overlapping regions of the two feature sequences do not match completely, the two feature sequences have different variation types in their overlapping regions and therefore cannot be merged. A complete match of the overlapping regions of the two feature sequences is thus the premise of merging the feature sequences. The requirement that the variation position of at least one feature sequence relative to the reference sequence fragment falls completely within the overlapping region ensures that the two feature sequences have at least one variation position with the same variation information in their overlapping regions.

For example, at the first variation position shown in FIG. 10B, the fourth feature sequence and the fifth feature sequence both have a deletion of the base segment CC relative to the second reference sequence fragment, and the first variation position is located in the overlapping region of the fourth feature sequence and the fifth feature sequence, which indicates that the fourth feature sequence and the fifth feature sequence satisfy the above judgment condition. At the second variation position in FIG. 10B, the fourth feature sequence and the sixth feature sequence both have an insertion of the base segment CC relative to the second reference sequence fragment, and the second variation position is located in the overlapping region of the fourth feature sequence and the sixth feature sequence, which indicates that the fourth feature sequence and the sixth feature sequence also satisfy the above judgment condition.

When the above judgment condition is satisfied, the process proceeds to step 1103 to further merge the sequence clusters; otherwise, proceed to step 604 to perform a double sequence alignment of each feature sequence with the reference sequence segment.
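The judgment condition of step 1102 can be sketched as follows. This is a sketch under stated assumptions: the function name `can_merge`, the dictionary layout, the use of aligned coordinates with `-` for a deleted base, and the toy sequences are all hypothetical and are not taken from FIG. 10B.

```python
def can_merge(a, b):
    """Check the merge condition for two feature sequences.

    Each feature sequence is a dict with:
      start, end  - 1-based inclusive interval in alignment coordinates
      bases       - aligned bases over that interval ('-' for a gap)
      variations  - list of (start, end) intervals of variation positions
    """
    lo = max(a["start"], b["start"])
    hi = min(a["end"], b["end"])
    if lo > hi:
        return False  # no overlapping region at all
    a_part = a["bases"][lo - a["start"]: hi - a["start"] + 1]
    b_part = b["bases"][lo - b["start"]: hi - b["start"] + 1]
    if a_part != b_part:
        return False  # overlapping regions do not match completely
    # At least one variation position must lie entirely inside the overlap.
    return any(lo <= s and e <= hi for s, e in a["variations"] + b["variations"])


# Toy feature sequences sharing a two-base deletion at positions 5-6.
long_feature = {"start": 1, "end": 12, "bases": "ACGT--ACGTAC", "variations": [(5, 6)]}
short_feature = {"start": 4, "end": 16, "bases": "T--ACGTACGGTA", "variations": [(5, 6)]}
# Matches in the overlap (10-12) but no variation position lies inside it.
tail_feature = {"start": 10, "end": 16, "bases": "TACGGTA", "variations": []}
```

Here `can_merge(long_feature, short_feature)` holds because the overlap (4, 12) is identical and contains the deletion, while `can_merge(long_feature, tail_feature)` fails the second part of the condition.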

Step 1103: Merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform the union process on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.

Since the sequencing sequence clusters and the feature sequences have a one-to-one correspondence, after sequencing sequence clusters are merged, their feature sequences also need to be combined. Merging the sequencing sequence clusters corresponding to the two feature sequences refers to replacing the two sequencing sequence clusters before the merge with the merged sequencing sequence cluster, thereby updating the sequencing sequence clusters. Performing the union process on the two feature sequences refers to replacing the two feature sequences before the union process with the feature sequence obtained by the union process, thereby updating the feature sequences.

After the execution of step 1103 is completed, the process returns to step 1101 to continue the dual sequence alignment of the feature sequences to determine whether there are still clusters of sequencing sequences that meet the merge conditions. Wherein, the feature sequence in step 1101 includes a feature sequence obtained by the union process, and the sequence cluster in step 1103 includes the merged sequence cluster.
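The loop formed by steps 1101 to 1103 can be sketched as a repeated pairwise scan that stops when no pair satisfies the merge condition. The sketch is deliberately generic: `merge_until_stable` is a hypothetical name, and the merge condition and union process are passed in as functions. The toy example below substitutes plain interval overlap for the real condition of step 1102, purely to keep the example short.

```python
from itertools import combinations


def merge_until_stable(features, can_merge, merge):
    """Repeatedly merge any mergeable pair until no pair qualifies."""
    features = list(features)
    merged_any = True
    while merged_any:
        merged_any = False
        for i, j in combinations(range(len(features)), 2):
            if can_merge(features[i], features[j]):
                combined = merge(features[i], features[j])
                # Replace the two inputs with the merged result, then rescan
                # so the merged result can itself be merged again.
                features = [f for k, f in enumerate(features) if k not in (i, j)]
                features.append(combined)
                merged_any = True
                break
    return features


# Toy stand-ins: features are (start, end) intervals, the condition is plain
# overlap, and the merge is the covering interval.
def toy_overlap(a, b):
    return max(a[0], b[0]) <= min(a[1], b[1])


def toy_union(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))


stable = merge_until_stable([(1, 10), (8, 20), (18, 30), (50, 60)], toy_overlap, toy_union)
```

Restarting the scan after every merge mirrors the return from step 1103 to step 1101, where the newly obtained feature sequence takes part in the next round of comparisons.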

Referring to FIG. 12A and FIG. 12B, FIG. 12A is a schematic diagram of the process of merging the feature sequences in FIG. 10B according to an embodiment of the present application, and FIG. 12B is a schematic diagram of the process of merging the sequencing sequence clusters in FIG. 10A according to an embodiment of the present application. As shown in FIG. 12A, the fourth feature sequence and the fifth feature sequence are first subjected to double sequence alignment. Because the overlapping regions of the fourth feature sequence and the fifth feature sequence match completely, and the first variation position (the deletion of the base segment CC) falls completely within their overlapping region, the fourth feature sequence and the fifth feature sequence are combined to obtain a seventh feature sequence. Accordingly, as shown in FIG. 12B, the fourth sequencing sequence cluster and the fifth sequencing sequence cluster are merged to obtain a seventh sequencing sequence cluster.

Further, the seventh feature sequence and the sixth feature sequence are subjected to double sequence alignment. Because the overlapping regions of the seventh feature sequence and the sixth feature sequence match completely, and the second variation position (the insertion of the base segment CC) falls completely within their overlapping region, the seventh feature sequence and the sixth feature sequence are combined to obtain an eighth feature sequence. Correspondingly, the seventh sequencing sequence cluster and the sixth sequencing sequence cluster are merged to obtain an eighth sequencing sequence cluster. Then, in the subsequent step 604, only the eighth feature sequence needs to be double-sequence aligned with the reference sequence fragment, and the sequencing sequence fragments in the eighth sequencing sequence cluster are corrected according to the variation type of the eighth feature sequence. In step 604, each feature sequence is double-sequence aligned with the reference sequence fragment, where "each feature sequence" includes both the feature sequences of the sequencing sequence clusters that were not merged because they did not meet the merge condition and the feature sequences of the merged sequencing sequence clusters.

It can be seen from the above embodiment that, in the embodiment of the present application, the sequencing sequence clusters that meet the merge condition are further merged to increase the length of the feature sequences, thereby improving the accuracy of the double sequence alignment of the feature sequences and the reference sequence fragment.

Corresponding to the genomic variation detecting method of the present application, the present application also provides a genomic variation detecting device.

FIG. 13 is a schematic structural diagram of a first genomic variation detecting apparatus according to an embodiment of the present application.

The first genomic variation detecting apparatus 1300 may include: a first double sequence alignment unit 1301, a potential variation region determining unit 1302, a sequencing sequence fragment extracting unit 1303, a reference sequence fragment extracting unit 1304, a multiple sequence alignment unit 1305, and a variation detection result determining unit 1306.

The first double sequence alignment unit 1301 is configured to perform double sequence alignment on multiple sequencing sequences of the genome and the reference sequence, respectively, to obtain a double sequence alignment result, wherein the reference sequence is the base sequence when the genome has no variation, and the sequencing sequence is the base sequence to be detected in the genome.

The potential variation region determining unit 1302 is configured to determine a potential variation region of the genome according to the double sequence alignment result, where the potential variation region is a base coding interval in which a potential variation occurs in the genome.

The sequencing sequence fragment extracting unit 1303 is configured to extract a sequencing sequence fragment from all the sequencing sequences according to the potential variation region.

The reference sequence segment extracting unit 1304 is configured to extract a reference sequence segment in the reference sequence according to the potential variation region.

The multiple sequence alignment unit 1305 is configured to perform multiple sequence alignment on the reference sequence fragment and all the sequenced sequence fragments to obtain a multiple sequence alignment result.

The mutation detection result determining unit 1306 is configured to determine a mutation detection result of the genome according to the multiple sequence alignment result.

In a possible implementation manner of the present application, the potential variation region determining unit 1302 includes: a first coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation type determining subunit, configured to determine the variation types of all the sequencing sequences according to the double sequence alignment result; a probability distribution value statistical subunit, configured to sequentially count the probability distribution values of the sequencing sequences of different variation types in each of the coding intervals; an information entropy calculation subunit, configured to calculate the information entropy of each of the coding intervals according to the probability distribution values; a first threshold determination subunit, configured to sequentially determine whether the information entropy of each of the coding intervals is greater than a first threshold; and a first potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of that coding interval is greater than the first threshold.

In a possible implementation manner of the present application, the potential variation region determining unit 1302 includes: a second coding interval dividing subunit, configured to divide the genome into multiple coding intervals according to the base coding order of the genome; a variation quantity statistical subunit, configured to sequentially count the number of sequencing sequences in which variation occurs in each of the coding intervals; a second threshold determining subunit, configured to determine whether the number of sequencing sequences in which variation occurs in each of the coding intervals is greater than a second threshold; and a second potential variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences in which variation occurs within that coding interval is greater than the second threshold.
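The entropy-based determination can be sketched as below. The text only says "information entropy"; the sketch assumes Shannon entropy with a base-2 logarithm, and the function names, the variation type labels, and the threshold value 0.5 are hypothetical.

```python
import math
from collections import Counter


def interval_entropy(variation_types):
    """Shannon entropy (base 2, an assumption) of the variation types
    observed among the sequencing sequences in one coding interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_potential_variation_region(variation_types, first_threshold):
    """A coding interval qualifies when its entropy exceeds the threshold."""
    return interval_entropy(variation_types) > first_threshold


# Hypothetical intervals: one where every read agrees, one split evenly.
uniform_interval = ["none", "none", "none", "none"]
mixed_interval = ["none", "none", "snp", "snp"]
FIRST_THRESHOLD = 0.5  # illustrative value only
```

An interval in which all sequencing sequences share one variation type has entropy 0, while an even split between two types has entropy 1, so a higher entropy indicates stronger disagreement among the reads covering that interval.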

In a possible implementation manner of the present application, the sequencing sequence segment extracting unit 1303 is specifically configured to extract an intersection portion of each of the sequencing sequence and the potential variation region as the sequencing sequence segment.

In a possible implementation manner of the present application, the sequencing sequence segment extracting unit 1303 is configured to: when the intersection determining subunit determines that a sequencing sequence and the potential variation region have an intersection, extract that sequencing sequence as the sequencing sequence fragment.

In a possible implementation manner of the present application, the reference sequence segment extracting unit 1304 is specifically configured to extract an intersection portion of the reference sequence and the potential variation region as the reference sequence segment.

In a possible implementation manner of the present application, the mutation detection result determining unit 1306 includes: a variation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result; a variation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of all the sequencing sequence fragments at the variation position; a sequencing sequence set converging subunit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position; a third threshold determination subunit, configured to sequentially determine whether the number of sequencing sequence fragments in each of the sequencing sequence sets is greater than a third threshold; and a variation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in one of the sequencing sequence sets is greater than the third threshold, that the variation information of the sequencing sequence fragments in that sequencing sequence set is the variation detection result of the genome.
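The set aggregation and third-threshold test can be sketched as a count per variation information value. The function name `call_variations`, the string labels for variation information, and the threshold value are hypothetical and chosen only for illustration.

```python
from collections import Counter


def call_variations(variation_info_per_fragment, third_threshold):
    """Return the variation information values supported by more than
    `third_threshold` sequencing sequence fragments.

    `variation_info_per_fragment` holds one entry per fragment: the
    variation information that fragment shows at the variation position.
    """
    set_sizes = Counter(variation_info_per_fragment)
    return [info for info, size in set_sizes.items() if size > third_threshold]


# Hypothetical observations at one variation position: five fragments carry
# an insertion, four a deletion, and a single fragment a mismatch.
observations = ["ins:CCT"] * 5 + ["del:CC"] * 4 + ["mismatch:A"]
calls = call_variations(observations, third_threshold=3)
```

Variation information seen in only a few fragments, such as the single mismatch above, is not reported, which helps filter out sequencing errors.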

FIG. 14 is a schematic structural diagram of a second genomic variation detecting apparatus according to an embodiment of the present application.

On the basis of the first genomic variation detecting apparatus 1300 shown in FIG. 13, the second genomic variation detecting apparatus 1400 further includes: a mutation type determining unit 1401, a sequencing sequence cluster converging unit 1402, a union processing unit 1403, a second double sequence alignment unit 1404, and a correcting unit 1405.

The mutation type determining unit 1401 is configured to determine a variation type of all the sequence segments according to the multiple sequence alignment result.

The sequencing sequence cluster converging unit 1402 is configured to aggregate all the sequenced sequence fragments into at least one sequencing sequence cluster according to the variation type of the all sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type.

The union processing unit 1403 is configured to perform a union process on each of the sequencing sequence segments in each of the sequencing sequence clusters to obtain a characteristic sequence of each of the sequencing sequence clusters.

The second double sequence alignment unit 1404 is configured to perform double sequence alignment on each of the feature sequences and the reference sequence segments to obtain a variation type of each of the feature sequences.

The correcting unit 1405 is configured to correct the multiple sequence alignment result according to the variation type of each of the feature sequences, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the feature sequence corresponding to that sequencing sequence fragment.

FIG. 15 is a schematic structural diagram of a third genomic variation detecting apparatus according to an embodiment of the present application.

On the basis of the second genomic variation detecting apparatus 1400 shown in FIG. 14, the third genomic variation detecting apparatus 1500 further includes a third double sequence alignment unit 1501, an overlap region determining unit 1502, and a merging unit 1503.

The third double sequence alignment unit 1501 is configured to perform double sequence alignment on any two of the feature sequences of each of the obtained sequencing sequence clusters.

The overlap region determining unit 1502 is configured to determine whether the overlapping regions of the two feature sequences match completely and whether the variation position of at least one of the feature sequences falls completely within the overlapping region.

The merging unit 1503 is configured to, when the overlapping regions of the two feature sequences match completely and the variation position of at least one of the feature sequences falls completely within the overlapping region, merge the sequencing sequence clusters corresponding to the two feature sequences to obtain a merged sequencing sequence cluster, and perform the union process on the two feature sequences to obtain the feature sequence of the merged sequencing sequence cluster.

For the relationship between the functional units in the genomic variation detecting apparatus provided in the embodiment of the present application, reference may be made to the steps in the foregoing genomic variation detecting method, and details are not described herein again.

Corresponding to the genomic variation detection method of the present application, the present application also provides a genomic mutation detection terminal.

FIG. 16 is a schematic structural diagram of a genomic mutation detecting terminal according to an embodiment of the present application. The genomic variation detecting terminal 1600 may include: a processor 1601, a memory 1602, and a communication unit 1603. These components communicate through one or more buses. It will be understood by those skilled in the art that the structure shown in the figure does not constitute a limitation of the present application: it may be a bus structure or a star structure, and it may include more or fewer components than shown in the drawings, combine some components, or use a different component arrangement.

The communication unit 1603 is configured to establish a communication channel so that the storage device can communicate with other devices, and to receive user data sent by other devices or send user data to other devices.

The processor 1601 is the control center of the storage device. It connects the various parts of the entire electronic device by using various interfaces and lines, and performs the various functions of the electronic device and/or processes data by running or executing software programs and/or modules stored in the memory 1602 and calling data stored in the memory. The processor may be composed of an integrated circuit (IC); for example, it may be composed of a single packaged IC, or of a plurality of packaged ICs that have the same function or different functions. For example, the processor 1601 may include only a Central Processing Unit (CPU). In the embodiment of the present application, the CPU may have a single operation core or multiple operation cores.

The memory 1602 is configured to store execution instructions of the processor 1601. The memory 1602 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

When the execution instructions in the memory 1602 are executed by the processor 1601, the genomic mutation detecting terminal 1600 is enabled to perform the following steps:

Double-sequence alignment of a plurality of sequencing sequences of the genome and the reference sequence, respectively, to obtain a double sequence alignment result, wherein the reference sequence is a base sequence when the genome is not mutated, and the sequencing sequence is a base sequence to be detected by the genome; determining a potential variation region of the genome according to the double sequence alignment result, the potential variation region being a base coding interval in which a potential mutation occurs in the genome; according to the potential variation a region, a sequencing sequence fragment is extracted from all the sequencing sequences; a reference sequence fragment is extracted from the reference sequence according to the potential variation region; and the reference sequence fragment and all the sequenced fragments are subjected to multiple sequence alignment to obtain Multiple sequence alignment results; determining the variation detection results of the genome based on the multiple sequence alignment results.

In a specific implementation, the present application further provides a computer storage medium. The computer storage medium may store a program, and the program, when executed, may perform some or all of the steps in the embodiments of the genomic variation detection method provided by the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Those skilled in the art can clearly understand that the technology in the embodiments of the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution in the embodiments of the present application, in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain portions of the embodiments, of the present application.

For the same or similar parts of the various embodiments in this specification, reference may be made from one embodiment to another. In particular, because the device embodiment and the terminal embodiment are basically similar to the method embodiment, they are described relatively briefly; for the relevant points, refer to the description of the method embodiment.

The embodiments of the present application described above are not intended to limit the scope of the present application.

Claims

A method for detecting genomic variation, comprising:

Performing double sequence alignment between each of a plurality of sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and each sequencing sequence is a base sequence of the genome to be detected;

Determining, according to the double sequence alignment result, a potential variation region of the genome, wherein the potential variation region is a base coding interval in which a potential mutation occurs in the genome;

Extracting sequencing sequence fragments from all the sequencing sequences according to the potential variation region;

Extracting a reference sequence fragment from the reference sequence according to the potential variation region;

Performing multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result;

Determining a variation detection result of the genome according to the multiple sequence alignment result.
The genomic variation detecting method according to claim 1, wherein after the multiple sequence alignment is performed on the reference sequence fragment and all the sequencing sequence fragments to obtain the multiple sequence alignment result, the method further comprises:

Determining the variation type of each of the sequencing sequence fragments according to the multiple sequence alignment result;

Aggregating all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of the sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type;

Performing union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster;

Performing double sequence alignment between each characteristic sequence and the reference sequence fragment to obtain a variation type of each characteristic sequence;

Correcting the multiple sequence alignment result according to the variation type of each characteristic sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the characteristic sequence corresponding to that sequencing sequence fragment.
The genomic variation detecting method according to claim 2, wherein after the union processing is performed on all the sequencing sequence fragments in each sequencing sequence cluster to obtain the characteristic sequence of each sequencing sequence cluster, the method further comprises:

Performing double sequence alignment on any two of the obtained characteristic sequences of the sequencing sequence clusters;

Determining whether the overlapping regions of the two characteristic sequences match exactly and whether the variation position of at least one of the two characteristic sequences lies entirely within the overlapping region;

When the overlapping regions of the two characteristic sequences match exactly and the variation position of at least one of the two characteristic sequences lies entirely within the overlapping region, merging the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and performing union processing on the two characteristic sequences to obtain a characteristic sequence of the merged sequencing sequence cluster.
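
For illustration only, the union processing and the merge condition described above might be sketched in Python as follows. The sketch assumes that each fragment or characteristic sequence is a gapless `(start, bases)` pair in shared region coordinates, that fragments within one cluster agree wherever they overlap, and that a characteristic sequence without any variation position never satisfies the merge condition. The names `union` and `can_merge` are hypothetical.

```python
from typing import List, Sequence, Tuple

Placed = Tuple[int, str]  # (start coordinate, bases); assumed gapless placement


def union(fragments: Sequence[Placed]) -> Placed:
    """Combine overlapping fragments into one sequence spanning all of them."""
    lo = min(start for start, _ in fragments)
    hi = max(start + len(bases) for start, bases in fragments)
    merged: List[str] = [""] * (hi - lo)
    for start, bases in fragments:
        for i, base in enumerate(bases):
            # Later fragments overwrite earlier ones; they are assumed to agree.
            merged[start - lo + i] = base
    if "" in merged:
        raise ValueError("fragments do not cover a contiguous span")
    return lo, "".join(merged)


def can_merge(a: Placed, b: Placed,
              var_pos_a: Sequence[int], var_pos_b: Sequence[int]) -> bool:
    """Check the two conditions: exact overlap match and contained variation."""
    (sa, qa), (sb, qb) = a, b
    lo, hi = max(sa, sb), min(sa + len(qa), sb + len(qb))
    if lo >= hi:
        return False  # no overlapping region at all
    if qa[lo - sa:hi - sa] != qb[lo - sb:hi - sb]:
        return False  # overlapping regions do not match exactly

    def inside(positions: Sequence[int]) -> bool:
        # Assumption: an empty position list does not count as "within".
        return len(positions) > 0 and all(lo <= p < hi for p in positions)

    return inside(var_pos_a) or inside(var_pos_b)
```

When `can_merge` returns true for two characteristic sequences, calling `union` on the pair gives one candidate for the characteristic sequence of the merged cluster.
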
The genomic variation detecting method according to claim 1, wherein determining the potential variation region of the genome according to the double sequence alignment result comprises:

Dividing the genome into a plurality of coding intervals according to a base coding order of the genome;

Determining the variation type of each sequencing sequence according to the double sequence alignment result;

Calculating, for each of the coding intervals in turn, probability distribution values of the sequencing sequences of the different variation types in that coding interval;

Calculating an information entropy of each of the coding intervals according to the probability distribution value;

Determining, in turn, whether the information entropy of each of the coding intervals is greater than a first threshold;

When the information entropy of one of the coding intervals is greater than the first threshold, it is determined that the coding interval is a potential variation region.
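
As a hedged illustration of the entropy-based determination above, the following Python sketch computes the information entropy of each coding interval from the variation types of the sequencing sequences that fall in it. It assumes the Shannon form H = -Σ p·log2(p) with a base-2 logarithm, which the text above does not specify. The names `interval_entropy` and `potential_regions` and the variation type labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def interval_entropy(variation_types: Iterable[str]) -> float:
    """Entropy of the variation type distribution within one coding interval."""
    counts = Counter(variation_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Probability distribution value of each variation type is count / total.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def potential_regions(types_by_interval: Dict[Interval, List[str]],
                      first_threshold: float) -> List[Interval]:
    """Keep the intervals whose entropy is greater than the first threshold."""
    return [interval for interval, types in types_by_interval.items()
            if interval_entropy(types) > first_threshold]
```

An interval in which every sequencing sequence has the same variation type has entropy 0, while an interval split evenly between two types has entropy 1 under this assumption, so a threshold between the two separates them.
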
The genomic variation detecting method according to claim 1, wherein determining the potential variation region of the genome according to the double sequence alignment result comprises:

Dividing the genome into a plurality of coding intervals according to a base coding order of the genome;

Counting, for each of the coding intervals in turn, the number of sequencing sequences in which a variation occurs within that coding interval;

Determining whether the number of sequencing sequences in which a variation occurs within each of the coding intervals is greater than a second threshold;

When the number of sequencing sequences in which a variation occurs within one of the coding intervals is greater than the second threshold, determining that the coding interval is a potential variation region.
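
The count-based determination above can be sketched in the same spirit. This Python example assumes that the double sequence alignment has already produced, for each sequencing sequence, a list of reference coordinates at which it differs from the reference sequence, and that coding intervals are half-open `(start, end)` pairs. The names `count_variant_reads` and `regions_by_count` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int]  # assumed convention: half-open [start, end)


def count_variant_reads(intervals: Sequence[Interval],
                        variant_positions_per_read: Sequence[Sequence[int]]
                        ) -> Dict[Interval, int]:
    """Count, per interval, the sequencing sequences with a variation inside it."""
    counts = {interval: 0 for interval in intervals}
    for positions in variant_positions_per_read:
        for start, end in intervals:
            # A sequencing sequence is counted once per interval it varies in.
            if any(start <= p < end for p in positions):
                counts[(start, end)] += 1
    return counts


def regions_by_count(counts: Dict[Interval, int],
                     second_threshold: int) -> List[Interval]:
    """Keep the intervals whose count is greater than the second threshold."""
    return [interval for interval, n in counts.items() if n > second_threshold]
```
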
The genomic variation detecting method according to claim 1, wherein extracting the sequencing sequence fragments from all the sequencing sequences according to the potential variation region comprises:

Extracting the intersection portion of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
The genomic variation detecting method according to claim 1, wherein extracting the sequencing sequence fragments from all the sequencing sequences according to the potential variation region comprises:

When a sequencing sequence has an intersection with the potential variation region, extracting that sequencing sequence as the sequencing sequence fragment.
The genomic variation detecting method according to claim 1, wherein extracting the reference sequence fragment from the reference sequence according to the potential variation region comprises:

Extracting the intersection portion of the reference sequence and the potential variation region as the reference sequence fragment.
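
The two fragment extraction variants and the reference extraction above differ only in how much of a sequence is kept. The following Python sketch contrasts them under a simplifying assumption: each sequence is placed on the reference without gaps, so that base `i` of the sequence sits at reference coordinate `start + i` (a real alignment containing indels would need its alignment record instead). The names `clip_to_region` and `whole_if_overlapping` are hypothetical.

```python
from typing import Optional, Tuple

Region = Tuple[int, int]  # assumed convention: half-open [start, end)


def clip_to_region(seq: str, start: int, region: Region) -> Optional[str]:
    """Keep only the intersection portion of the sequence and the region."""
    lo = max(start, region[0])
    hi = min(start + len(seq), region[1])
    if lo >= hi:
        return None  # no intersection with the region
    return seq[lo - start:hi - start]


def whole_if_overlapping(seq: str, start: int, region: Region) -> Optional[str]:
    """Keep the entire sequence whenever it intersects the region."""
    lo = max(start, region[0])
    hi = min(start + len(seq), region[1])
    return seq if lo < hi else None
```

Calling `clip_to_region(reference, 0, region)` gives the reference sequence fragment. Clipping keeps the later multiple sequence alignment small, while keeping whole sequences preserves flanking context; which is preferable is a design choice that this sketch does not settle.
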
The genomic variation detecting method according to claim 1, wherein determining the variation detection result of the genome according to the multiple sequence alignment result comprises:

Determining a variation position in the potential variation region according to the multiple sequence alignment result;

Extracting, from the multiple sequence alignment result, the variation information of each of the sequencing sequence fragments at the variation position;

Aggregating all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position;

Determining, in turn, whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold;

When the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, determining the variation information of the sequencing sequence fragments in that sequencing sequence set as the variation detection result of the genome.
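
A minimal Python sketch of the set-based determination above follows. It assumes that the multiple sequence alignment result is given as equal-length rows (one reference row and one row per sequencing sequence fragment) in which `-` marks a gap, and that every fragment covers every column. A variation position is taken to be any column where at least one fragment differs from the reference row. The name `call_variation` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def call_variation(ref_row: str, read_rows: Sequence[str],
                   third_threshold: int) -> List[Dict[int, str]]:
    """Return the variation information of every sufficiently large set."""
    # Variation positions: columns where some fragment differs from the reference.
    positions = [i for i, ref_base in enumerate(ref_row)
                 if any(row[i] != ref_base for row in read_rows)]
    # Fragments with identical symbols at those columns form one set.
    sets = Counter(tuple(row[i] for i in positions) for row in read_rows)
    # Keep the sets whose size is greater than the third threshold.
    return [dict(zip(positions, info))
            for info, size in sets.items() if size > third_threshold]
```

Each returned dictionary maps an alignment column to the symbol observed there, so an insertion appears as a base in a column where the reference row holds a gap.
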
A genomic variation detecting device, comprising:

a first double sequence alignment unit, configured to perform double sequence alignment between each of a plurality of sequencing sequences of a genome and a reference sequence to obtain a double sequence alignment result, wherein the reference sequence is the base sequence of the genome when no variation has occurred, and each sequencing sequence is a base sequence of the genome to be detected;

a potential variation region determining unit, configured to determine a potential variation region of the genome according to the double sequence alignment result, wherein the potential variation region is a base coding interval in which a potential variation occurs in the genome;

a sequencing sequence fragment extracting unit for extracting a sequencing sequence fragment from all the sequencing sequences according to the potential variation region;

a reference sequence segment extracting unit, configured to extract a reference sequence fragment from the reference sequence according to the potential variation region;

a multi-sequence aligning unit, configured to perform multiple sequence alignment on the reference sequence fragment and all the sequencing sequence fragments to obtain a multiple sequence alignment result;

a mutation detection result determining unit, configured to determine a variation detection result of the genome according to the multiple sequence alignment result.
The genomic variation detecting device according to claim 10, further comprising:

a mutation type determining unit, configured to determine a variation type of each of the sequencing sequence fragments according to the multiple sequence alignment result;

a sequencing sequence cluster converging unit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence cluster according to the variation types of the sequencing sequence fragments, wherein the sequencing sequence fragments in the same sequencing sequence cluster have the same variation type;

a union processing unit, configured to perform union processing on all the sequencing sequence fragments in each sequencing sequence cluster to obtain a characteristic sequence of each sequencing sequence cluster;

a second double sequence alignment unit, configured to perform double sequence alignment between each characteristic sequence and the reference sequence fragment to obtain a variation type of each characteristic sequence;

a correcting unit, configured to correct the multiple sequence alignment result according to the variation type of each characteristic sequence, wherein, in the corrected multiple sequence alignment result, the variation type of each sequencing sequence fragment is the same as the variation type of the characteristic sequence corresponding to that sequencing sequence fragment.
The genomic variation detecting device according to claim 11, further comprising:

a third double sequence alignment unit, configured to perform double sequence alignment on any two of the obtained characteristic sequences of the sequencing sequence clusters;

an overlapping area determining unit, configured to determine whether the overlapping regions of the two characteristic sequences match exactly and whether the variation position of at least one of the two characteristic sequences lies entirely within the overlapping region;

a merging unit, configured to: when the overlapping regions of the two characteristic sequences match exactly and the variation position of at least one of the two characteristic sequences lies entirely within the overlapping region, merge the sequencing sequence clusters corresponding to the two characteristic sequences to obtain a merged sequencing sequence cluster, and perform union processing on the two characteristic sequences to obtain a characteristic sequence of the merged sequencing sequence cluster.
The genomic variation detecting device according to claim 10, wherein the potential variation region determining unit comprises:

a first coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to a base coding order of the genome;

a mutation type determining subunit, configured to determine a variation type of each sequencing sequence according to the double sequence alignment result;

a probability distribution value statistical subunit, configured to calculate, for each of the coding intervals in turn, probability distribution values of the sequencing sequences of the different variation types in that coding interval;

an information entropy calculation subunit, configured to calculate an information entropy of each of the coding intervals according to the probability distribution values;

a first threshold determining subunit, configured to determine, in turn, whether the information entropy of each of the coding intervals is greater than a first threshold;

a first latent variation region determining subunit, configured to determine that a coding interval is a potential variation region when the information entropy of that coding interval is greater than the first threshold.
The genomic variation detecting device according to claim 10, wherein the potential variation region determining unit comprises:

a second coding interval dividing subunit, configured to divide the genome into a plurality of coding intervals according to a base coding order of the genome;

a mutation quantity statistical subunit, configured to count, for each of the coding intervals in turn, the number of sequencing sequences in which a variation occurs within that coding interval;

a second threshold determining subunit, configured to determine whether the number of sequencing sequences in which a variation occurs within each of the coding intervals is greater than a second threshold;

a second latent variation region determining subunit, configured to determine that a coding interval is a potential variation region when the number of sequencing sequences in which a variation occurs within that coding interval is greater than the second threshold.
The genomic variation detecting device according to claim 10, wherein

the sequencing sequence fragment extracting unit is specifically configured to extract the intersection portion of each sequencing sequence and the potential variation region as the sequencing sequence fragment.
The genomic variation detecting device according to claim 10, wherein

the sequencing sequence fragment extracting unit is specifically configured to extract a sequencing sequence as the sequencing sequence fragment when an intersection determining subunit determines that the sequencing sequence has an intersection with the potential variation region.
The genomic variation detecting device according to claim 10, wherein

the reference sequence segment extracting unit is specifically configured to extract the intersection portion of the reference sequence and the potential variation region as the reference sequence fragment.
The genomic variation detecting device according to claim 10, wherein the mutation detection result determining unit comprises:

a mutation position determining subunit, configured to determine a variation position in the potential variation region according to the multiple sequence alignment result;

a mutation information extraction subunit, configured to extract, from the multiple sequence alignment result, the variation information of each of the sequencing sequence fragments at the variation position;

a sequencing sequence aggregation converging subunit, configured to aggregate all the sequencing sequence fragments into at least one sequencing sequence set according to the variation information, wherein the sequencing sequence fragments in the same sequencing sequence set have the same variation information at the variation position;

a third threshold determining subunit, configured to determine, in turn, whether the number of sequencing sequence fragments in each sequencing sequence set is greater than a third threshold;

a mutation detection result determining subunit, configured to determine, when the number of sequencing sequence fragments in a sequencing sequence set is greater than the third threshold, the variation information of the sequencing sequence fragments in that sequencing sequence set as the variation detection result of the genome.
A genomic variation detecting terminal, comprising:

a processor;

a memory for storing execution instructions of the processor;

wherein the processor is configured to perform the method according to any one of claims 1 to 9.