CN113555062B

CN113555062B - Data analysis system and analysis method for genome base variation detection

Info

Publication number: CN113555062B
Application number: CN202110836939.1A
Authority: CN
Inventors: 刘珍; 刘志岩; 王海宁
Original assignee: Harbin Yinji Technology Co ltd
Current assignee: Harbin Yinji Technology Co ltd
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2022-07-12
Anticipated expiration: 2041-07-23
Also published as: CN113555062A

Abstract

The invention provides a data analysis method and an analysis system for genome base variation detection, which divide structural data of original genome sequencing into a plurality of subdata sequences and store the subdata sequences into a hierarchical distributed database; performing quality filtering on a plurality of subdata sequences stored in the layered distributed database, and comparing the filtered subdata sequences with a reference sequence; evaluating the similarity of the subdata sequence and the reference sequence to further obtain the probability of the base type; generating a classification score of the likelihood of the candidate variant at the base position of interest.

Description

Data analysis system and analysis method for genome base variation detection

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a data analysis system and method for detecting genome base variation.

Background

The development of the second generation sequencing technology brings revolutionary breakthrough to the field of life science research, so that researchers can quickly and conveniently acquire genome sequence data, and unprecedented opportunities are provided for understanding life mechanisms and realizing precise medical treatment. However, how to rapidly analyze these massive sequencing data to serve clinical diagnosis and treatment is an urgent problem for biological researchers.

At present, on a cluster server generally configured, human genome data with a sequencing depth of 30X is calculated, which usually requires 30-40 hours and cannot keep up with the generation speed of sequencing data, so that the analysis efficiency of the human genome data needs to be improved.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a data analysis method for detecting genomic base variation, comprising the steps of:

s1, dividing the original genome sequenced structure data into a plurality of subdata sequences and storing the subdata sequences into a distributed database;

s2, filtering the quality of the subdata sequences stored in the distributed database, and comparing the filtered subdata sequences with the reference sequence;

s3, evaluating the similarity between the subdata sequence and the reference sequence to further obtain the probability of the base type, which comprises the following steps:

s31, defining an active region, and identifying at least one active region according to the sequence alignment data of the sequencing data;

s32, constructing an assembly graph for each active region by using the De-Bruijn graph and the hash table and taking the reference sequence as a template, sequentially reading each segment of Reads to match with one segment of the graph, generating all possibly formed paths in a mode of dynamically generating nodes, and screening out paths with good support to represent possible haplotypes;

s33, adding the generated basic data of all possible paths into a probability calculation queue;

s34, re-aligning the sequences by using a Pair HMM algorithm, and combining the calculated score M to obtain each haploid likelihood value;

s35, calculating the probability of the base type by adopting Bayes theorem;

based on Bayes total probability formula, the probability that genotype G is basic type under the condition of observing data D

Calculated by the following formula:

；

wherein i is equal to 1, 2, 3, i.e. having

；

A probability representing the base type;

is the probability of the occurrence of observation data D given genotype G, the value of which is determined by the haploid likelihood value

Substituting the following equation yields:

；

s4, generating a classification score of the likelihood of the candidate variant at the target base position.

Further, in the step S1,

the method comprises the steps of adopting a characteristic extraction and segmentation method to segment structural data, segmenting according to a set length, using IDs of reads of segmented sequences as indexes, using chain labels, base sequences and quality as values, assembling sequences with the same index into a sub-data sequence, calculating the length of each sub-data sequence, and storing the sub-data sequences with different indexes into a plurality of storage servers in a distributed database based on the length of the sub-data sequences.

Further, the score is calculated using the following Smith-Waterman formula:

M＝R*C；

R＝a*L+b；

wherein M represents the Smith-Waterman calculated score, R is the length of the reference sequence, C represents the length of each sub-data sequence, L represents the length of the data stored by the sub-storage server, and a and b represent constants.

Further, for the newly added sequence, the corresponding index is judged according to the chain label, base sequence or quality as a value, so as to add the corresponding subdata sequence and store the subdata sequence in a storage server in a distributed database.

The present invention further provides a data analysis system for implementing the foregoing data analysis method, including:

the structure data acquisition unit is used for acquiring the structure data of genome sequencing in the original FASTQ file;

the dividing and storing unit is electrically connected with the structural data acquiring unit and is used for dividing the structural data into a plurality of subdata sequences and storing the subdata sequences into a distributed database;

the sequence comparison unit is electrically connected with the distributed database and is used for filtering the quality of a plurality of subdata sequences stored in the distributed database and comparing the filtered subdata sequences with a reference sequence;

the probability evaluation unit is electrically connected with the sequence comparison unit and is used for evaluating the similarity between the subdata sequence and the reference sequence so as to obtain the genotyping probability;

a classification scoring unit, electrically connected to the probability evaluation unit, for generating a variant likelihood classification score for the target base position.

Furthermore, the sequence alignment unit further comprises a calibration subunit, which is used for evaluating the alignment quality of the subdata sequences and calibrating the correctness thereof.

Further, the classification scoring unit outputs a classification score of the variant likelihood by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for feature integration and a finally-connected classifier.

The noun explains:

a gene refers to a DNA or RNA sequence carrying genetic information (i.e., a gene is a DNA or RNA fragment having a genetic effect), also called a genetic element, which is the basic genetic unit for controlling a trait. The gene expresses the genetic information carried by the gene by guiding the synthesis of protein, thereby controlling the character expression of the organism individual.

Gene sequencing is a novel gene detection technology, and the complete sequence of genes is analyzed and determined from blood or saliva, so that the possibility of suffering from various diseases, the behavior characteristics and the behavior reasonableness of an individual and the like are predicted.

reads: the method is a short sequencing fragment, is sequencing data generated by a high-throughput sequencer, can generate hundreds of thousands of reads when sequencing the whole genome, and can obtain the complete sequence of the genome by splicing the reads.

And (3) sequence alignment analysis: the short sequences (reads) sequenced by NGS are stored in FASTQ files, and although they are originally from an ordered genome, the tandem relationship between different reads in the files is completely lost after DNA banking and sequencing. Therefore, there is no positional relationship between the two reads immediately following the FASTQ file, and they are both random from the short sequence at a position in the original genome. Therefore, we need to sort out the large stack of short sequences, compare them with the reference genome of the species one by one, find the position of each read on the reference genome, and then arrange them in order, which is called alignment of sequencing data.

And (3) comparison algorithm: the computational methods for sequence alignment are generally divided into two categories: global alignments (global alignments) and local alignments (local alignments). Computing a global route is a form of global optimization that forces all query sequences to align in terms of the entire length. In contrast, local alignments only determine local similarity while the entire long sequence is often quite different. Local alignments tend to be desirable, but may be more difficult to compute because of challenges from determining other similar regions. Various computational algorithms have been applied to the sequence alignment problem, including slow but regular optimization methods like dynamic programming, efficient but not thorough heuristic algorithms, or probabilistic methods of large database search design.

Smith-waterman is an algorithm for performing local sequence alignments (as opposed to global alignments) to find regions of similarity between two nucleotide or protein sequences. The objective of this algorithm is not to align the complete sequences, but to find the fragments with high similarity in both sequences.

Hash table, also called hash algorithm, hash function, is a method to create a small digital "fingerprint" from any kind of data. The hash function compresses a message or data into a digest so that the amount of data becomes small, fixing the format of the data. This function mixes the data in a hash, recreating a fingerprint called a hash value (hash sums, or hashes). The hash value is typically represented by a short string of random letters and numbers. Good hash functions rarely have hash collisions in the input domain. In hash tables and data processing, data is distinguished without suppressing conflicts, making database records more difficult to find.

Drawings

FIG. 1 is a flow chart of a data analysis method for detecting genomic structural variation according to the present invention;

FIG. 2 is a detailed flowchart of step S3 according to the present invention;

FIG. 3 is a schematic diagram of a data analysis system according to the present invention;

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The term "comprising", without further limitation, means that the element so defined is not excluded from the group consisting of additional identical elements in the process, method, article, or apparatus that comprises the element.

In order to solve the above problems, the present invention provides a data analysis system and an analysis method for detecting genomic structural variation.

Referring to fig. 1, a flow chart of a data analysis method for detecting genomic structural variation according to the present invention is shown, and the data analysis method mainly includes four main processes: establishing a distributed database, and segmenting and storing structural data of original genome sequencing; sequence alignment of sequencing data; probability calculation for genotyping and classification scoring.

Firstly, in the process of establishing a distributed database, segmenting and storing the structural data of original genome sequencing, the method specifically comprises the following steps:

acquiring structural data of genome sequencing in an original FASTQ file;

based on a characteristic extraction and segmentation principle, segmenting the structural data into a plurality of subdata sequences;

it should be noted that the segmentation and calculation of genome structure data must ensure computational accuracy in a biological sense while regulating system load balancing. For example, before mutation detection, data segmentation is effectively segmented according to biological bases such as chromosome regions and the like, the balance degree of calculation load is adjusted, and the overlapping distance is reasonably set, so that information loss cannot be caused by data segmentation. Data transmission efficiency is improved by performing reasonable data slicing in a high-throughput computing environment.

Specifically, a MapReduce frame is used for correspondingly segmenting structural data, a feature extraction (FX) segmentation method is adopted when the structural data is segmented, the method divides double-chain data of a sequencing sequence according to the length set by Hadoop by rewriting a RecordReader abstract class, sends the double-chain data to different maps for processing, uses the ID of reads of the segmented sequence as an index Key in each Map, uses a chain label, a base sequence and quality as values of Value, assembles the sequences with the same index Key into a sub-data sequence by using the sort function of Reduce after segmentation, calculates the length of each sub-data sequence, and stores the sub-data sequences with different index keys into a plurality of storage servers in a distributed database based on the length of the sub-data sequence.

Because the length of each sub data sequence has a certain difference, when the data is stored, different sub data sequences are distributed and stored to different storage servers according to the principle of average distribution as much as possible, so as to ensure load balance.

In a preferred embodiment, the total length of the plurality of sub data sequences allocated by each storage server is the same.

And judging which subdata sequence of the index Key belongs to the newly added sequence according to the chain label, the base sequence or the quality as a Value, adding the corresponding subdata sequence, and storing the subdata sequence in a storage server in a distributed database.

And finishing the building and storing process of the distributed database after all the sub data sequences are written into the distributed database.

In a preferred embodiment, in the above data structure partitioning and storing manner, if a specific node in the structure data has more associated nodes, a minimum cutting principle may be preferentially adopted, and the relationship between the specific node and the associated nodes is not cut, so that the specific node and the associated nodes and the relationship thereof are all partitioned to the same storage server, thereby ensuring that the data in one server node has relatively strong internal relevance; and has good expansibility and throughput capability.

Next, a sequence alignment of the sequencing data is required, specifically including:

and sending the well-divided sequencing structure data stored on the distributed database to each slave machine of the cluster for processing. In a preferred embodiment, the sequencing structure data is subjected to quality filtering before sequence alignment, reads with lower quality are filtered out so as to remove interference data in the sequencing structure data, and then the filtered sequences are aligned by using an alignment tool.

And correspondingly dividing the reference sequences according to the chromosome set and respectively establishing indexes. When bowtie2 is used for alignment, the sequencing structure sequence is respectively aligned with the reference sequence of each chromosome group to obtain the alignment result of the sequencing structure sequence and each chromosome reference sequence, and the alignment result is stored in the corresponding result list of the distributed database.

The sequence alignment process can obtain a series of statistics including the total number of aligned reads, the number of mismatches, the gap size, etc., which can be used to assess the quality of the data and calibrate its correctness.

In the sequence comparison process, a Smith-Waterman algorithm is used for scoring the sub-data sequences, and the scoring results in batches are integrated to obtain a final comparison result so as to realize the fine comparison of the sequencing structure sequence and the reference genome;

the score was calculated using the following Smith-Waterman formula:

M＝R*C；

R＝a*L+b；

wherein M represents the score calculated by Smith-Waterman, R is the length of the reference sequence, C represents the length of each subdata sequence, L represents the length of data stored by each storage server, and a and b represent constants.

And then, evaluating the similarity of the sequencing structure sequence and the reference sequence, and further obtaining the probability of the possible base type.

In particular, with reference to fig. 2, implementation details are set forth:

step S31, defining active regions, and identifying at least one active region according to the sequence alignment data of the sequencing data;

after alignment with the reference genome, the most likely variant active regions are identified by traversing the sequencing structure data and transmitted to the next step. In the expected noise level range, the region without variation will be skipped directly in the subsequent step to speed up the analysis.

In step S32, by applying the De-Bruijn graph and the hash table, an assembly graph can be constructed for each active region using the reference sequence as a template. And then, reading each section of Reads in sequence to match with one section of the chart, generating all possible paths by dynamically generating nodes, and screening paths with good support to represent possible haplotypes.

At step S33, after obtaining the possible haplotypes of an organism, we need to make an assessment of the reliability of their different haplotypes. In haplotype detection, this step usually uses Pair HMM algorithm (hidden Markov matching algorithm) to realign the sequences and combine the data quality information to obtain the likelihood value of each haplotype.

Step S34, calculating the probability of the base type by adopting Bayes theorem;

specifically, based on the Bayesian total probability formula, the probability that genotype G is a basic type in the case where data D is observed

This can be achieved:

；

wherein, for a common diploid, i is equal to 1, 2, 3, i.e. there is

；

Represents the probability of the base type.

Is the probability of the occurrence of observation data D given genotype G, i.e., the probability of genotype G being a basic type, the value of which is represented by the haploid likelihood value

Substituting the following equation yields:

；

thus we have the probability of each possible base type for each site given the observations.

Finally, classification scoring is performed.

The probability of each base type is a number that represents a particular attribute of a candidate mutation site. The probability sequence values are provided directly to the deep convolutional neural network model.

As shown, the deep convolutional neural network model structure includes a characterization layer of the input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully connected layer for feature integration, and a finally connected classifier.

Inputting the probability sequence into a characterization layer, and extracting an upper convolution layer and a lower convolution layer of the features to give a plurality of vector sequences; the full-connection layer combines a plurality of vector sequences given by the upper convolution layer and the lower convolution layer, namely coding or constructing;

the classification layer generates a classification score for the likelihood of whether each candidate variant at the target base position is a true variant or a false variant based on the fully-connected layer encoding or constructed structure.

In other embodiments, the classification layer may generate a classification score for the likelihood of whether each candidate variant at the target base position is a homozygous variant, a heterozygous variant, a non-variant, or a complex variant.

The deep convolutional neural network model requires training of the classifier according to a larger training set, so that its performance can be further improved.

As shown in fig. 3, the data analysis system for implementing the data analysis method of the present invention includes:

the sequence comparison unit is electrically connected with the distributed database and is used for filtering the quality of a plurality of subdata sequences stored in the distributed database and comparing the filtered subdata sequences;

in a preferred embodiment, the sequence alignment unit further comprises a calibration subunit, configured to evaluate the alignment quality of the subdata sequences and calibrate the correctness thereof;

a probability evaluation unit electrically connected with the sequence comparison unit and used for evaluating the similarity of the subdata sequence and the reference sequence so as to obtain the genotyping probability;

The classification scoring unit outputs classification scores of the variant possibilities by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for feature integration and a classifier connected finally.

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.

Claims

1. A data analysis method for detecting genomic base variation, comprising the steps of:

s31, defining active regions, and identifying at least one active region according to the sequence alignment data of the sequencing data;

s34, using Pair HMM algorithm to realign the sequences, and combining the calculated score M to obtain a likelihood value of each haploid;

s35, calculating the probability of the base type by adopting Bayes theorem;

based on the Bayesian total probability formula, the probability that genotype G is a basic type in the case where data D is observed

Calculated by the following formula:

；

wherein i is equal to 1, 2, 3, i.e. having

；

A probability representing the base type;

is the probability of the occurrence of observed data D given genotype G, the value of which is determined by the haploid likelihood value

Substituting the following equation yields:

；

s4, generating a classification score of the candidate variant possibility of the target base position.

2. The data analysis method of claim 1, wherein: in the step S1, in the above step,

3. The data analysis method of claim 1, wherein: the score was calculated using the following Smith-Waterman formula:

M＝R*C；

R＝a*L+b；

4. The data analysis method of claim 2, wherein: and judging the corresponding index of the newly added sequence according to the chain label, the base sequence or the quality as a value, adding the corresponding subdata sequence, and storing the subdata sequence in a storage server in a distributed database.

5. A data analysis system for implementing the data analysis method of any one of claims 1 to 4, characterized in that: the method comprises the following steps:

6. The data analysis system of claim 5, wherein: the sequence alignment unit also comprises a calibration subunit, which is used for evaluating the alignment quality of the subdata sequences and calibrating the correctness of the subdata sequences.

7. The data analysis system of claim 6, wherein: the classification scoring unit outputs classification scores of the variant possibilities by using a deep convolutional neural network model, wherein the deep convolutional neural network model comprises a characterization layer of an input sequence, an upper convolutional layer and a lower convolutional layer for extracting features, a fully-connected layer for feature integration and a classifier connected finally.