Disclosure of Invention
The present invention is directed to a whole genome association analysis method based on multiple genome comparisons and second-generation sequencing data, so as to solve the problems of the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a method for genome-wide association analysis based on multiple genome comparisons and second-generation sequencing data, the method comprising the steps of:
the method comprises the following steps: comparing the reference genome with the de novo assembled genome file by using comparison software to obtain the co-linearity characteristics among the genomes, mining structural variation sites according to the co-linearity characteristics, inserting the insertion fragments in the de novo assembled genome into corresponding positions of the reference genome, and generating an updated reference genome file;
step two: updating the position or structure of the annotated gene in the reference genome according to the size of the insert and the position of the insert in the reference genome, obtaining the annotated gene of which the gene structure is completely positioned in the insert according to the position of the insert and the gene annotation information in the de novo assembled genome, and updating the annotated gene into the reference genome;
step three: if a plurality of genomes are assembled from the beginning, the reference genome is updated iteratively in turn;
step four: comparing the second-generation sequencing data of the sample to the updated reference genome by using comparison software, extracting the short sequence coverage of the comparison file by using sequence extraction software to generate a coverage file, and judging the structural variation condition of the sample by setting a coverage threshold;
collecting the position information of the demarcation points of all the structural variations of all the samples into a set, dividing the updated reference genome into segments (named as bins) by using the position of the demarcation points, recapturing the existence and the deletion of all the bins in the samples, constructing the genotype of each sample based on the bins, combining the bin genotypes of all the samples, and forming the structural variation genotype of the genome of the colony through the Minimum Allele Frequency (MAF) screening;
step six: and performing whole genome association analysis based on the genotype and phenotype of the structural variation of the genome of the population, and performing functional gene candidate according to the association sites and the updated reference genome annotation file.
As a preferred embodiment of the present invention, the first step is to insert the detected fragments from the genome to be aligned into the corresponding positions of the reference genome, which means fragments larger than 50bp.
As a preferred embodiment of the present invention, the structural variation refers to the presence or absence of a sequence fragment.
And as a preferred technical scheme of the invention, in the fifth step, the position information of the demarcation points of the structural variation of all the samples is collected into a set, the reference genome is divided into bins by the elements in the set, and the bin division is beneficial to the unification of the marker quantity and the marker position of all the samples, so that the establishment of the group genotype is realized.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention relates to a whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data, which realizes whole genome correlation analysis by using structural variation genotype data.
2. The invention innovatively integrates a plurality of genomes, updates the reference genome, and the updated reference genome comprises more fragments without the initial reference genome, thereby providing possibility for subsequent detection of structural variation and excavation of new functional genes. And the updated reference genome is still in a linear form, so that the understanding and the subsequent application are facilitated.
3. According to the invention, the second-generation sequencing is utilized to compare with the updated reference genome, and the structural variation information of the sample is obtained through the coverage degree, so that the efficient and accurate analysis and identification of the structural variation of the genome are realized.
4. The invention creatively constructs the bin map, unifies the quantity and the positions of all the sample structural variation aiming at the condition that the starting position and the ending position of each sample structural variation are different from each other in size, and forms a genotype file for the whole genome association analysis. Since structural variation is ubiquitous and may directly result in changes in gene function, genome wide association analysis using structural variation genotypes may find new association sites relative to genome wide association analysis of SNP genotypes.
5. The present invention innovatively updates the gene annotation of the reference genome, gene candidates can be made by correlating site positions and updated gene annotations, and since the updated gene annotations also contain genes partially from the de novo assembled genome, it is possible to find functional genes that do not exist on the initial reference genome, which cannot be achieved by making gene candidates using the initial reference genome.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-6, the present invention provides a whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data:
A. update of reference genome and its annotations:
A1. the reference genome and a plurality of de novo assembled genome names are saved to a text file and sorted, wherein the first line is the reference genome and the second and subsequent lines are set as de novo assembled genomes.
A2. And (3) comparing the reference genome with the de novo assembled genome file by using comparison software to obtain the co-linear characteristics among the genomes.
A3. And D, mining the structural variation site according to the colinearity characteristics obtained in the step A2 by using extraction software, and generating a structural variation site position information file by taking the reference genome in the step A2 as a position benchmark.
A4. According to the structural variation location information file generated in step A3, the fragments only existing in the de novo assembled genome are inserted into the corresponding locations of the reference genome, and a new reference genome file is generated.
A5. The annotated gene location or structure in the reference genome is updated based on the insert size and the location of the insert into the reference genome.
A6. Obtaining the annotated gene with the gene structure completely positioned in the inserted segment according to the position of the inserted segment and the gene annotation information in the genome assembled from the beginning. Such annotated genes are added to the gene annotation file generated by a5, updating the gene annotation of the reference genome.
A7. If there are multiple de novo assembled genomes, the updated reference genome file in step A4 is used as the reference genome and the gene annotation file generated in step A6 is used as the gene annotation for the reference genome. And (4) combining the de novo assembled genome of the third line and the subsequent line of the A1 text file, and repeating the steps A2-A6 in sequence to finally generate a reference genome file and a gene annotation file thereof after multiple iterations of updating.
B. Whole genome structural variation mining based on updated reference genome and second generation sequencing data:
B1. and B, comparing the second-generation sequencing data of the sample to the updated reference genome in the step A through comparison software to generate a comparison file.
B2. And comparing the short sequence coverage of the file by using sequence extraction software to extract the short sequence coverage of the file, and generating a coverage file. And judging the structural variation condition of the sample by setting a coverage threshold.
C. Construction of the genome structural variation genotype of the population:
C1. all samples were collected for all structural variants of the start and stop site information to remove redundancy, and placed in a set where the site can cut the reference genome into different size fragments (called bin for short, bin is named with bin start position).
C2. And (4) according to the structural variation condition of the sample, recapturing the existence and deletion condition of the bin in the sample, and constructing the genotype of each sample based on the bin.
C3. Bin genotypes of all samples were pooled and screened for Minimal Allele Frequency (MAF) to form population genomic structural variant genotypes.
D. And performing whole genome association analysis by using the genotype and phenotype of the group genome structural variation, and searching candidate genes in the updated reference genome gene annotation file according to the position of the association site.
Note:
1. the invention realizes that the comparison software used in the step A2 is MUMmer, the extraction software used in the step A3 is Assemblytics, the comparison software used in the step B1 is BWA, and the sequence extraction software used in the step B2 is Mosdepth.
MUMmer, assemblies, BWA, Mosdepth are software and have no Chinese name, and are directly expressed in English in the industry.
3. In the step B2, the coverage threshold is set by the user, if the coverage of the second-generation sequencing data alignment of the variant locus is greater than the threshold, it indicates that the sequence alignment covers the variant fragment, and the variant locus exists in the sample, otherwise, it does not exist.
4. In step B2, a minimum deletion (gap) threshold may be set, and if the detected variation is greater than the threshold, the sample will have a variation site at it, otherwise, it will not.
5. Structural variations of the present invention refer to the presence and absence of DNA sequences.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings. The invention comprises three parts in total.
The first embodiment is as follows: rice reference genome and gene annotation update thereof
Using 33 rice genomes already assembled and annotated, with rice nipponica MSU as the initial reference genome, DHX2,02428, Kosh, ZH11, KY131, Lemont, NamRoo, LJ, G46, CN1, FS32, DG, D62, II32, R527, S548, 9311, Y58S, J4155, G8, Y3551, IR64, R498, TM, Tumba, G630, YX1, WSSM, FH838, N22, Basmati1, CG14 were aligned in sequence.
As shown in fig. 2, in the first round of alignment, the reference genome (MSU) and the first de novo assembled genome (DHX2) are aligned using MUMmer software to obtain the co-linear features between the genomes, and generate a delta file;
extracting co-linear characteristics by using assembly software, sorting co-linear characteristic files by using python software to screen the sizes of inserted fragments (>50bp), and obtaining a structural variation position information data file (assembly _ structural _ variants. bed file) taking a reference genome as a position reference, wherein the bed contains compared position information;
the awk is operated to screen the bed file to obtain coords, wherein the bed only contains insertion and deletion fragments;
run the 3 place _ regions _ with _ nucleotidese 1.2.py program retrieve insert sequences based on the bed position and DHX2 genomic sequence;
running grep to screen the bed size, and removing fragments with insertion fragments smaller than 50 bp;
and (5) running a self-programming program 5in, and performing format arrangement on the bed to form a bed2 format file, wherein the file format is as follows: the first column is the chromosome of the insert corresponding to the MSU genome, the second column is the initial position of the insert corresponding to the MSU genome, the third column is the sum of the data of the second column plus the size of the insert, the fourth column is the "insert" tag, the fifth column is the insert base sequence, the sixth column is the DHX2 genome name, the seventh column is the chromosome of the insert on the DHX2 genome, and the eighth column is the initial position of the insert on the DHX2.
Run from the editor 6update _ ref _ by _ nucmer. py, insert the sequence from DHX2 genome into the corresponding position of MSU genome according to MSU genome sequence, DHX2 genome sequence, and alignment result bed2 format file, obtain updated reference genome ref0dhx2. fa.
As shown in fig. 3, all de novo assembled genomes and reference genomes comprise a genome sequence (fa) file and a gene annotation (gff) file. When a reference genome is inserted with a sequence fragment from a de novo assembled genome (PV), screening genes completely located on the PV by obtaining the overlap of the annotated genes with the PV through the position and size of the PV in the de novo assembled genome and the annotation information of the de novo assembled genome genes; and (4) acquiring the overlapping condition of the insertion sites and the genes according to the positions of PV inserted into the reference genome and the reference genome gene annotations, and screening the insertion sites positioned inside the genes. Updating the gene annotation based on the location information using the R language software, including the following cases:
1. if the PV is located in front of the gene position, the position of the gene should be added with the PV size to obtain new position information, so as to ensure that the new gene annotation gff file corresponds to the new reference genome;
2. if the PV is positioned in the gene and the annotation condition of the PV front-segment gene structure is not changed, the PV size is added to the PV rear-segment gene structure annotation to obtain new position information;
3. if the complete gene structure is contained within the PV, the newly introduced gene should be included in the reference genomic gene annotation. After the update, a gene annotation (gff) file corresponding to the updated reference genome is obtained.
Running self-programmer 7bed2_ update _ bed3.r, a bed3 file was obtained from the bed2 file, bed3 is in accordance with the format of bed2, but the first three columns in bed3 are the insertion corresponding to the chromosome and start and end positions in ref0dhx2. fa.
Running from the self-programming 8gff _ update _ by _ bed2info _ parlapply. r, the annotation gene locations are updated from the msu. gff annotation file and the bed2 file, where part of the genes may contain inserts, this step updates only the original location information, obtaining ref0.update1. gff.
Running from the self-programming 9gff _ split _ by _ bed3_6.R, the annotation of the gene interrupted by the insert is updated, the interrupted state is recorded, and ref0.update2.gff is obtained.
Running the self-programming program 10gene _ in _ pv _ from _ gff _ parlapply2.R, annotation information such as gene, mRNA, exon and CDS, which are completely located on the insert, are obtained according to the annotation of the bed3 and DHX2 genes, and the DHX2.gff.in _ pv file is obtained.
Run the self-programming program 11gene _ in _ pv _ screen. py screen dhx2.gff.in _ pv, remove the annotation information that part of the exon or CDS is on the insert but the gene is not completely on the insert. Annotation information that the gene is completely located in the insert is obtained, and a DHX2.gff _ gene _ absolute. in _ pv file is generated.
The self-programming program 12bed3_ update _ gff.r is run to obtain the position of the inserted fragment gene on the updated reference genome ref0dhx2.fa according to the bed3 file and the dhx2.gff _ gene _ absolute.
The ref0.update2.gff and hx2.gene _ absolute _ in _ pv. gff are combined to obtain a gene annotation file ref0dhx2.gff corresponding to ref0dhx2. fa.
The initial reference genome MSU size is 356Mb, and after alignment with DHX2, extraction of co-linear features, insertion of sequences, etc., the updated reference genome size is 359Mb, adding about 3Mb of base sequences. The gene annotation gff increased 87 genes.
And (3) establishing a soft link by using ref0DHX2.fa as ref1.fa, establishing a soft link by using ref0DHX2.gff as ref1.gff, comparing a ref1 reference genome with the next de novo assembled genome 02428, extracting collinear features, inserting sequences and the like to obtain a second updated reference genome and gene annotations thereof, and sequentially and iteratively updating to obtain multiple rounds of iteratively updated reference genomes and gene annotations thereof. In this example, 33 genomes were obtained, and the genomes ref32.fa and ref32.gff were obtained after 32 rounds of updates. The ref32.fa size was 428Mb, 72Mb larger than MSU, ref32.gff contained 58235 genes and 2249 more genes than MSU. The protein sequence file of ref32 was obtained by extracting all protein sequences corresponding to the genome from ref32.fa and ref32.gff using gffred software, and the protein sequence file of MSU was obtained in the same manner. 66153 protein sequences are present on 12 chromosomes of MSU, and a search was made in the protein sequence file of ref32 based on the protein sequence ID of MSU, which indicated that: identical IDs correspond to identical protein sequences in the two protein files. This indicates the accuracy of the gene annotation update.
Example three: mining sequence structure variations based on second generation sequencing data and updated reference genome
As shown in fig. 4, second-generation sequencing data (fq) is mounted on the updated reference genome by using a BWA file, output data is converted into a bam file by using a pipeline and SAMtools software, the bam file is sequenced by using Picad software, repeated operations are removed to obtain a sorted _ add _ dedup.bam file, and sequencing fragments with the comparison quality less than 20, which are not mounted on the reference genome and are matched to multiple positions are removed by using SAMtools software to obtain a screened mapq20.bam file.
The bam dir folder was produced using the self-programming 2Map _ fq _ to _ pan. py to mount all sequencing data under the fqd dir catalog onto the reference genome, containing the mapq20.bam files for all samples. The files of sequencing data fq.gz are placed in the same directory fqd _ dir, and the format of the double-ended sequencing data is unified, in this example, the files are suffix of _1.fq.gz and _2. fq.gz. The command is python32Map _ fq _ to _ Pan. py-fqd fq _ dir-r ref32.fa-br bam _ dir
Utilizing Mosdepth software to extract coverage information of a mapQ20. base file to obtain a per-base.bed.gz file, and in order to reduce subsequent calculation amount, using 20 as a window to calculate the coverage of each 20 base regions, and storing a result file in regions.bed.gz. This example totals 414 second-generation sequencing data, where the average number of rows for all sample per-base.bed.gz files is 2,104,638 rows and the average number of rows for all sample region.bed.gz files is 601,375 rows.
Running a self-programming program 1depth _ call _ pav.py, extracting coverage information from regions.bed.gz, setting a coverage threshold value to be 5, and judging that the region with the coverage less than the threshold value is absent, otherwise, judging that the region with the coverage less than the threshold value is present. Ignoring insertion deletion smaller than 50bp, recording the boundary points of existence and deletion, and recording the existence and deletion information of all sequences in a5 depth-50 bp. This example contained 414 samples, each containing on average 85,823 inserts.
Example three: construction of genome structure variation genotype of population
As shown in fig. 5, R language software is used to extract the information of the existing and missing boundary positions of all samples in the population, remove redundancy, and put the samples into the set. All junctions divide the reference genome into fragment intervals (called bins) of several sizes, each bin is named using the chromosome plus the left position of the fragment, and the presence and deletion state of each bin in each sample is analyzed one by one from per-base. The existence of deletion state of the population is recorded into a matrix form by using R language software, one bin is a genetic marker, and each row in the bin map represents one sample and each column represents one bin marker.
Running the self-programming 2merge _ samples _ pav _ to _ population.r, all the tagged location files all _ chr _ pos _ frame.txt and bin map files can be output. Txt contains 6,270,114 flags for all _ chr _ pos _ frame in this embodiment.
Run the self-programming 3pav _ genotype _ to _ hmp. py, screen each column of the bin map, remove bin marker genotypes all as deletion or all as presence columns, and screen for Minimum Allele Frequency (MAF) to convert to hapmap format of genomic structure variant genotypes of the population. In this example, 414 international rice resources were included, and the genotype after selection included 4,756,248 bin markers.
Example four: whole genome association analysis
Whole genome association analysis was performed using Gapit software.
Population SNP genotypes were obtained using initial reference genome and sequencing data using BWA and GATK software, and whole genome association analysis was performed using Gapit software.
As shown in FIG. 6, by comparing the results of genome-wide association analysis between the structural variant genotype and the SNP genotype, it was found that the association site found by the SNP genotype was almost all found by the structural variant genotype and that the structural variant genotype was also able to find a new association site. The updated gene annotation also brings new genes, so that the defect that the number of the genes for initial reference genome annotation is limited is overcome.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.