CN113628685A - Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data - Google Patents

Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data Download PDF

Info

Publication number
CN113628685A
CN113628685A CN202110849440.4A CN202110849440A CN113628685A CN 113628685 A CN113628685 A CN 113628685A CN 202110849440 A CN202110849440 A CN 202110849440A CN 113628685 A CN113628685 A CN 113628685A
Authority
CN
China
Prior art keywords
genome
reference genome
file
gene
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110849440.4A
Other languages
Chinese (zh)
Other versions
CN113628685B (en
Inventor
王健
赵均良
杨武
李方平
刘斌
董景芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rice Research Institute Guangdong Academy Of Agricultural Sciences
Rice Research Institute of Guangdong Academy of Agricultural Sciences
Original Assignee
Rice Research Institute Guangdong Academy Of Agricultural Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rice Research Institute Guangdong Academy Of Agricultural Sciences filed Critical Rice Research Institute Guangdong Academy Of Agricultural Sciences
Priority to CN202110849440.4A priority Critical patent/CN113628685B/en
Publication of CN113628685A publication Critical patent/CN113628685A/en
Application granted granted Critical
Publication of CN113628685B publication Critical patent/CN113628685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10Ontologies; Annotations

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data, which comprises the following steps: and (3) comparing the reference genome with the de novo assembled genome file by using comparison software, and performing the second step: updating the position or structure of the annotated gene in the reference genome according to the size of the insert and the position of the insert in the reference genome, and performing the third step: if a plurality of genomes are assembled from the beginning, the reference genome is updated iteratively in turn; step four: comparing the second-generation sequencing data of the samples to the updated reference genome by using comparison software, and collecting the position information of the demarcation points of all structural variations of all the samples into a set to construct a group genotype, wherein the sixth step is as follows: according to the method, the functional genes are selected according to the associated loci and the updated reference genome annotation file, and the whole genome association analysis is performed by using structural variation genotype data.

Description

Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data
Technical Field
The invention relates to a reference genome based on multiple genome comparisons and annotation updating thereof, mining of sequence structure variation based on second-generation sequencing data, group genotype construction based on multi-sample genome structure variation and a whole genome association analysis method based on structure variation genotype, in particular to a whole genome association analysis method based on multiple genome comparisons and second-generation sequencing data, and belongs to the technical field of biological information.
Background
The genome-wide association analysis is to detect the genetic variation polymorphism of a plurality of individuals in the genome-wide range to obtain the genotype, further to carry out the statistical analysis of the group level of the genotype and the observable phenotype, and to mine the gene related to the character variation. The genetic variation forms in the whole genome range are various, and there are base substitution variation, insertion, deletion, inversion translocation and the like. At present, with the cost reduction of the second-generation sequencing, researches for excavating genome genetic variation and constructing a group genotype for whole genome correlation analysis based on the second-generation sequencing technology are more and more. The method for mining the genetic variation of the genome based on the second-generation sequencing data needs a reference genome and sequencing data, and can be improved from the three aspects of the reference genome, sample sequencing data and a genetic variation detection method in order to obtain richer and more accurate genetic variation: 1. in order to obtain a more favorable reference genome, researchers have proposed using a pan-genome as the reference genome, which refers to the sum of all genomic sequences in a population. By capturing and presenting all genome variation in the population, a complete reference genome is provided for functional genomics research. However, the existing strategy for constructing the pan-genome has many defects, the organization form of the pan-genome cannot be unified, and the genome variation is diverse, which indicates that the pan-genome may not be linear, researchers propose a genome based on graph theory, but the pan-genome in a multidimensional graph form is not easy to understand, and the existing classical analysis software such as BWA (bwe-max) for comparing second-generation sequencing data to a reference genome is only suitable for a linear reference genome; 2. in the aspect of whole genome sequencing data, compared with the third-generation sequencing technology, the Hi-C technology and the like, the second-generation sequencing technology has advantages in price and accuracy and can provide effective sequencing data for mining the whole genome structural variation of a population; 3. at present, methods for detecting structural variation by utilizing next-generation sequencing data and linear reference genomes are various and comprise a sequencing coverage depth method, a paired double-end sequencing method, a sequence read length segmentation method and an assembly method, and the methods have characteristics and have the coexistence of advantages and disadvantages. The invention innovatively utilizes a new strategy to re-integrate the structural variation obtained by genome comparison on a reference genome, iterates for multiple times to obtain an updated reference genome and annotations thereof, uses comparison software to mount second-generation sequencing data on the reference genome, re-excavates sequence structural variation according to the coverage depth and constructs a genome structural variation genotype for whole genome association analysis.
Disclosure of Invention
The present invention is directed to a whole genome association analysis method based on multiple genome comparisons and second-generation sequencing data, so as to solve the problems of the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a method for genome-wide association analysis based on multiple genome comparisons and second-generation sequencing data, the method comprising the steps of:
the method comprises the following steps: comparing the reference genome with the de novo assembled genome file by using comparison software to obtain the co-linearity characteristics among the genomes, mining structural variation sites according to the co-linearity characteristics, inserting the insertion fragments in the de novo assembled genome into corresponding positions of the reference genome, and generating an updated reference genome file;
step two: updating the position or structure of the annotated gene in the reference genome according to the size of the insert and the position of the insert in the reference genome, obtaining the annotated gene of which the gene structure is completely positioned in the insert according to the position of the insert and the gene annotation information in the de novo assembled genome, and updating the annotated gene into the reference genome;
step three: if a plurality of genomes are assembled from the beginning, the reference genome is updated iteratively in turn;
step four: comparing the second-generation sequencing data of the sample to the updated reference genome by using comparison software, extracting the short sequence coverage of the comparison file by using sequence extraction software to generate a coverage file, and judging the structural variation condition of the sample by setting a coverage threshold;
collecting the position information of the demarcation points of all the structural variations of all the samples into a set, dividing the updated reference genome into segments (named as bins) by using the position of the demarcation points, recapturing the existence and the deletion of all the bins in the samples, constructing the genotype of each sample based on the bins, combining the bin genotypes of all the samples, and forming the structural variation genotype of the genome of the colony through the Minimum Allele Frequency (MAF) screening;
step six: and performing whole genome association analysis based on the genotype and phenotype of the structural variation of the genome of the population, and performing functional gene candidate according to the association sites and the updated reference genome annotation file.
As a preferred embodiment of the present invention, the first step is to insert the detected fragments from the genome to be aligned into the corresponding positions of the reference genome, which means fragments larger than 50bp.
As a preferred embodiment of the present invention, the structural variation refers to the presence or absence of a sequence fragment.
And as a preferred technical scheme of the invention, in the fifth step, the position information of the demarcation points of the structural variation of all the samples is collected into a set, the reference genome is divided into bins by the elements in the set, and the bin division is beneficial to the unification of the marker quantity and the marker position of all the samples, so that the establishment of the group genotype is realized.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention relates to a whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data, which realizes whole genome correlation analysis by using structural variation genotype data.
2. The invention innovatively integrates a plurality of genomes, updates the reference genome, and the updated reference genome comprises more fragments without the initial reference genome, thereby providing possibility for subsequent detection of structural variation and excavation of new functional genes. And the updated reference genome is still in a linear form, so that the understanding and the subsequent application are facilitated.
3. According to the invention, the second-generation sequencing is utilized to compare with the updated reference genome, and the structural variation information of the sample is obtained through the coverage degree, so that the efficient and accurate analysis and identification of the structural variation of the genome are realized.
4. The invention creatively constructs the bin map, unifies the quantity and the positions of all the sample structural variation aiming at the condition that the starting position and the ending position of each sample structural variation are different from each other in size, and forms a genotype file for the whole genome association analysis. Since structural variation is ubiquitous and may directly result in changes in gene function, genome wide association analysis using structural variation genotypes may find new association sites relative to genome wide association analysis of SNP genotypes.
5. The present invention innovatively updates the gene annotation of the reference genome, gene candidates can be made by correlating site positions and updated gene annotations, and since the updated gene annotations also contain genes partially from the de novo assembled genome, it is possible to find functional genes that do not exist on the initial reference genome, which cannot be achieved by making gene candidates using the initial reference genome.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic representation of an update of a reference genome of the present invention;
FIG. 3 is a schematic representation of the updating of the reference genomic gene annotation of the present invention;
FIG. 4 is a new schematic diagram of the detection of structural variation of a single sample according to the present invention;
FIG. 5 is a schematic diagram of the population genotype construction of the present invention;
FIG. 6 shows the rice disease-resistant genome-wide association analysis of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-6, the present invention provides a whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data:
A. update of reference genome and its annotations:
A1. the reference genome and a plurality of de novo assembled genome names are saved to a text file and sorted, wherein the first line is the reference genome and the second and subsequent lines are set as de novo assembled genomes.
A2. And (3) comparing the reference genome with the de novo assembled genome file by using comparison software to obtain the co-linear characteristics among the genomes.
A3. And D, mining the structural variation site according to the colinearity characteristics obtained in the step A2 by using extraction software, and generating a structural variation site position information file by taking the reference genome in the step A2 as a position benchmark.
A4. According to the structural variation location information file generated in step A3, the fragments only existing in the de novo assembled genome are inserted into the corresponding locations of the reference genome, and a new reference genome file is generated.
A5. The annotated gene location or structure in the reference genome is updated based on the insert size and the location of the insert into the reference genome.
A6. Obtaining the annotated gene with the gene structure completely positioned in the inserted segment according to the position of the inserted segment and the gene annotation information in the genome assembled from the beginning. Such annotated genes are added to the gene annotation file generated by a5, updating the gene annotation of the reference genome.
A7. If there are multiple de novo assembled genomes, the updated reference genome file in step A4 is used as the reference genome and the gene annotation file generated in step A6 is used as the gene annotation for the reference genome. And (4) combining the de novo assembled genome of the third line and the subsequent line of the A1 text file, and repeating the steps A2-A6 in sequence to finally generate a reference genome file and a gene annotation file thereof after multiple iterations of updating.
B. Whole genome structural variation mining based on updated reference genome and second generation sequencing data:
B1. and B, comparing the second-generation sequencing data of the sample to the updated reference genome in the step A through comparison software to generate a comparison file.
B2. And comparing the short sequence coverage of the file by using sequence extraction software to extract the short sequence coverage of the file, and generating a coverage file. And judging the structural variation condition of the sample by setting a coverage threshold.
C. Construction of the genome structural variation genotype of the population:
C1. all samples were collected for all structural variants of the start and stop site information to remove redundancy, and placed in a set where the site can cut the reference genome into different size fragments (called bin for short, bin is named with bin start position).
C2. And (4) according to the structural variation condition of the sample, recapturing the existence and deletion condition of the bin in the sample, and constructing the genotype of each sample based on the bin.
C3. Bin genotypes of all samples were pooled and screened for Minimal Allele Frequency (MAF) to form population genomic structural variant genotypes.
D. And performing whole genome association analysis by using the genotype and phenotype of the group genome structural variation, and searching candidate genes in the updated reference genome gene annotation file according to the position of the association site.
Note:
1. the invention realizes that the comparison software used in the step A2 is MUMmer, the extraction software used in the step A3 is Assemblytics, the comparison software used in the step B1 is BWA, and the sequence extraction software used in the step B2 is Mosdepth.
MUMmer, assemblies, BWA, Mosdepth are software and have no Chinese name, and are directly expressed in English in the industry.
3. In the step B2, the coverage threshold is set by the user, if the coverage of the second-generation sequencing data alignment of the variant locus is greater than the threshold, it indicates that the sequence alignment covers the variant fragment, and the variant locus exists in the sample, otherwise, it does not exist.
4. In step B2, a minimum deletion (gap) threshold may be set, and if the detected variation is greater than the threshold, the sample will have a variation site at it, otherwise, it will not.
5. Structural variations of the present invention refer to the presence and absence of DNA sequences.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings. The invention comprises three parts in total.
The first embodiment is as follows: rice reference genome and gene annotation update thereof
Using 33 rice genomes already assembled and annotated, with rice nipponica MSU as the initial reference genome, DHX2,02428, Kosh, ZH11, KY131, Lemont, NamRoo, LJ, G46, CN1, FS32, DG, D62, II32, R527, S548, 9311, Y58S, J4155, G8, Y3551, IR64, R498, TM, Tumba, G630, YX1, WSSM, FH838, N22, Basmati1, CG14 were aligned in sequence.
As shown in fig. 2, in the first round of alignment, the reference genome (MSU) and the first de novo assembled genome (DHX2) are aligned using MUMmer software to obtain the co-linear features between the genomes, and generate a delta file;
extracting co-linear characteristics by using assembly software, sorting co-linear characteristic files by using python software to screen the sizes of inserted fragments (>50bp), and obtaining a structural variation position information data file (assembly _ structural _ variants. bed file) taking a reference genome as a position reference, wherein the bed contains compared position information;
the awk is operated to screen the bed file to obtain coords, wherein the bed only contains insertion and deletion fragments;
run the 3 place _ regions _ with _ nucleotidese 1.2.py program retrieve insert sequences based on the bed position and DHX2 genomic sequence;
running grep to screen the bed size, and removing fragments with insertion fragments smaller than 50 bp;
and (5) running a self-programming program 5in, and performing format arrangement on the bed to form a bed2 format file, wherein the file format is as follows: the first column is the chromosome of the insert corresponding to the MSU genome, the second column is the initial position of the insert corresponding to the MSU genome, the third column is the sum of the data of the second column plus the size of the insert, the fourth column is the "insert" tag, the fifth column is the insert base sequence, the sixth column is the DHX2 genome name, the seventh column is the chromosome of the insert on the DHX2 genome, and the eighth column is the initial position of the insert on the DHX2.
Run from the editor 6update _ ref _ by _ nucmer. py, insert the sequence from DHX2 genome into the corresponding position of MSU genome according to MSU genome sequence, DHX2 genome sequence, and alignment result bed2 format file, obtain updated reference genome ref0dhx2. fa.
As shown in fig. 3, all de novo assembled genomes and reference genomes comprise a genome sequence (fa) file and a gene annotation (gff) file. When a reference genome is inserted with a sequence fragment from a de novo assembled genome (PV), screening genes completely located on the PV by obtaining the overlap of the annotated genes with the PV through the position and size of the PV in the de novo assembled genome and the annotation information of the de novo assembled genome genes; and (4) acquiring the overlapping condition of the insertion sites and the genes according to the positions of PV inserted into the reference genome and the reference genome gene annotations, and screening the insertion sites positioned inside the genes. Updating the gene annotation based on the location information using the R language software, including the following cases:
1. if the PV is located in front of the gene position, the position of the gene should be added with the PV size to obtain new position information, so as to ensure that the new gene annotation gff file corresponds to the new reference genome;
2. if the PV is positioned in the gene and the annotation condition of the PV front-segment gene structure is not changed, the PV size is added to the PV rear-segment gene structure annotation to obtain new position information;
3. if the complete gene structure is contained within the PV, the newly introduced gene should be included in the reference genomic gene annotation. After the update, a gene annotation (gff) file corresponding to the updated reference genome is obtained.
Running self-programmer 7bed2_ update _ bed3.r, a bed3 file was obtained from the bed2 file, bed3 is in accordance with the format of bed2, but the first three columns in bed3 are the insertion corresponding to the chromosome and start and end positions in ref0dhx2. fa.
Running from the self-programming 8gff _ update _ by _ bed2info _ parlapply. r, the annotation gene locations are updated from the msu. gff annotation file and the bed2 file, where part of the genes may contain inserts, this step updates only the original location information, obtaining ref0.update1. gff.
Running from the self-programming 9gff _ split _ by _ bed3_6.R, the annotation of the gene interrupted by the insert is updated, the interrupted state is recorded, and ref0.update2.gff is obtained.
Running the self-programming program 10gene _ in _ pv _ from _ gff _ parlapply2.R, annotation information such as gene, mRNA, exon and CDS, which are completely located on the insert, are obtained according to the annotation of the bed3 and DHX2 genes, and the DHX2.gff.in _ pv file is obtained.
Run the self-programming program 11gene _ in _ pv _ screen. py screen dhx2.gff.in _ pv, remove the annotation information that part of the exon or CDS is on the insert but the gene is not completely on the insert. Annotation information that the gene is completely located in the insert is obtained, and a DHX2.gff _ gene _ absolute. in _ pv file is generated.
The self-programming program 12bed3_ update _ gff.r is run to obtain the position of the inserted fragment gene on the updated reference genome ref0dhx2.fa according to the bed3 file and the dhx2.gff _ gene _ absolute.
The ref0.update2.gff and hx2.gene _ absolute _ in _ pv. gff are combined to obtain a gene annotation file ref0dhx2.gff corresponding to ref0dhx2. fa.
The initial reference genome MSU size is 356Mb, and after alignment with DHX2, extraction of co-linear features, insertion of sequences, etc., the updated reference genome size is 359Mb, adding about 3Mb of base sequences. The gene annotation gff increased 87 genes.
And (3) establishing a soft link by using ref0DHX2.fa as ref1.fa, establishing a soft link by using ref0DHX2.gff as ref1.gff, comparing a ref1 reference genome with the next de novo assembled genome 02428, extracting collinear features, inserting sequences and the like to obtain a second updated reference genome and gene annotations thereof, and sequentially and iteratively updating to obtain multiple rounds of iteratively updated reference genomes and gene annotations thereof. In this example, 33 genomes were obtained, and the genomes ref32.fa and ref32.gff were obtained after 32 rounds of updates. The ref32.fa size was 428Mb, 72Mb larger than MSU, ref32.gff contained 58235 genes and 2249 more genes than MSU. The protein sequence file of ref32 was obtained by extracting all protein sequences corresponding to the genome from ref32.fa and ref32.gff using gffred software, and the protein sequence file of MSU was obtained in the same manner. 66153 protein sequences are present on 12 chromosomes of MSU, and a search was made in the protein sequence file of ref32 based on the protein sequence ID of MSU, which indicated that: identical IDs correspond to identical protein sequences in the two protein files. This indicates the accuracy of the gene annotation update.
Example three: mining sequence structure variations based on second generation sequencing data and updated reference genome
As shown in fig. 4, second-generation sequencing data (fq) is mounted on the updated reference genome by using a BWA file, output data is converted into a bam file by using a pipeline and SAMtools software, the bam file is sequenced by using Picad software, repeated operations are removed to obtain a sorted _ add _ dedup.bam file, and sequencing fragments with the comparison quality less than 20, which are not mounted on the reference genome and are matched to multiple positions are removed by using SAMtools software to obtain a screened mapq20.bam file.
The bam dir folder was produced using the self-programming 2Map _ fq _ to _ pan. py to mount all sequencing data under the fqd dir catalog onto the reference genome, containing the mapq20.bam files for all samples. The files of sequencing data fq.gz are placed in the same directory fqd _ dir, and the format of the double-ended sequencing data is unified, in this example, the files are suffix of _1.fq.gz and _2. fq.gz. The command is python32Map _ fq _ to _ Pan. py-fqd fq _ dir-r ref32.fa-br bam _ dir
Utilizing Mosdepth software to extract coverage information of a mapQ20. base file to obtain a per-base.bed.gz file, and in order to reduce subsequent calculation amount, using 20 as a window to calculate the coverage of each 20 base regions, and storing a result file in regions.bed.gz. This example totals 414 second-generation sequencing data, where the average number of rows for all sample per-base.bed.gz files is 2,104,638 rows and the average number of rows for all sample region.bed.gz files is 601,375 rows.
Running a self-programming program 1depth _ call _ pav.py, extracting coverage information from regions.bed.gz, setting a coverage threshold value to be 5, and judging that the region with the coverage less than the threshold value is absent, otherwise, judging that the region with the coverage less than the threshold value is present. Ignoring insertion deletion smaller than 50bp, recording the boundary points of existence and deletion, and recording the existence and deletion information of all sequences in a5 depth-50 bp. This example contained 414 samples, each containing on average 85,823 inserts.
Example three: construction of genome structure variation genotype of population
As shown in fig. 5, R language software is used to extract the information of the existing and missing boundary positions of all samples in the population, remove redundancy, and put the samples into the set. All junctions divide the reference genome into fragment intervals (called bins) of several sizes, each bin is named using the chromosome plus the left position of the fragment, and the presence and deletion state of each bin in each sample is analyzed one by one from per-base. The existence of deletion state of the population is recorded into a matrix form by using R language software, one bin is a genetic marker, and each row in the bin map represents one sample and each column represents one bin marker.
Running the self-programming 2merge _ samples _ pav _ to _ population.r, all the tagged location files all _ chr _ pos _ frame.txt and bin map files can be output. Txt contains 6,270,114 flags for all _ chr _ pos _ frame in this embodiment.
Run the self-programming 3pav _ genotype _ to _ hmp. py, screen each column of the bin map, remove bin marker genotypes all as deletion or all as presence columns, and screen for Minimum Allele Frequency (MAF) to convert to hapmap format of genomic structure variant genotypes of the population. In this example, 414 international rice resources were included, and the genotype after selection included 4,756,248 bin markers.
Example four: whole genome association analysis
Whole genome association analysis was performed using Gapit software.
Population SNP genotypes were obtained using initial reference genome and sequencing data using BWA and GATK software, and whole genome association analysis was performed using Gapit software.
As shown in FIG. 6, by comparing the results of genome-wide association analysis between the structural variant genotype and the SNP genotype, it was found that the association site found by the SNP genotype was almost all found by the structural variant genotype and that the structural variant genotype was also able to find a new association site. The updated gene annotation also brings new genes, so that the defect that the number of the genes for initial reference genome annotation is limited is overcome.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A method for genome-wide association analysis based on multiple genome comparisons and second-generation sequencing data, the method comprising the steps of:
the method comprises the following steps: comparing the reference genome with the de novo assembled genome file by using comparison software to obtain the co-linearity characteristics among the genomes, mining structural variation sites according to the co-linearity characteristics, inserting the insertion fragments in the de novo assembled genome into corresponding positions of the reference genome, and generating an updated reference genome file;
step two: updating the position or structure of the annotated gene in the reference genome according to the size of the insert and the position of the insert in the reference genome, obtaining the annotated gene of which the gene structure is completely positioned in the insert according to the position of the insert and the gene annotation information in the de novo assembled genome, and updating the annotated gene into the reference genome;
step three: if a plurality of genomes are assembled from the beginning, the reference genome is updated iteratively in turn;
step four: comparing the second-generation sequencing data of the sample to the updated reference genome by using comparison software, extracting the short sequence coverage of the comparison file by using sequence extraction software to generate a coverage file, and judging the structural variation condition of the sample by setting a coverage threshold;
collecting the position information of the demarcation points of all the structural variations of all the samples into a set, dividing the updated reference genome into segments (named as bins) by using the position of the demarcation points, recapturing the existence and the deletion of all the bins in the samples, constructing the genotype of each sample based on the bins, combining the bin genotypes of all the samples, and forming the structural variation genotype of the genome of the colony through the Minimum Allele Frequency (MAF) screening;
step six: and performing whole genome association analysis based on the genotype and phenotype of the structural variation of the genome of the population, and performing functional gene candidate according to the association sites and the updated reference genome annotation file.
2. The method of claim 1, wherein the genome-wide association analysis is based on multiple genome comparisons and second-generation sequencing data, and further comprises: step one, inserting the detected fragments from the genome to be aligned into the corresponding positions of the reference genome, which means the fragments larger than 50bp.
3. The method of claim 1, wherein the genome-wide association analysis is based on multiple genome comparisons and second-generation sequencing data, and further comprises: the structural variation refers to the presence and absence of sequence fragments.
4. The method of claim 1, wherein the genome-wide association analysis is based on multiple genome comparisons and second-generation sequencing data, and further comprises: and fifthly, collecting the position information of the demarcation points of the structural variation of all the samples into a set, dividing the reference genome into bins by the elements in the set, wherein the bin division is favorable for unifying the marker quantity and the position of all the samples, thereby realizing the establishment of the group genotype.
CN202110849440.4A 2021-07-27 2021-07-27 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data Active CN113628685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110849440.4A CN113628685B (en) 2021-07-27 2021-07-27 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110849440.4A CN113628685B (en) 2021-07-27 2021-07-27 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data

Publications (2)

Publication Number Publication Date
CN113628685A true CN113628685A (en) 2021-11-09
CN113628685B CN113628685B (en) 2022-03-15

Family

ID=78381061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110849440.4A Active CN113628685B (en) 2021-07-27 2021-07-27 Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data

Country Status (1)

Country Link
CN (1) CN113628685B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115101124A (en) * 2022-08-24 2022-09-23 天津诺禾致源生物信息科技有限公司 Whole genome allele identification method and device
CN116606942A (en) * 2023-07-19 2023-08-18 浙江大学海南研究院 Method for detecting genomic structural variation of livestock and poultry based on liquid phase chip technology
CN118116451A (en) * 2024-03-13 2024-05-31 广东省农业科学院水稻研究所 InDel background molecular marker design method based on resequencing and application
CN118380045A (en) * 2024-06-21 2024-07-23 安诺优达基因科技(北京)有限公司 Method and device for detecting genetic variation information

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150205913A1 (en) * 2011-12-02 2015-07-23 Bgi Tech Solutions Co., Ltd. Parental genome assembly method
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
WO2019138244A1 (en) * 2018-01-12 2019-07-18 John Innes Centre Method for identifying genes associated with a particular phenotype
CN110189796A (en) * 2019-05-27 2019-08-30 新疆农业大学 A kind of sheep full-length genome resurveys sequence analysis method
WO2020095035A1 (en) * 2018-11-05 2020-05-14 Earlham Institute Genomic analysis
US20210074379A1 (en) * 2017-12-12 2021-03-11 Sophia Genetics Sa Methods for detecting variants in next-generation sequencing genomic data
WO2021045947A1 (en) * 2019-09-05 2021-03-11 Illumina, Inc. Methods and systems for diagnosing from whole genome sequencing data
US20210193260A1 (en) * 2013-08-30 2021-06-24 Personalis, Inc. Methods and systems for genomic analysis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150205913A1 (en) * 2011-12-02 2015-07-23 Bgi Tech Solutions Co., Ltd. Parental genome assembly method
US20210193260A1 (en) * 2013-08-30 2021-06-24 Personalis, Inc. Methods and systems for genomic analysis
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
US20210074379A1 (en) * 2017-12-12 2021-03-11 Sophia Genetics Sa Methods for detecting variants in next-generation sequencing genomic data
WO2019138244A1 (en) * 2018-01-12 2019-07-18 John Innes Centre Method for identifying genes associated with a particular phenotype
WO2020095035A1 (en) * 2018-11-05 2020-05-14 Earlham Institute Genomic analysis
CN110189796A (en) * 2019-05-27 2019-08-30 新疆农业大学 A kind of sheep full-length genome resurveys sequence analysis method
WO2021045947A1 (en) * 2019-09-05 2021-03-11 Illumina, Inc. Methods and systems for diagnosing from whole genome sequencing data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
赵均良等: "《泛基因组及其在植物功能基因组学研究中的应用》", <植物遗传资源学报> *
黄平仙等: "基于全基因组重测序技术分析甜菜InDel标记", 《中国糖料》 *
黄莹等: "BIG-Annotator:基因组测序数据高效功能注释及其在遗传诊断中的应用", 《遗传》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115101124A (en) * 2022-08-24 2022-09-23 天津诺禾致源生物信息科技有限公司 Whole genome allele identification method and device
CN115101124B (en) * 2022-08-24 2022-11-22 天津诺禾致源生物信息科技有限公司 Whole genome allele identification method and device
CN116606942A (en) * 2023-07-19 2023-08-18 浙江大学海南研究院 Method for detecting genomic structural variation of livestock and poultry based on liquid phase chip technology
CN118116451A (en) * 2024-03-13 2024-05-31 广东省农业科学院水稻研究所 InDel background molecular marker design method based on resequencing and application
CN118380045A (en) * 2024-06-21 2024-07-23 安诺优达基因科技(北京)有限公司 Method and device for detecting genetic variation information

Also Published As

Publication number Publication date
CN113628685B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN113628685B (en) Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data
Kim et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
De Coster et al. Towards population-scale long-read sequencing
Nelson et al. McClintock: an integrated pipeline for detecting transposable element insertions in whole-genome shotgun sequencing data
Li et al. A survey of sequence alignment algorithms for next-generation sequencing
Krawitz et al. Microindel detection in short-read sequence data
Cunningham et al. Ensembl 2015
Yang et al. Reptile: representative tiling for short read error correction
Yao et al. Graph accordance of next-generation sequence assemblies
Sun et al. SHOREmap v3. 0: fast and accurate identification of causal mutations from forward genetic screens
Fu et al. IDP-denovo: de novo transcriptome assembly and isoform annotation by hybrid sequencing
Kehr et al. PopIns: population-scale detection of novel sequence insertions
Hills et al. BAIT: Organizing genomes and mapping rearrangements in single cells
Huang et al. A novel multi-alignment pipeline for high-throughput sequencing data
Cao et al. Sequencing technologies and tools for short tandem repeat variation detection
Holtgrewe et al. Methods for the detection and assembly of novel sequence in high-throughput sequencing data
KR20140006846A (en) Data analysis of dna sequences
Kim et al. Hisat-genotype: Next generation genomic analysis platform on a personal computer
Glusman et al. Identification of copy number variants in whole-genome data using Reference Coverage Profiles
Barbieri et al. Proteogenomics: key driver for clinical discovery and personalized medicine
Normand et al. An introduction to high-throughput sequencing experiments: design and bioinformatics analysis
Gauthier et al. DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics
Arredondo-Alonso et al. A high-throughput multiplexing and selection strategy to complete bacterial genomes
Anastasiadi et al. Bioinformatic analysis for age prediction using epigenetic clocks: Application to fisheries management and conservation biology
CN108256291A (en) It is a kind of to generate the method with higher confidence level detection in Gene Mutation result

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant