CN112164424B

CN112164424B - Population evolution analysis method without a reference genome

Info

Publication number: CN112164424B
Application number: CN202010768331.5A
Authority: CN
Inventors: 刘书云; 张海焕; 姜丽荣; 孙子奎
Original assignee: Nanjing Personal Gene Technology Co ltd
Current assignee: Nanjing Personal Gene Technology Co ltd
Priority date: 2020-08-03
Filing date: 2020-08-03
Publication date: 2024-04-09
Anticipated expiration: 2040-08-03
Also published as: CN112164424A

Abstract

The invention discloses a population evolution analysis method based on 2d-RAD sequencing without a reference genome. The method comprises splitting the sample data, filtering and clustering it to obtain population SNPs, analyzing population genetic parameters based on the sample grouping and the population SNP information, constructing a phylogenetic tree, determining the optimal K value, and then using a custom R script to search for SNPs shared between, and specific to, two populations according to the population SNP information and the specified grouping information, thereby performing population evolution analysis without a reference genome. The overall data analysis is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and yields more presentable charts.

Description

Population evolution analysis method without a reference genome

Technical Field

The invention relates to the technical field of gene sequencing analysis, and in particular to a population evolution analysis method that does not require a reference genome.

Background

Population evolution analysis makes it possible to explore in depth the differences in population structure and the gene flow between subpopulations of the same species, and to study the population structure characteristics of different species. However, reference genomes have not yet been published for many species, so population evolution analysis must be performed without a reference genome.

There are several library construction methods for reference-free analysis (RAD, GBS, 2d-RAD, SLAF, etc.), and they differ in the first step of the analysis, data splitting. The existing reference-free analysis method based on 2d-RAD libraries has a complex data filtering workflow and low efficiency, especially when there are many projects and each project contains many samples. In practice, one project may be sequenced in several runs, producing several batches of data. The existing method cannot merge and filter these batches through an automated workflow, so a great deal of manual time is spent on merging and filtering data.

With the continuous development of high-throughput sequencing, the existing analysis workflow has come to look thin in content, while the new reference-free analysis is more diverse and customizable. Previously, many points in the reference-free analysis workflow required manual operation. The new method is more automated: the automated workflow improves server utilization, reduces the workload of the analyst, and makes the analysis content easier to control.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, the present invention aims to provide an automated method for population evolution analysis without a reference genome.

In order to achieve the purpose of the invention, the technical scheme adopted is as follows:

a population evolution analysis method based on 2d-RAD sequencing without reference genome, comprising the steps of:

the first step: according to the barcode and the enzyme 1 and enzyme 2 cleavage site information of the sequenced samples, the data are split using a splitting script; the data from multiple sequencing runs of the same sample are merged, and the merged data are stored in fastq.gz format in a first folder;

the second step: the split and merged data from the first step undergo FastQC quality control through a filtering script, and are then filtered by base quality value and sequence length, using the criteria Q ≥ 20 and sequence length ≥ 50 bp; the filtered data are stored in fastq.gz format in a second folder;

the third step: sequences are first clustered within each single sample; before clustering, the paired-end sequencing data of the sample are merged into one file, and clustering is then performed with the ustacks command of the software Stacks to obtain the representative category sequences of each sample; the result files are stored in tags.tsv.gz format in a third folder;

the fourth step: after the samples are grouped, clustering is performed on the category sequences of the single samples to obtain the consensus sequences of all samples, which serve as a reference-like genome sequence for all samples;

the fifth step: the grouping information specified for each sample is read, a missing rate parameter is specified, the population SNP information is detected with the cstacks command of the software Stacks, and the population SNP information is stored as a VCF file;

the sixth step: based on the population SNP information from the fifth step, the population genetic parameters are analyzed with the populations command of Stacks, yielding the population differentiation index Fst, the nucleotide diversity pi, the expected and observed heterozygosity, and the haplotype diversity;

the seventh step: the VCF file of population SNP information from the fifth step is converted in format with the software vcftools and plink; dimensionality reduction is performed on the SNPs with the software GCTA to obtain the three principal components with the greatest influence on the population; the contribution of each principal component is calculated; and finally a PCA distribution plot is drawn with a custom R script;

the eighth step: a custom Python script joins the obtained population SNP information with the SNP information of each single sample and converts the format, and a phylogenetic tree is then constructed using different models;

the ninth step:

a custom Perl script converts the population SNP format into the format required by the software structure; the number of SNPs and the number of populations used in the analysis are then specified, and the percentage of each ancestry to which each sample belongs is calculated;

the optimal K value (number of ancestral populations) is then determined, and the result shows whether the grouping of the samples is consistent with the grouping initially specified;

the tenth step:

a custom Perl script searches for SNPs shared between, and specific to, two populations according to the population SNP information and the specified grouping information.

In a preferred embodiment of the present invention, the filter script is filter_batch_v2.pl.

In a preferred embodiment of the present invention, the model for constructing the phylogenetic tree comprises any one or more of maximum parsimony (MP), neighbor-joining (NJ), maximum likelihood (ML) and Bayesian inference (BI).

In a preferred embodiment of the present invention, the optimal K value is the K value corresponding to the inflection point at which the ln likelihood enters the plateau.

The invention has the beneficial effects that:

the overall data analysis is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and yields more presentable charts.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a PCA distribution plot of the present invention.

FIG. 3 is an evolutionary tree based on the NJ model of the present invention.

FIG. 4 is a population genetic structure plot at the optimal K = 3 of the present invention.

Detailed Description

The principle of the invention is as follows:

the automated 2d-RAD reference-free reduced-representation filtering workflow splits and filters data in batches and automatically completes the various analyses that follow data filtering. This improves data processing efficiency and server utilization, saves manual time, reduces human error, and ultimately shortens the overall project analysis cycle, achieving efficient, automated reference-free analysis with rich analysis content.

Referring to FIG. 1, the population evolution analysis method based on 2d-RAD sequencing without reference genome comprises the following steps:

(1) Data splitting step

The data are split automatically by a custom script according to the barcode and the enzyme 1 and enzyme 2 cleavage site information of the sequenced samples. In the input format, each row represents one sample, and the fields of each row are the sample name, the barcode bases, the enzyme 1 cleavage site and the enzyme 2 cleavage site, separated by tabs. If one sample has data from multiple sequencing runs, the workflow automatically matches and merges them, and the merged data are stored in fastq.gz format in the folder 1_RawData.

The splitting script works as follows:

a library contains multiple samples; a four-column file giving the sample name, the barcode, and the enzyme 1 and enzyme 2 cleavage site sequences is used as input file 1, and the raw paired-end fastq.gz data of the library are used as input files 2 and 3;

if the first 7 bp at the 5' end of R1 of a read pair match the barcode, the next 5 bases match the enzyme 1 cleavage site, and the first 4 bp at the 5' end of the corresponding R2 match the enzyme 2 cleavage site sequence, the read pair is assigned to that sample; after looping over all reads, the split data of each sample are output.
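The matching rule of the splitting script can be illustrated with a minimal Python sketch. It is not the splitting script of the invention: the function names, the in-memory read pairs and the example barcodes and cleavage site sequences are hypothetical, and the prefix lengths are taken from the sample sheet rather than fixed at 7, 5 and 4 bases.

```python
from collections import defaultdict


def load_sample_sheet(lines):
    """Parse tab-separated rows: sample name, barcode, enzyme 1 site, enzyme 2 site."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, barcode, site1, site2 = line.split("\t")
        samples.append((name, barcode, site1, site2))
    return samples


def assign_sample(r1_seq, r2_seq, samples):
    """Return the sample whose barcode and cleavage sites match the read pair, else None."""
    for name, barcode, site1, site2 in samples:
        # R1 must start with barcode + enzyme 1 site; R2 must start with the enzyme 2 site.
        if r1_seq.startswith(barcode + site1) and r2_seq.startswith(site2):
            return name
    return None


def demultiplex(read_pairs, samples):
    """Group (R1, R2) sequence pairs by sample; unmatched pairs are discarded."""
    by_sample = defaultdict(list)
    for r1_seq, r2_seq in read_pairs:
        name = assign_sample(r1_seq, r2_seq, samples)
        if name is not None:
            by_sample[name].append((r1_seq, r2_seq))
    return dict(by_sample)
```

A real implementation would stream the gzipped FASTQ files record by record and keep the quality lines; the sketch only shows the assignment rule.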

(2) Data quality control and filtering step

Quality control is performed on the samples with the custom automated filtering script filter_batch_v2.pl, and the data are filtered by base quality value (Q ≥ 20) and sequence length (≥ 50 bp). After the run finishes, all high-quality data are stored in fastq.gz format in 2_HQData.

The filter script filter_batch_v2.pl proceeds as follows:

first, the paired-end sequence files ${name}-R1.fastq.gz and ${name}-R2.fastq.gz of the raw sequencer output in 1_RawData are read as input files and renamed, and quality control is run on them with the software FastQC to obtain report files with information such as the base quality of the raw data;

next, the software AdapterRemoval takes the fastq.gz files of the raw data as input and removes the sequencing adapters, and the newly generated result files are stored in fastq format in 2_HQData; the fastq files generated in this step are then used as input to the sequence quality filtering program, which filters by quality with a sliding window method, with the window size set to 5 bp and the step size set to 1 bp;

the window moves forward one base at a time and the average Q value of its 5 bases is calculated; if the average Q value of a window is less than or equal to 20, the read is truncated there, keeping only the penultimate base of that window and the bases before it;

finally, if either read of a pair is 50 bp or shorter, both reads of the pair are removed. The final results are output as ${name}-HQ-R1.fq and ${name}-HQ-R2.fq.
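The sliding-window trimming and the paired length filter can be sketched in Python as follows. This is an illustration only, not filter_batch_v2.pl: the function names are hypothetical, quality values are assumed to be already decoded into integers, and the exact cut position inside a failing window (here, through the window's second-to-last base) is an assumption.

```python
def trim_sliding_window(seq, quals, window=5, step=1, min_mean_q=20):
    """Truncate a read at the first window whose mean quality is <= min_mean_q."""
    for start in range(0, len(seq) - window + 1, step):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q <= min_mean_q:
            # Assumption: keep everything up to the window's second-to-last base.
            keep = start + window - 1
            return seq[:keep], quals[:keep]
    return seq, quals


def filter_pair(r1, r2, drop_at_or_below=50):
    """Trim both reads of a pair; drop the pair if either read ends up too short.

    r1 and r2 are (sequence, quality list) tuples.
    Returns the trimmed pair, or None when the pair is removed.
    """
    t1 = trim_sliding_window(*r1)
    t2 = trim_sliding_window(*r2)
    if len(t1[0]) <= drop_at_or_below or len(t2[0]) <= drop_at_or_below:
        return None
    return t1, t2
```

Dropping both reads when one fails keeps the R1 and R2 output files in step with each other, which paired-end downstream tools generally require.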

(3) Sequence clustering step in single sample

Because reference-free analysis has no reference genome, sequences are first clustered within each single sample. Before clustering, the paired-end sequencing data of the sample are merged into one file; clustering is then performed with the ustacks command of the software Stacks to obtain the representative category sequences of each sample, and the result files are stored in tags.tsv.gz format in the folder 3_stacks.

(4) All sample category sequence clustering step

The grouping information of the samples is specified, and clustering is performed on the category sequences of the single samples to obtain the consensus sequences of all samples, which serve as a reference-like genome sequence for all samples.

(5) Step of detecting population SNP

The grouping information specified for each sample is read, a missing rate parameter is specified, the population SNP information is detected with the cstacks command of the software Stacks, and the population SNP information is stored as a VCF file.

(6) Analysis of population genetic parameters (Fst, pi, heterozygosity, haplotype diversity)

Based on the population SNP information, the population genetic parameters are analyzed with the populations command of Stacks, yielding the population differentiation index Fst, the nucleotide diversity pi, the expected heterozygosity, the observed heterozygosity and the haplotype diversity.
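As an illustration of two of these parameters, the following Python sketch computes the observed and expected heterozygosity at a single SNP site from diploid genotypes. It uses the textbook definitions (fraction of heterozygous individuals, and one minus the sum of squared allele frequencies); the function name is hypothetical, and the estimators used by Stacks may differ, for example through a sample-size correction.

```python
def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one site.

    genotypes: list of (allele, allele) tuples, one per individual,
    with missing genotypes already excluded.
    """
    n = len(genotypes)
    # Observed: fraction of individuals carrying two different alleles.
    observed = sum(1 for a, b in genotypes if a != b) / n

    # Expected: 1 - sum of squared allele frequencies.
    counts = {}
    for a, b in genotypes:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    total = 2 * n
    expected = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return observed, expected
```

For four individuals with genotypes A/A, A/G, G/G and A/G, both values are 0.5.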

(7) Step of population PCA analysis

The VCF file of population SNPs is converted in format with the software vcftools and plink; dimensionality reduction is performed on the SNPs with the software GCTA to obtain the three principal components with the greatest influence on the population; the contribution of each principal component is calculated; and finally a PCA distribution plot is drawn with a custom R script.

The custom R script first reads the PC1 and PC2 vector information output by GCTA as its input file, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with the ggplot2 package in R.
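The data preparation behind such a plot can be sketched in Python (the plotting itself is omitted). The sketch rests on two assumptions that the text does not state: that the eigenvector file is whitespace-separated with a family ID, an individual ID and then the principal components on each row, and that the contribution rate of a component is its eigenvalue divided by the sum of the eigenvalues supplied. The function names are hypothetical.

```python
def parse_eigenvec(lines):
    """Map each individual ID to its (PC1, PC2) coordinates.

    Assumed row layout: family ID, individual ID, PC1, PC2, ...
    """
    coords = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        coords[fields[1]] = (float(fields[2]), float(fields[3]))
    return coords


def contribution_rates(eigenvalues):
    """Share of each eigenvalue in the total of the eigenvalues supplied."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]
```

If only the leading eigenvalues are available, the resulting shares are relative to those components only and overstate the share of total variance.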

(8) Step of phylogenetic tree analysis of populations

A custom script joins the obtained population SNP information with the SNP information of each sample and converts the format, and a phylogenetic tree is then constructed using different models.

Common models for building evolutionary trees include maximum parsimony (MP), neighbor-joining (NJ), maximum likelihood (ML) and Bayesian inference (BI).

The MP model is suitable for long sequences with high similarity, a large number of nucleotides or amino acids, a stable substitution rate, and no back mutations or parallel mutations at the sites. The NJ model is suitable for short sequences with small evolutionary distances and few informative sites. Given a specified evolutionary model, the ML method is the tree-building method that best fits the evolutionary facts. The BI model retains the basic principle of the maximum likelihood method and introduces the Markov chain Monte Carlo method; it is suitable for inferring phylogenetic trees, evaluating their uncertainty, detecting selection, comparing phylogenetic trees, calculating divergence times with reference to fossil records, and testing the molecular clock.
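The format conversion that precedes tree building can be sketched as follows, under the assumption that it amounts to concatenating each sample's genotypes at the population SNP sites into one sequence and writing an alignment. The choice of IUPAC ambiguity codes for heterozygous sites, the relaxed PHYLIP output and all function names are illustrative assumptions, not details given in the text.

```python
# IUPAC ambiguity codes for heterozygous biallelic genotypes.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_base(genotype):
    """Encode one diploid genotype as a single character; missing data becomes N."""
    if genotype is None:
        return "N"
    a, b = genotype
    if a == b:
        return a
    return IUPAC.get(frozenset((a, b)), "N")


def snps_to_sequences(samples, sites):
    """Concatenate per-site genotypes into one sequence per sample.

    sites: list of dicts mapping sample name -> (allele, allele) or None.
    """
    return {
        sample: "".join(genotype_to_base(site.get(sample)) for site in sites)
        for sample in samples
    }


def to_phylip(sequences):
    """Render the sequences as a relaxed PHYLIP alignment string."""
    length = len(next(iter(sequences.values())))
    lines = [f"{len(sequences)} {length}"]
    lines.extend(f"{name} {seq}" for name, seq in sequences.items())
    return "\n".join(lines)
```

The resulting alignment could then be passed to whichever MP, NJ, ML or BI program is used for tree construction.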

(9) Step of analysis of population genetic Structure

A custom script converts the population SNP format into the format required by the software structure; the number of SNPs and the number of populations used in the analysis are then specified, and the percentage of each ancestry to which each sample belongs is calculated. The optimal K value (number of ancestral populations) is then determined, and the result shows how the samples are grouped and whether that grouping is consistent with the one initially specified.

For each K value, the simulation based on the Bayesian model produces a corresponding maximum likelihood value, which is output after taking the natural logarithm. The larger the ln likelihood, the closer the K value is to the true value, but in general the ln likelihood reaches a plateau as K increases. The optimal K value is the K value corresponding to the inflection point at which the plateau is entered.
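One possible way to locate the point at which the ln likelihood enters the plateau is sketched below. The text does not specify a numerical criterion, so the rule used here (the first K after which the gain in ln likelihood drops below a fixed fraction of the largest gain) is a hypothetical heuristic, as are the function name and the default fraction of 0.1.

```python
def plateau_k(ln_likelihood_by_k, flat_fraction=0.1):
    """Return the first K after which ln likelihood gains become small.

    ln_likelihood_by_k: dict mapping K -> ln likelihood.
    flat_fraction: a gain below this fraction of the largest gain counts as flat.
    """
    ks = sorted(ln_likelihood_by_k)
    # Gain in ln likelihood when moving from each K to the next K.
    gains = [ln_likelihood_by_k[b] - ln_likelihood_by_k[a] for a, b in zip(ks, ks[1:])]
    if not gains or max(gains) <= 0:
        return ks[0]
    largest = max(gains)
    for k, gain in zip(ks, gains):
        if gain < flat_fraction * largest:
            return k
    return ks[-1]  # no plateau reached within the tested range
```

With made-up values of -5000, -4200, -3900, -3890 and -3885 for K = 1 to 5, the gains are 800, 300, 10 and 5, so the function returns K = 3. In practice the curve is usually also inspected by eye.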

(10) Step of population-specific SNP analysis

A custom script searches for SNPs shared between, and specific to, two large populations according to the population SNP information and the specified grouping information.

The original SNPs are filtered according to genotype missingness and the sequencing depth of the SNP loci. The specificity of a SNP to a population is defined by two thresholds, A and B: first, the frequency of the SNP in the target population must be higher than threshold A; second, the frequency of the SNP in the non-target populations must be lower than threshold B. The threshold is generally set to 0.8.
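The two-threshold rule can be illustrated with a short Python sketch. Several points are assumptions rather than details from the text: the frequency of a SNP is taken as the fraction of samples in a group that carry it, a SNP is called shared when it occurs in both groups without meeting the specificity rule, and both thresholds default to 0.8 because the text gives a single value without saying which threshold it applies to. The function name and data layout are hypothetical.

```python
def classify_snps(carriers, groups, target, a=0.8, b=0.8):
    """Split SNPs into target-specific and shared lists.

    carriers: dict mapping SNP ID -> set of sample names carrying the SNP.
    groups: dict mapping sample name -> group name.
    target: name of the target group; all other samples form the non-target set.
    """
    target_samples = {s for s, g in groups.items() if g == target}
    other_samples = set(groups) - target_samples
    specific, shared = [], []
    for snp, samples in carriers.items():
        f_target = len(samples & target_samples) / len(target_samples)
        f_other = len(samples & other_samples) / len(other_samples)
        if f_target > a and f_other < b:
            specific.append(snp)   # frequent in the target, rare elsewhere
        elif f_target > 0 and f_other > 0:
            shared.append(snp)     # present in both groups
    return specific, shared
```

Passing a stricter value for B, such as 0.2, restricts the specific list to SNPs that are nearly absent outside the target group.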

Based on the above steps, the invention has the following advantages:

(1) The overall data analysis is more automated, which saves labor cost, improves analysis efficiency and avoids possible human error.

(2) The analysis content is richer, and the result charts are more presentable (as shown in FIGS. 2-4).

Claims

1. A population evolution analysis method based on 2d-RAD sequencing without reference genome, which is characterized by comprising the following steps:

the splitting script specifically comprises the following steps:

if the first 7 bp at the 5' end of R1 of a read pair match the barcode, the next 5 bases match the enzyme 1 cleavage site, and the first 4 bp at the 5' end of the corresponding R2 match the enzyme 2 cleavage site sequence, the read pair is split into that sample; this is repeated over all reads, and the split data of each sample are finally output;

the filtering script is filter_batch_v2.pl;

the filtering script first reads the paired-end sequence files ${name}-R1.fastq.gz and ${name}-R2.fastq.gz of the raw sequencer output in 1_RawData as input files, renames them, and runs quality control on them with the software FastQC to obtain report files with the base quality information of the raw data;

next, the software AdapterRemoval takes the fastq.gz files of the raw data as input and removes the sequencing adapters, and the newly generated result files are stored in fastq format in 2_HQData; the fastq files generated in this step are then used as input to the sequence quality filtering program, which filters by quality with a sliding window method, with the window size set to 5 bp and the step size set to 1 bp;

then, if either read of a pair is 50 bp or shorter, both reads of the pair are removed, and the final results are output as ${name}-HQ-R1.fq and ${name}-HQ-R2.fq;

the custom R script first reads the PC1 and PC2 vector information output by GCTA as its input file, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with ggplot2 in R;

eighth step: a custom Perl script joins the obtained population SNP information with the SNP information of each single sample and converts the format, and a phylogenetic tree is then constructed using different models;

the model for constructing the phylogenetic tree is any one or more of maximum parsimony, neighbor-joining, maximum likelihood and Bayesian inference;

ninth step:

a custom Python script converts the population SNP format into the format required by the software structure; the number of SNPs and the number of populations used in the analysis are then specified, and the percentage of each ancestry to which each sample belongs is calculated;

the optimal K value for the number of ancestral populations is then determined, the optimal K value being the K value corresponding to the inflection point at which the ln likelihood enters the plateau, and the result shows whether the grouping of the samples is consistent with the grouping initially specified;

tenth step:

a custom Perl script searches for SNPs shared between, and specific to, two populations according to the population SNP information and the specified grouping information;

specifically, the original SNPs are filtered according to genotype missingness and the sequencing depth of the SNP loci, and the specificity of a SNP to a population is defined by two thresholds A and B: first, the frequency of the SNP in the target population is higher than threshold A; second, the frequency of the SNP in the non-target populations is lower than threshold B; the threshold is set to 0.8.