CN112164424B - Group evolution analysis method based on no-reference genome - Google Patents



Publication number
CN112164424B
Authority
CN
China
Prior art keywords
snp
data
group
sample
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010768331.5A
Other languages
Chinese (zh)
Other versions
CN112164424A (en)
Inventor
刘书云
张海焕
姜丽荣
孙子奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Personal Gene Technology Co ltd
Original Assignee
Nanjing Personal Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Personal Gene Technology Co ltd
Priority to CN202010768331.5A
Publication of CN112164424A
Application granted
Publication of CN112164424B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a population evolution analysis method based on 2d-RAD sequencing that requires no reference genome. The method splits the sequencing data by sample, filters and clusters the data to obtain population SNPs, analyzes population genetic parameters from the sample grouping and the population SNP information, constructs a phylogenetic tree, determines the optimal K value, and then uses a custom R script to search for shared and specific SNPs between two groups according to the population SNP information and the specified grouping information, thereby completing a population evolution analysis without a reference genome. The whole data analysis is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and produces more attractive charts.

Description

Group evolution analysis method based on no-reference genome
Technical Field
The invention relates to the technical field of gene sequencing analysis, and in particular to a population evolution analysis method that does not rely on a reference genome.
Background
Population evolution analysis makes it possible to explore in depth the differences in population structure and the gene flow between subgroups of the same species, and to study population structure characteristics across different species. However, reference genomes have not yet been published for many species, so population evolution analysis must be performed without a reference genome (reference-free analysis).
There are several reference-free library construction methods (RAD, GBS, 2d-RAD, SLAF, etc.), and they differ in the first step of reference-free analysis, data splitting. The existing reference-free analysis method based on 2d-RAD library construction has a complex data filtering workflow and low efficiency, especially when there are many projects and each project contains many samples. In practice, one project may be sequenced on the machine several times, which yields several batches of data. The existing reference-free analysis method cannot merge and filter these batches through an automated workflow, so data merging and filtering consume a great deal of manual labor.
With the continuous development of high-throughput sequencing, the existing analysis workflow appears thin in content, whereas newer reference-free analyses are more diverse and personalized. In the past, many stages of the reference-free workflow had to be operated manually. The new reference-free analysis method is more automated; the automated workflow improves server utilization, reduces the workload of analysts, and makes the analysis content easier to control.
Disclosure of Invention
To overcome the above drawbacks of the prior art, the invention aims to provide an automated method for population evolution analysis without a reference genome.
To achieve this purpose, the following technical solution is adopted:
A population evolution analysis method based on 2d-RAD sequencing without a reference genome, comprising the following steps:
First step: according to the barcode and the enzyme 1 and enzyme 2 cleavage site information of the sequencing samples, split the data with a splitting script, merge the data from multiple sequencing runs of the same sample, and store the merged data in fastq.gz format in a first folder;
Second step: run FastQC quality control on the split and merged data from the first step through a filtering script, then filter the data by base quality value and sequence length, using the criteria Q ≥ 20 and length ≥ 50 bp, and store the filtered data in fastq.gz format in a second folder;
Third step: first cluster sequences within each single sample. Before clustering, merge the paired-end sequencing data of the sample into one file, then cluster with the ustacks command of the Stacks software to obtain the representative category sequences of each sample, and store the result files in tags.tsv.gz format in a third folder;
Fourth step: after grouping the samples, cluster the category sequences of the single samples to obtain the consensus sequences of all samples; these consensus sequences serve as a reference-like genome sequence for all samples;
Fifth step: read the grouping information specified for all samples, specify a missing-rate parameter, detect population SNPs with the csstacks command of the Stacks software, and store the population SNP information as a VCF file;
Sixth step: based on the population SNP information from the fifth step, analyze population genetic parameters with the populations command of Stacks, and calculate the population differentiation index Fst, the population nucleotide diversity pi, the expected and observed heterozygosity, and the haplotype diversity;
Seventh step: convert the format of the population SNP VCF file from the fifth step with vcftools and plink, perform dimensionality reduction on the SNPs with GCTA to obtain the three principal components with the greatest influence on the population, calculate the contribution of each principal component, and finally draw a PCA plot with a custom R script;
Eighth step: use a custom Python script to concatenate the population SNP information with the SNP information of each single sample and convert the format, then construct a phylogenetic tree with different models;
Ninth step:
use a custom Perl script to convert the population SNP format into the format required by the software STRUCTURE, then specify the number of SNPs and the number of populations used in the analysis, and calculate the ancestry percentage of each sample;
then determine the optimal K value (the number of ancestral populations), and check from the result whether the grouping of the samples is consistent with the initially specified grouping;
Tenth step:
use a custom Perl script to search for shared and specific SNPs between two groups according to the population SNP information and the specified grouping information.
In a preferred embodiment of the invention, the filtering script is filter_batch_v2.pl.
In a preferred embodiment of the invention, the models for constructing the phylogenetic tree include any one or more of maximum parsimony (MP), neighbor-joining (NJ), maximum likelihood (ML), and Bayesian inference (BI).
In a preferred embodiment of the invention, the optimal K value is the K value at the inflection point where the ln likelihood enters the plateau.
The beneficial effects of the invention are as follows:
the whole data analysis is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and produces more attractive charts.
Drawings
FIG. 1 is a flow chart of the invention.
FIG. 2 is a PCA plot of the invention.
FIG. 3 is a phylogenetic tree of the invention based on the NJ model.
FIG. 4 is a population genetic structure plot of the invention at the optimal K = 3.
Detailed Description
The principle of the invention is as follows:
The automated filtering workflow for 2d-RAD reference-free reduced-representation sequencing can split and filter data in batches, and the analyses that follow data filtering are completed automatically. This improves data processing efficiency and server utilization, saves labor time, reduces human error, and ultimately shortens the whole project analysis cycle, achieving efficient, automated reference-free analysis with rich analysis content.
Referring to FIG. 1, the population evolution analysis method based on 2d-RAD sequencing without a reference genome comprises the following steps:
(1) Data splitting step
A custom script automatically splits the data according to the barcode, the enzyme 1 cleavage site, and the enzyme 2 cleavage site of each sequencing sample. In the sample information file, each row represents one sample, and its fields are the sample name, the barcode bases, the enzyme 1 cleavage site, and the enzyme 2 cleavage site, separated by tabs. If a sample has been sequenced in multiple runs, the workflow automatically matches and merges the runs, and the merged data are stored in fastq.gz format in the 1_RawData folder.
The splitting script works as follows:
A library contains multiple samples. A four-column file listing the sample name, barcode, enzyme 1 cleavage site sequence, and enzyme 2 cleavage site sequence serves as input file 1, and the raw paired-end fastq.gz files of the library serve as input files 2 and 3;
If the first 7 bp at the 5' end of R1 of a read pair match the barcode, the next 5 bases match the enzyme 1 cleavage site, and the first 4 bp at the 5' end of the corresponding R2 match the enzyme 2 cleavage site sequence, the read pair is assigned to that sample. After looping over all reads, the split data of each sample are output.
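The matching rule above can be sketched in Python as follows. This is only an illustration, not the patent's splitting script: the function names and the in-memory representation of read pairs are assumptions, and FASTQ parsing and the merging of multiple runs are left out.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# One sample-sheet row: (sample name, barcode, enzyme 1 site, enzyme 2 site).
SampleRow = Tuple[str, str, str, str]


def parse_sample_sheet(lines: Iterable[str]) -> List[SampleRow]:
    """Parse tab-separated rows: name, barcode, enzyme 1 site, enzyme 2 site."""
    rows: List[SampleRow] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, barcode, site1, site2 = line.split("\t")
        rows.append((name, barcode, site1, site2))
    return rows


def assign_sample(r1_seq: str, r2_seq: str, rows: List[SampleRow]) -> Optional[str]:
    """Return the sample whose barcode and enzyme sites match the read pair."""
    for name, barcode, site1, site2 in rows:
        # R1 must start with the barcode followed directly by the enzyme 1
        # site; R2 must start with the enzyme 2 site.
        if r1_seq.startswith(barcode + site1) and r2_seq.startswith(site2):
            return name
    return None


def demultiplex(
    pairs: Iterable[Tuple[str, str]], rows: List[SampleRow]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (R1, R2) sequence pairs by sample; unmatched pairs are dropped."""
    by_sample: Dict[str, List[Tuple[str, str]]] = {name: [] for name, *_ in rows}
    for r1_seq, r2_seq in pairs:
        name = assign_sample(r1_seq, r2_seq, rows)
        if name is not None:
            by_sample[name].append((r1_seq, r2_seq))
    return by_sample
```

With a 7-base barcode, a 5-base enzyme 1 site, and a 4-base enzyme 2 site in the sample sheet, the prefix test is equivalent to the 7 bp, 5 bp, and 4 bp comparison described in the text.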
(2) Data quality control and filtering step
Quality control is performed on the samples with the custom automated filtering script filter_batch_v2.pl, and the data are filtered by base quality value (Q ≥ 20) and sequence length (≥ 50 bp). After the run finishes, all high-quality data are stored in fastq.gz format in 2_HQData.
The filtering script filter_batch_v2.pl works as follows:
First, the paired-end files ${name}-R1.fastq.gz and ${name}-R2.fastq.gz of the raw data in 1_RawData are read as input files and renamed, and quality control is run on them with FastQC to obtain result files describing the base quality and related information of the raw data;
Next, AdapterRemoval takes the raw fastq.gz files as input and removes the sequencing adapters, and the newly generated result files are stored in fastq format in 2_HQData. These fastq files are then used as input to the sequence quality filtering program, which filters by quality with a sliding window method; the window size is set to 5 bp and the step size to 1 bp;
The window moves forward one base at a time, and the average Q value of the 5 bases in the window is calculated. If the average Q value of a window is less than or equal to 20, only the last base and the preceding bases of the window are kept;
Finally, if either read of a pair is 50 bp or shorter, both reads of the pair are removed. The final results are output as ${name}-HQ-R1.fq and ${name}-HQ-R2.fq.
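A minimal Python sketch of the sliding-window quality filter follows. It is not the code of filter_batch_v2.pl. Two points are assumptions: quality strings are taken to be Phred+33 encoded, and the truncation rule is read as "cut the read so that it ends with the last base of the window preceding the first failing window", which is only one possible reading of the description.

```python
from typing import Optional, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string)


def trim_read(seq: str, qual: str, window: int = 5, step: int = 1,
              min_q: float = 20, offset: int = 33) -> Read:
    """Truncate a read at the first window whose mean Q is <= min_q."""
    scores = [ord(c) - offset for c in qual]
    for start in range(0, len(scores) - window + 1, step):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q <= min_q:
            # Assumed rule: keep bases up to the last base of the previous
            # window; if the very first window fails, nothing is kept.
            keep = start + window - 1 if start > 0 else 0
            return seq[:keep], qual[:keep]
    return seq, qual


def filter_pair(r1: Read, r2: Read, min_len: int = 50) -> Optional[Tuple[Read, Read]]:
    """Trim both reads; drop the pair if either read is min_len bp or shorter."""
    t1 = trim_read(*r1)
    t2 = trim_read(*r2)
    if len(t1[0]) <= min_len or len(t2[0]) <= min_len:
        return None
    return t1, t2
```

Dropping both reads when either one fails keeps the R1 and R2 output files in step, which paired-end tools downstream generally expect.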
(3) Sequence clustering step within a single sample
Because reference-free analysis has no reference genome, sequences are first clustered within each single sample. Before clustering, the paired-end sequencing data of the sample are merged into one file; clustering is then performed with the ustacks command of the Stacks software to obtain the representative category sequences of each sample, and the result files are stored in tags.tsv.gz format in the 3_stacks folder.
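The merge-then-cluster step might be driven from Python roughly as below. The helper names are hypothetical and the patent's own pipeline code is not shown in the source. Only the ustacks options `-f` (input file), `-i` (sample ID), and `-o` (output directory) are used; other options, such as the input type or the thread count, differ between Stacks versions and are deliberately omitted.

```python
import gzip
import shutil
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def merge_paired_reads(r1_path: PathLike, r2_path: PathLike,
                       merged_path: PathLike) -> Path:
    """Write the R1 records followed by the R2 records into one gzip file."""
    with gzip.open(merged_path, "wb") as out:
        for path in (r1_path, r2_path):
            with gzip.open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return Path(merged_path)


def ustacks_command(merged_path: PathLike, sample_id: int,
                    out_dir: str = "3_stacks") -> List[str]:
    """Build an argument list for ustacks (to be run with subprocess.run)."""
    return ["ustacks", "-f", str(merged_path), "-i", str(sample_id), "-o", out_dir]
```

The command is returned as a list rather than executed, so it can be inspected or logged before being passed to `subprocess.run`.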
(4) Clustering step for the category sequences of all samples
The grouping information of the samples is specified, and the category sequences of the single samples are clustered to obtain the consensus sequences of all samples, which serve as a reference-like genome sequence for all samples.
(5) Population SNP detection step
The grouping information specified for all samples is read, a missing-rate parameter is specified, population SNPs are detected with the csstacks command of the Stacks software, and the population SNP information is stored as a VCF file.
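To illustrate what a missing-rate parameter does, the sketch below filters a small genotype table in Python. It is a stand-alone illustration, not the Stacks implementation; the data layout (one list of genotypes per SNP, with `None` for a missing call) and the function names are assumptions.

```python
from typing import Dict, List, Optional

Genotype = Optional[str]  # e.g. "0/1"; None marks a missing call


def missing_rate(genotypes: List[Genotype]) -> float:
    """Fraction of samples without a genotype call at one SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_by_missing_rate(snps: Dict[str, List[Genotype]],
                           max_missing: float) -> Dict[str, List[Genotype]]:
    """Keep SNPs whose missing rate does not exceed max_missing."""
    return {snp_id: calls for snp_id, calls in snps.items()
            if missing_rate(calls) <= max_missing}
```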
(6) Population genetic parameter analysis step (Fst, pi, heterozygosity, haplotype diversity)
Based on the population SNP information, population genetic parameters are analyzed with the populations command of Stacks, and the population differentiation index Fst, the population nucleotide diversity pi, the expected heterozygosity, the observed heterozygosity, and the haplotype diversity are calculated.
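For orientation, textbook single-site versions of some of these statistics can be written in a few lines of Python. This sketch assumes biallelic SNPs coded as the number of alternate alleles per individual (0, 1, or 2) and uses a simple heterozygosity-based Fst for two populations; the estimators that Stacks actually applies may differ, for example in sample-size corrections.

```python
from typing import List


def allele_freq(genos: List[int]) -> float:
    """Alternate-allele frequency from 0/1/2 genotype codes."""
    return sum(genos) / (2 * len(genos))


def observed_het(genos: List[int]) -> float:
    """Fraction of individuals that are heterozygous."""
    return sum(g == 1 for g in genos) / len(genos)


def expected_het(genos: List[int]) -> float:
    """Expected heterozygosity 2pq under Hardy-Weinberg proportions."""
    p = allele_freq(genos)
    return 2 * p * (1 - p)


def fst(pop_a: List[int], pop_b: List[int]) -> float:
    """(Ht - Hs) / Ht, with Hs the mean within-population heterozygosity."""
    hs = (expected_het(pop_a) + expected_het(pop_b)) / 2
    ht = expected_het(pop_a + pop_b)
    return 0.0 if ht == 0 else (ht - hs) / ht
```

Two populations fixed for different alleles give an Fst of 1, while two populations with identical allele frequencies give 0.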
(7) Population PCA analysis step
The population SNP VCF file is converted with vcftools and plink, dimensionality reduction is performed on the SNPs with GCTA to obtain the three principal components with the greatest influence on the population, the contribution of each principal component is calculated, and finally a PCA plot is drawn with a custom R script.
The custom R script first reads the PC1 and PC2 vectors output by GCTA as input, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with the ggplot2 package in R.
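The contribution-rate calculation can be sketched as below. The patent's script is written in R and is not reproduced in the source, so this Python version is only an illustration. It assumes that the contribution rate of a component is its eigenvalue divided by the sum of the supplied eigenvalues, and that the eigenvector file has whitespace-separated columns in the order family ID, individual ID, PC1, PC2, and so on.

```python
from typing import Dict, Iterable, List, Tuple


def contribution_rates(eigenvalues: List[float]) -> List[float]:
    """Share of the total variance attributed to each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def parse_eigenvec(lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map individual ID to its (PC1, PC2) coordinates."""
    coords: Dict[str, Tuple[float, float]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        coords[fields[1]] = (float(fields[2]), float(fields[3]))
    return coords
```

The resulting percentages are typically placed on the axis labels of the PC1 versus PC2 scatter plot.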
(8) Population phylogenetic tree analysis step
A custom script concatenates the population SNP information with the SNP information of each sample and converts the format, and a phylogenetic tree is then constructed with different models.
Common models for building phylogenetic trees include maximum parsimony (MP), neighbor-joining (NJ), maximum likelihood (ML), and Bayesian inference (BI);
The MP method is suitable for long sequences with high sequence similarity, a large number of nucleotides or amino acids, and a stable substitution rate, in which sites show no back mutations or parallel mutations. The NJ method is suitable for short sequences with small evolutionary distances and few informative sites. Given a specified evolutionary model, the ML method is the tree-building method that best fits the evolutionary facts. The BI method retains the basic principle of maximum likelihood and introduces Markov chain Monte Carlo (MCMC); it is suitable for inferring phylogenetic trees, assessing their uncertainty, detecting selection, comparing phylogenetic trees, estimating divergence times with reference to fossil records, and testing the molecular clock.
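One common way to prepare SNP data for tree-building programs is to concatenate each sample's SNP genotypes into a single sequence and write a FASTA alignment. The Python sketch below shows that idea; it is an assumption about what the format conversion involves, not the patent's script. Heterozygous sites are written with IUPAC ambiguity codes and missing calls as N.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # two alleles, or None when missing

# IUPAC ambiguity codes for heterozygous two-base genotypes.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def encode(genotype: Genotype) -> str:
    """Encode one genotype as a single alignment character."""
    if genotype is None:
        return "N"
    first, second = genotype
    return first if first == second else IUPAC[frozenset((first, second))]


def snps_to_fasta(samples: Dict[str, List[Genotype]]) -> str:
    """Concatenate each sample's SNPs (same site order) into a FASTA record."""
    records = []
    for name, genotypes in samples.items():
        sequence = "".join(encode(g) for g in genotypes)
        records.append(f">{name}\n{sequence}\n")
    return "".join(records)
```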
(9) Population genetic structure analysis step
A custom script converts the population SNP format into the format required by the software STRUCTURE; the number of SNPs and the number of populations used in the analysis are then specified, and the ancestry percentage of each sample is calculated. The optimal K value (the number of ancestral populations) is then determined, and the result shows how the samples group and whether that grouping is consistent with the initially specified one.
For each K value, the Bayesian model-based simulation produces a corresponding maximum likelihood value, which is output after taking the natural logarithm. The larger the ln likelihood, the closer the K value is to the true value; in general, however, the ln likelihood reaches a plateau as K increases. The optimal K value is the K value at the inflection point where the plateau begins.
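The inflection-point rule can be approximated programmatically. The Python sketch below uses a hypothetical heuristic that is not taken from the patent: it returns the first K after which the gain in ln likelihood drops below a fraction (`plateau_fraction`, an assumed parameter) of the largest gain observed.

```python
from typing import Dict


def optimal_k(ln_likelihood: Dict[int, float], plateau_fraction: float = 0.1) -> int:
    """Pick the K at which the ln likelihood curve starts to flatten."""
    ks = sorted(ln_likelihood)
    # Gain obtained by moving from each K to the next larger K.
    gains = [ln_likelihood[b] - ln_likelihood[a] for a, b in zip(ks, ks[1:])]
    if not gains or max(gains) <= 0:
        return ks[0]
    largest = max(gains)
    for k, gain in zip(ks, gains):
        if gain < plateau_fraction * largest:
            return k  # going beyond this K adds little
    return ks[-1]
```

In practice the curve is usually also inspected visually, since a fixed fraction is a crude stand-in for judging where the plateau begins.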
(10) Population-specific SNP analysis step
A custom script searches for shared and specific SNPs between two groups according to the population SNP information and the specified grouping information.
The raw SNPs are first filtered by genotype missingness and by the sequencing depth at each SNP site. Population specificity of a SNP is defined by two thresholds, A and B: the frequency of the SNP in the target population must be higher than threshold A, and its frequency in the non-target population must be lower than threshold B. The threshold is generally set to 0.8.
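The two-threshold rule can be sketched in Python as follows; this is an illustration rather than the patent's script. The input layout (for each SNP, a mapping from sample to whether it carries the variant) is an assumption, as is the definition of "shared" used here, namely a SNP that is not specific but occurs in both groups. Thresholds A and B are passed explicitly and no default is assumed for B.

```python
from typing import Dict, List


def occurrence_frequency(calls: Dict[str, bool], samples: List[str]) -> float:
    """Fraction of the given samples that carry the SNP."""
    return sum(calls.get(sample, False) for sample in samples) / len(samples)


def classify_snps(snp_calls: Dict[str, Dict[str, bool]], groups: Dict[str, str],
                  target: str, threshold_a: float,
                  threshold_b: float) -> Dict[str, List[str]]:
    """Split SNPs into target-specific and shared lists."""
    target_samples = [s for s, g in groups.items() if g == target]
    other_samples = [s for s, g in groups.items() if g != target]
    result: Dict[str, List[str]] = {"specific": [], "shared": []}
    for snp_id, calls in snp_calls.items():
        freq_target = occurrence_frequency(calls, target_samples)
        freq_other = occurrence_frequency(calls, other_samples)
        if freq_target > threshold_a and freq_other < threshold_b:
            result["specific"].append(snp_id)
        elif freq_target > 0 and freq_other > 0:
            result["shared"].append(snp_id)  # assumed meaning of "shared"
    return result
```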
Based on the above steps, the invention has the following advantages:
(1) The whole data analysis is more automated, which saves labor cost, improves analysis efficiency, and avoids possible human error.
(2) The analysis content is richer, and the charts of the analysis results are more attractive (see FIGS. 2-4).

Claims (1)

1. A population evolution analysis method based on 2d-RAD sequencing without a reference genome, characterized by comprising the following steps:
first step: according to the barcode and the enzyme 1 and enzyme 2 cleavage site information of the sequencing samples, splitting the data with a splitting script, merging the data from multiple sequencing runs of the same sample, and storing the merged data in fastq.gz format in a first folder;
the splitting script specifically operates as follows:
a library contains multiple samples; a four-column file listing the sample name, barcode, enzyme 1 cleavage site sequence, and enzyme 2 cleavage site sequence serves as input file 1, and the raw paired-end fastq.gz files of the library serve as input files 2 and 3;
if the first 7 bp at the 5' end of R1 of a read pair match the barcode, the next 5 bases match the enzyme 1 cleavage site, and the first 4 bp at the 5' end of the corresponding R2 match the enzyme 2 cleavage site sequence, the read pair is assigned to that sample; after looping over all reads, the split data of each sample are output;
second step: running FastQC quality control on the split and merged data from the first step through a filtering script, then filtering the data by base quality value and sequence length, using the criteria Q ≥ 20 and length ≥ 50 bp, and storing the filtered data in fastq.gz format in a second folder;
the filtering script is filter_batch_v2.pl;
the filtering script first reads the paired-end files ${name}-R1.fastq.gz and ${name}-R2.fastq.gz of the raw data in 1_RawData as input files, renames them, and runs quality control on them with FastQC to obtain result files describing the base quality information of the raw data;
next, AdapterRemoval takes the raw fastq.gz files as input and removes the sequencing adapters, and the newly generated result files are stored in fastq format in 2_HQData; these fastq files are then used as input to the sequence quality filtering program, which filters by quality with a sliding window method, the window size being set to 5 bp and the step size to 1 bp;
the window moves forward one base at a time, and the average Q value of the 5 bases in the window is calculated; if the average Q value of a window is less than or equal to 20, only the last base and the preceding bases of the window are kept;
then, if either read of a pair is 50 bp or shorter, both reads of the pair are removed, and the final results are output as ${name}-HQ-R1.fq and ${name}-HQ-R2.fq;
third step: first clustering sequences within each single sample, wherein before clustering the paired-end sequencing data of the sample are merged into one file, then clustering with the ustacks command of the Stacks software to obtain the representative category sequences of each sample, and storing the result files in tags.tsv.gz format in a third folder;
fourth step: after grouping the samples, clustering the category sequences of the single samples to obtain the consensus sequences of all samples, the consensus sequences serving as a reference-like genome sequence for all samples;
fifth step: reading the grouping information specified for all samples, specifying a missing-rate parameter, detecting population SNPs with the csstacks command of the Stacks software, and storing the population SNP information as a VCF file;
sixth step: based on the population SNP information from the fifth step, analyzing population genetic parameters with the populations command of Stacks, and calculating the population differentiation index Fst, the population nucleotide diversity pi, the expected and observed heterozygosity, and the haplotype diversity;
seventh step: converting the format of the population SNP VCF file from the fifth step with vcftools and plink, performing dimensionality reduction on the SNPs with GCTA to obtain the three principal components with the greatest influence on the population, calculating the contribution of each principal component, and finally drawing a PCA plot with a custom R script;
the custom R script first reads the PC1 and PC2 vectors output by GCTA as input, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with the ggplot2 package in R;
eighth step: using a custom Perl script to concatenate the population SNP information with the SNP information of each single sample and convert the format, then constructing a phylogenetic tree with different models;
the model for constructing the phylogenetic tree is any one or more of maximum parsimony, neighbor-joining, maximum likelihood, and Bayesian inference;
ninth step:
using a custom Python script to convert the population SNP format into the format required by the software STRUCTURE, then specifying the number of SNPs and the number of populations used in the analysis, and calculating the ancestry percentage of each sample;
then determining the optimal K value of the number of ancestral populations, the optimal K value being the K value at the inflection point where the ln likelihood enters the plateau, and checking from the result whether the grouping of the samples is consistent with the initially specified grouping;
tenth step:
using a custom Perl script to search for shared and specific SNPs between two groups according to the population SNP information and the specified grouping information;
specifically, the raw SNPs are filtered by genotype missingness and by the sequencing depth at each SNP site, and the population specificity of a SNP is defined by two thresholds A and B: first, the frequency of the SNP in the target population is higher than threshold A; second, the frequency of the SNP in the non-target population is lower than threshold B; the threshold is set to 0.8.
CN202010768331.5A 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome Active CN112164424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010768331.5A CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010768331.5A CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Publications (2)

Publication Number Publication Date
CN112164424A CN112164424A (en) 2021-01-01
CN112164424B CN112164424B (en) 2024-04-09

Family

ID=73859973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010768331.5A Active CN112164424B (en) 2020-08-03 2020-08-03 Group evolution analysis method based on no-reference genome

Country Status (1)

Country Link
CN (1) CN112164424B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113678767B (en) * 2021-08-10 2022-08-23 中国水产科学研究院黄海水产研究所 Breeding method for prawn disease resistance character

Citations (11)

Publication number Priority date Publication date Assignee Title
US7571151B1 (en) * 2005-12-15 2009-08-04 Gneiss Software, Inc. Data analysis tool for analyzing data stored in multiple text files
CN101968774A (en) * 2010-10-21 2011-02-09 中国人民解放军61938部队 Device and method for storing mobile data safely
GB201404479D0 (en) * 2013-03-15 2014-04-30 Palantir Technologies Inc Transformation of data items from data sources using a transformation script
CN104573409A (en) * 2015-01-04 2015-04-29 杭州和壹基因科技有限公司 Gene mapping multi-inspection method
CN105002567A (en) * 2015-06-30 2015-10-28 北京百迈客生物科技有限公司 Method for constructing high-throughput simplified methylation sequencing library without reference genome
CN108388771A (en) * 2018-01-24 2018-08-10 安徽微分基因科技有限公司 A kind of bio-diversity automatic analysis method
CN108537006A (en) * 2018-04-03 2018-09-14 郑州云海信息技术有限公司 A kind of gene sequence data processing method, apparatus and system
CN109182505A (en) * 2018-09-29 2019-01-11 南京农业大学 Mastadenitis of cow key SNPs site rs75762330 and 2b-RAD Genotyping and analysis method
WO2019191649A1 (en) * 2018-03-29 2019-10-03 Freenome Holdings, Inc. Methods and systems for analyzing microbiota
CN111235303A (en) * 2020-03-24 2020-06-05 中国环境科学研究院 Method for identifying cord-grass and spartina alterniflora
CN111276185A (en) * 2020-02-18 2020-06-12 上海桑格信息技术有限公司 Microorganism identification and analysis system and device based on second-generation high-throughput sequencing

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR101832834B1 (en) * 2017-03-09 2018-04-13 주식회사 샤인바이오 Method and system for multiple dot plot analysis

Non-Patent Citations (4)

Title
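RAD sequencing technology and its application in aquatic organism research; 胡景杰 et al.; Fisheries Science (水产科学); Vol. 37 (No. 1); pp. 125-132 *
Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences; Julian M. Catchen; G3 Genes Genomes Genetics; Vol. 1; pp. 171-182 *
Detection of the genetic diversity of 富民枳 using reduced-representation genome technology; 张珊珊; 陈剑; 杨文忠; Journal of Northeast Forestry University (东北林业大学学报); 20200414 (No. 04); pp. 38-43 *
A data merging technique in data integration; 董树明, 徐文胜, 董逸生; Modern Computer (现代计算机); 20031130 (No. 11); pp. 1-5 *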

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant