CN107609347A - A metatranscriptome data analysis method based on high-throughput sequencing technology

Info

Publication number: CN107609347A
Application number: CN201710720413.0A
Authority: CN
Inventors: 薛正晟; 杨洋; 姜丽荣; 孙子奎
Original assignee: SHANGHAI PERSONAL BIOTECHNOLOGY CO Ltd
Current assignee: SHANGHAI PERSONAL BIOTECHNOLOGY CO Ltd
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2018-01-19

Abstract

The invention discloses a metatranscriptome data analysis method based on high-throughput sequencing technology, comprising the following steps: (1) obtaining a high-quality dataset; (2) obtaining an mRNA transcript sequence set; (3) obtaining a non-redundant protein sequence set; (4) obtaining and analyzing functional group abundance profiles at each level; (5) performing taxonomic annotation on the gene sequences to obtain species composition profiles at the species level and at finer levels below species, and analyzing them; (6) based on the functional abundance profiles and species composition profiles obtained above, further performing Alpha and Beta diversity analysis on the metatranscriptome samples, and then screening for key biomarkers in the metagenome by means of multiple multivariate statistical methods; (7) drawing two-dimensional/three-dimensional charts with multiple data visualization and interactive tools to present the above analysis results comprehensively and objectively; (8) selecting a specific functional database for annotation analysis according to the sample source.

Description

A metatranscriptome data analysis method based on high-throughput sequencing technology

Technical field

The present invention relates to the technical field of biotechnology, and more particularly to a metatranscriptome data analysis method based on high-throughput sequencing technology.

Background art

The research object of metatranscriptomics is the mRNA of a microbiome. After the total RNA of the microbiome is obtained and the rRNA is removed, the RNA is reverse-transcribed into cDNA and insert-fragment libraries of appropriate length are constructed. These libraries are subjected to paired-end (PE) high-throughput sequencing, so that the fine composition of the active species in the whole community and their corresponding functional expression can be accurately quantified, the key biomarkers in the community can be identified, and their biological significance can be elucidated.

Summary of the invention

The technical problem to be solved by the present invention is to provide a metatranscriptome data analysis method based on high-throughput sequencing technology.

The technical problem to be solved by the present invention can be solved by the following technical solution:

A metatranscriptome data analysis method based on high-throughput sequencing technology, specifically comprising the following steps:

(1) quality screening is performed on the raw paired-end sequence data output by the high-throughput sequencer, to obtain a high-quality dataset that can be used for downstream metatranscriptomic analysis;

(2) ribosomal RNA sequences are predicted in the high-quality sequences and removed, to obtain an mRNA transcript sequence set;

(3) metatranscriptome sequence assembly is performed separately for each sample to construct metatranscriptome contig and scaffold sequence sets, and gene prediction is performed to obtain a non-redundant protein sequence set;

(4) functional annotation is performed on the protein sequences using multiple commonly used databases, to obtain functional group abundance profiles at each level, and differential comparison analysis, metabolic pathway enrichment analysis, and cluster analysis are performed;

(5) taxonomic annotation is performed on the gene sequences to obtain species composition profiles at the species level and at finer levels below species, and differential comparison analysis, cluster analysis, species richness and evenness analysis, and association network analysis are performed;

(6) based on the functional abundance profiles and species composition profiles obtained above, Alpha and Beta diversity analysis can further be performed on the metatranscriptome samples, and key biomarkers in the metagenome are then screened out by means of multiple multivariate statistical methods;

(7) two-dimensional/three-dimensional charts are drawn with multiple data visualization and interactive tools, to present the above analysis results comprehensively and objectively;

(8) according to the sample source, a specific functional database is selected for annotation analysis.
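The aggregation implied by step (4), from gene-level abundances to a functional group abundance profile, can be illustrated with a minimal Python sketch. This is not the patented implementation: the function name `functional_profile`, the gene identifiers, the group labels `KO_A`/`KO_B`, and the abundance values are all hypothetical, and the sketch assumes that each gene has already been assigned to at most one functional group by an upstream annotation step.

```python
from collections import defaultdict


def functional_profile(gene_abundance, gene_to_group):
    """Sum gene abundances per functional group and return relative abundances.

    Genes without an annotation in gene_to_group are skipped (an assumption;
    they could equally be collected under an "unassigned" label).
    """
    totals = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        group = gene_to_group.get(gene)
        if group is None:
            continue  # unannotated gene: not counted in this sketch
        totals[group] += abundance

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Normalize so that the profile of one sample sums to 1.
    return {group: value / grand_total for group, value in totals.items()}


# Hypothetical input for a single sample.
gene_abundance = {"gene_1": 30.0, "gene_2": 10.0, "gene_3": 60.0, "gene_4": 5.0}
gene_to_group = {"gene_1": "KO_A", "gene_2": "KO_A", "gene_3": "KO_B"}

profile = functional_profile(gene_abundance, gene_to_group)
for group, fraction in sorted(profile.items()):
    print(group, round(fraction, 3))
```

Running the same function once per sample would give one column per sample, which is the kind of table that the differential comparison and cluster analyses in step (4) operate on.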

Owing to the technical solution described above, the present invention has the following features:

(1) the gene fragments actively expressed in the community sample are sequenced directly, so that accurate quantification of the active species and the expressed functions is genuinely achieved;

(2) multiple functional annotation databases are available; databases such as KEGG, EggNOG, CAZy, NR, Swiss-Prot, GO, VFDB, and CARD are selected according to research needs, which optimizes the annotation of the active functional and metabolic profile of the metatranscriptome;

(3) the species of origin is accurately identified from microbial gene information, yielding a "high-resolution" fine composition profile of the active species at the species level and below;

(4) by means of multiple multivariate statistical analysis and machine learning methods, the active species associated with differences and their corresponding functions are mined systematically and in depth from the large-scale metatranscriptome data, so that the key active biomarkers are accurately identified.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the metatranscriptome data analysis method based on high-throughput sequencing technology of the present invention.

Fig. 2 is a statistical chart of the EggNOG functional group annotation results of the present invention. In the figure, the abscissa corresponds to the 25 major gene function categories of EggNOG, each represented by a capital letter, and the ordinate is the number of EggNOG functional groups annotated to the corresponding category.

Fig. 3 is a Unigene differential expression MA plot of the present invention. In the figure, the abscissa shows the average expression intensity of each Unigene in the two samples (groups), i.e., the A value, A = [log₂(Case) + log₂(Control)]/2, where Case and Control denote the expression level of the Unigene in the two samples (groups) respectively; the larger the abscissa value, the stronger the average expression intensity of the corresponding Unigene. The ordinate is the logarithm of the fold difference in expression of each Unigene between the two samples (groups), i.e., the M value, M = log₂(Control/Case). The larger the ordinate value, the higher the expression of the corresponding Unigene in the Control sample (group) and the lower its expression in the Case sample (group); the smaller the value, the higher the expression of the corresponding Unigene in the Case sample (group) and the lower its expression in the Control sample (group). Unigenes that are differentially expressed between the two samples (groups) are shown as red dots in the plot, and Unigenes with no difference in expression are shown as cyan dots.
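The A and M values defined for Fig. 3 can be computed as in the following Python sketch, which follows the definitions A = [log₂(Case) + log₂(Control)]/2 and M = log₂(Control/Case) given above. The function name `ma_values` and the `pseudocount` parameter are assumptions added for illustration; the patent does not say how zero expression values are handled, so the default of 0 reproduces the stated formulas exactly and a positive pseudocount is only an optional workaround.

```python
import math


def ma_values(case, control, pseudocount=0.0):
    """Return (A, M) for one Unigene.

    A = [log2(Case) + log2(Control)] / 2   (average expression intensity)
    M = log2(Control / Case)               (log fold difference)

    pseudocount is a hypothetical extra: it is added to both values so that
    a zero expression level does not make the logarithm undefined.
    """
    log_case = math.log2(case + pseudocount)
    log_control = math.log2(control + pseudocount)
    a_value = (log_case + log_control) / 2
    m_value = log_control - log_case  # equals log2(Control / Case)
    return a_value, m_value


# Hypothetical expression levels: 4 in the Case group, 16 in the Control group.
a_value, m_value = ma_values(case=4, control=16)
print(a_value, m_value)  # 3.0 2.0
```

A positive M here means higher expression in the Control sample (group), which matches the reading of the ordinate described for Fig. 3.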

Fig. 4 is a display rendering of the present invention. Based on the relative expression distribution table of the KO functional groups annotated for each sample in the KEGG functional database, the KOs enriched in each sample (group), i.e., those whose expression is significantly up-regulated, can be analyzed, and statistical tests are used to evaluate whether the difference is significant. The display form of the metabolic pathway enrichment results varies with the selected functional category.

Fig. 5 is a statistical chart of the PHI database annotation results of the present invention. In the figure, the abscissa corresponds to the 9 major gene categories of PHI, and the ordinate is the number of genes annotated to the corresponding category.

Detailed description of the embodiments

Referring to Fig. 1, the metatranscriptome data analysis method based on high-throughput sequencing technology shown in the figure specifically comprises the following steps:

(4) functional annotation is performed on the protein sequences using multiple commonly used databases, to obtain functional group abundance profiles at each level, and differential comparison analysis, metabolic pathway enrichment analysis, and cluster analysis are performed (see Fig. 2 and Fig. 3);

(5) taxonomic annotation is performed on the gene sequences to obtain species composition profiles at the species level and at finer levels below species, and differential comparison analysis, cluster analysis, species richness and evenness analysis, and association network analysis are performed (see Fig. 4);

(6) based on the functional abundance profiles and species composition profiles obtained above, Alpha and Beta diversity analysis can further be performed on the metatranscriptome samples, and key biomarkers in the metagenome are then screened out by means of multiple multivariate statistical methods (see Fig. 5);
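The Alpha and Beta diversity analysis mentioned in step (6) can be illustrated with one common metric of each kind. The patent does not name specific indices, so the choice of the Shannon index (Alpha diversity, within one sample) and the Bray-Curtis dissimilarity (Beta diversity, between two samples) is an assumption, as are the function names and the abundance values in this minimal Python sketch.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / (sum(a) + sum(b))."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same taxa in the same order")
    denominator = sum(sample_a) + sum(sample_b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return numerator / denominator


# Hypothetical species abundances for two samples over the same three taxa.
sample_1 = [10, 20, 0]
sample_2 = [5, 25, 10]

print(shannon_index(sample_1))          # Alpha diversity of sample 1
print(shannon_index(sample_2))          # Alpha diversity of sample 2
print(bray_curtis(sample_1, sample_2))  # Beta diversity between the samples
```

The same functions apply to either a species composition profile or a functional abundance profile, since both are vectors of non-negative abundances per sample.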

Claims

1. A metatranscriptome data analysis method based on high-throughput sequencing technology, characterized in that it comprises the following steps:

(1) quality screening is performed on the raw paired-end sequence data output by the high-throughput sequencer, to obtain a high-quality dataset that can be used for downstream metatranscriptomic analysis;

(3) metatranscriptome sequence assembly is performed separately for each sample to construct metatranscriptome contig and scaffold sequence sets, and gene prediction is performed to obtain a non-redundant protein sequence set;

(4) functional annotation is performed on the protein sequences using multiple commonly used databases, to obtain functional group abundance profiles at each level, and differential comparison analysis, metabolic pathway enrichment analysis, and cluster analysis are performed;

(5) taxonomic annotation is performed on the gene sequences to obtain species composition profiles at the species level and at finer levels below species, and differential comparison analysis, cluster analysis, species richness and evenness analysis, and association network analysis are performed;

(6) based on the functional abundance profiles and species composition profiles obtained above, Alpha and Beta diversity analysis can further be performed on the metatranscriptome samples, and key biomarkers in the metagenome are then screened out by means of multiple multivariate statistical methods;

(7) two-dimensional/three-dimensional charts are drawn with multiple data visualization and interactive tools, to present the above analysis results comprehensively and objectively;