CN115394360A

CN115394360A - Exhaustive analysis method for time series biological omics big data

Info

Publication number: CN115394360A
Application number: CN202210710202.XA
Authority: CN
Inventors: 张际峰; 杨士伟; 刘海涛; 汪承润; 李茂业; 蒋磊; 张国超; 刘芯茹; 孟静
Original assignee: Huainan Normal University
Current assignee: Huainan Normal University
Priority date: 2022-06-22
Filing date: 2022-06-22
Publication date: 2022-11-25
Anticipated expiration: 2042-06-22
Also published as: CN115394360B

Abstract

The invention discloses an exhaustive analysis method for time-series biological omics big data and belongs to the fields of bioinformatics and big data. It sets out the steps and implementation scheme for analyzing time-series biological big data, provides a specific analysis method, and offers a reference for trend analysis, segmentation analysis and interaction analysis of time-series biological big data in the biological field.

Description

Exhaustive analysis method for time series biological omics big data

Field of the invention

The invention belongs to the fields of bioinformatics and big data research within the life sciences, and particularly relates to an exhaustive analysis method for multiple types of biological omics data recorded as multi-point time series.

Background art

With the explosive release of high-throughput omics data, one class of omics data recorded in time order, namely time-series omics big data, is favored by more and more life science researchers. Because time-series data are continuous in time, all samples share the same background conditions except for time, so only a single variable, time, needs to be considered when comparing them. Such data are often used to study biological processes that span a period of time, such as the continuous growth of plants, the continuous invasion of hosts by viruses, and the continuous division of cells. Multi-point dynamic analysis of the biological process reveals how the object under study changes over time and what regularities it follows.

Current work on time-series omics big data analysis focuses mainly on transcriptome data. Common research methods include the Short Time-series Expression Miner (STEM), the K-means clustering algorithm, the Mfuzz algorithm and the like.

However, to our knowledge, these methods are rarely applied to data other than transcriptome data. They all rely on internal parameter settings to determine the total number of possible pattern types and do not exhaust all cases of the feature data under study, so some potentially significant data features are lost.

Based on this, the present invention provides a method for analyzing time-series feature data, including quantifiable data such as transcriptome, epigenome or proteome data; the study object can be a chromosome segment, gene, protein or non-coding RNA. The analysis comprises 3 types, namely trend analysis, segmentation analysis and interaction analysis. The possible analysis types are exhausted, and statistical methods are used to find the feature data that change significantly under each type. In this way the method discards the false, retains the true, and mines the dynamic rules hidden behind various kinds of time-series data in the big data context.

Summary of the invention

1. Problems to be solved by the invention

The invention aims to solve the following problems. First, from a general perspective, it provides an exhaustive analysis method for time-series biological omics big data, giving a comprehensive and systematic analysis scheme for time-series transcriptome, proteome and epigenome feature data. Second, the analysis scheme traverses the possible behaviors of the study objects between different time nodes and examines how they change over time from three angles: the global time angle of a single study object (trend analysis), the local time angle of a single study object (segmentation analysis), and comparison between paired study objects (interaction analysis), in order to find potential features and biological insights in the time-series data.

The method proposed in this patent helps address the following problems: few analysis methods exist for time-series omics big data, existing methods depend on subjective parameter selection, and existing methods seldom exhaust the data features across the time nodes.

2. Technical scheme

The invention provides an exhaustive analysis method for time series biological omics big data, which comprises the following specific implementation schemes:

(1) Data pre-processing

Time-series omics data may be obtained from public databases, such as transcriptome or methylome data in the GEO database, or directly from the identification results of biotechnology companies, such as protein mass spectrometry identification. The omics data analyzed by the method generally have no fewer than 3 time nodes, and the sample information corresponding to the different time nodes is consistent or identical.

After time series omics data are obtained, the corresponding data are preprocessed as follows:

(1) deleting probes whose feature data have more than 80% missing values;

(2) merging the same probes, and merging according to the mean value or the median value of the characteristic data;

(3) standardizing the probe data of different time sequences to ensure that the characteristic data of each sample has similar numerical distribution;

(4) performing targeted processing according to the nature of the study object; for example, methylation array data in epigenomics need to be converted into beta values and then normalized;

(5) for every time point except the first, dividing the feature value by the feature value of the preceding time point and log-transforming the ratio, using the natural logarithm or the logarithm to base 10 or 2; the result is defined as α;

(6) determining the fluctuation trend of a single study object by comparing α with a threshold, which may be a single value, such as α = 0, or a pair of opposite numbers, such as α = ±0.2;

(7) for the threshold α = 0, α > 0 means that the feature data rise between the two nodes under study, α < 0 means that they fall, and α = 0 means that they do not change; for a threshold given as a pair of opposite numbers, such as ±0.5, α > 0.5 means that the feature data are relatively up-regulated between the two nodes under study, α < -0.5 means that they are relatively down-regulated, and α ∈ [-0.5, 0.5] means that they are relatively unchanged.
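Steps (5) to (7) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `log_ratios` and `classify`, the labels "up", "down" and "unchanged", and the sample values are all assumptions made for the example, and the feature values are assumed to be strictly positive.

```python
import math

def log_ratios(values, base=10):
    """Return alpha for each pair of adjacent time points: log(v[i+1] / v[i])."""
    return [math.log(later / earlier, base)
            for earlier, later in zip(values, values[1:])]

def classify(alpha, threshold=0.5):
    """Label alpha against the threshold pair (-threshold, +threshold).

    With threshold=0 this reduces to the single-value case alpha = 0.
    """
    if alpha > threshold:
        return "up"
    if alpha < -threshold:
        return "down"
    return "unchanged"

# Hypothetical feature values at 0h, 12h, 24h and 48h.
values = [10.0, 100.0, 100.0, 1.0]
alphas = log_ratios(values, base=10)
trend = [classify(a, threshold=0.5) for a in alphas]
print(trend)  # ['up', 'unchanged', 'down']
```

A series with n time points yields n - 1 values of α, one per pair of adjacent nodes, which is why the later counts are expressed in terms of n - 1.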

(2) Use of the exhaustive analysis method

The exhaustive analysis method for time-series biological omics big data considers 3 aspects, implemented in the following specific steps:

(1) performing trend analysis of the feature data, measuring "all possible trends" of the feature data for all single study objects under global or local temporal perspective:

First, all possible trends are enumerated. The log ratio between two adjacent nodes can stand in one of 3 relations to a given threshold: greater than, less than, or equal to it. The trend between two time nodes therefore has three possibilities: rising, falling or unchanged.

Second, all possible trends are counted. Since each pair of adjacent time nodes has three possibilities, the total number of cases over all time nodes is 3 raised to the power of the total number of nodes minus 1. The case in which nothing changes at any time, which may have no relation to the time course, is then subtracted to give the final number of trends.

Third, trends with "significant features" are selected. Depending on the needs of the analysis, trends can be screened by their number or by their particular shape. By number, the most frequent trends, or the top 5% of trends, can be selected; by shape, trends such as consistently rising, consistently falling, rising then falling, or falling then rising can be selected for analyzing the biological process.
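The enumeration and counting of trends can be sketched as follows. This is an illustrative sketch under assumptions: the names `all_trends` and `gene_trends`, the step labels, and the example genes are hypothetical, and ranking trends by how many study objects follow them is only one possible reading of screening "by number".

```python
from collections import Counter
from itertools import product

STATES = ("up", "down", "unchanged")

def all_trends(n_nodes):
    """List every trend for n_nodes time points, excluding the all-unchanged case."""
    steps = n_nodes - 1
    flat = ("unchanged",) * steps
    return [t for t in product(STATES, repeat=steps) if t != flat]

trends = all_trends(4)
print(len(trends))  # 26, i.e. 3**(4 - 1) - 1

# Hypothetical per-gene trends, e.g. obtained by thresholding alpha.
gene_trends = {
    "geneA": ("up", "up", "up"),
    "geneB": ("up", "up", "up"),
    "geneC": ("down", "up", "down"),
    "geneD": ("unchanged", "unchanged", "unchanged"),  # excluded: never changes
}
valid = set(trends)
counts = Counter(t for t in gene_trends.values() if t in valid)
print(counts.most_common(1))  # [(('up', 'up', 'up'), 2)]
```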

(2) A segmentation analysis of the feature data is performed measuring "all possible trends or interactions" of the feature data at the local temporal perspective for all single study objects.

First, all possible time segmentations are listed. The method considers only a single cut point and does not consider two or more, because the possibilities for the feature data become much more complex when there are two or more cut points. If two cut points are actually required, the beginning of a segment can be taken as the start of the time-series data. This helps reduce the complexity of the analysis.

Second, the possible segments are counted. In the segmentation analysis of multiple time nodes, the total is determined by the number of interior time nodes that can serve as cut points. The number of cut points is the total number of time nodes minus 2, and since each cut point yields two segments, the total number of segments is twice the number of cut points.

Third, depending on the segmentation, a time period of interest can be examined, or further analysis can be carried out based on the monotonicity and shape of the trend, for example by examining differences in number or in "sub-trends" between time periods.
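The single-cut segmentation can be sketched as follows. This is a minimal sketch, assuming that each interior node splits the series into a leading segment that starts at the first node and a trailing segment that ends at the last node; the function name and the generic node labels `t1` to `t5` are hypothetical.

```python
def single_cut_segmentations(nodes):
    """For each interior node, split the series into a leading and a trailing segment."""
    splits = []
    for cut in nodes[1:-1]:  # n - 2 possible cut points
        splits.append(((nodes[0], cut), (cut, nodes[-1])))
    return splits

nodes = ["t1", "t2", "t3", "t4", "t5"]
splits = single_cut_segmentations(nodes)
for leading, trailing in splits:
    print(leading, trailing)

n_segments = 2 * len(splits)
print(n_segments)  # 6, i.e. 2 * (5 - 2)
```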

(3) Performing an interaction analysis of the feature data, measuring the "interaction" of the feature data at global or local time angles between pairs or two groups of subjects:

First, the possible types of interaction are considered. They fall into two categories, "antagonism" and "synergy". The total number of antagonistic relationship pairs in the interaction analysis is half of the total number of trends in the trend analysis.

Second, the interaction analysis can be performed either between two single study objects or between two groups of study objects. "Synergistic" interactions can be screened further among objects that share the same trend in the trend analysis results, while "antagonistic" interactions are screened among objects with "symmetric" trends.
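The pairing logic can be sketched as follows. This is an illustration under assumptions, not the patented implementation: a "symmetric" trend is taken to be the step-by-step mirror image (every rise becomes a fall and vice versa), and the names `mirror`, `relation` and the example genes are hypothetical.

```python
from itertools import combinations, product

MIRROR = {"up": "down", "down": "up", "unchanged": "unchanged"}

def mirror(trend):
    """Return the symmetric trend: every rise becomes a fall and vice versa."""
    return tuple(MIRROR[step] for step in trend)

def relation(trend_a, trend_b):
    """Label a pair of trends as 'synergy', 'antagonism' or None."""
    if all(step == "unchanged" for step in trend_a):
        return None  # the all-unchanged case is not counted as a trend
    if trend_a == trend_b:
        return "synergy"
    if trend_a == mirror(trend_b):
        return "antagonism"
    return None

# Symmetric trend pairs for 4 time nodes (3 steps between adjacent nodes).
trends = [t for t in product(MIRROR, repeat=3) if set(t) != {"unchanged"}]
pairs = {frozenset((t, mirror(t))) for t in trends}
print(len(trends), len(pairs))  # 26 13

# Hypothetical per-gene trends.
gene_trends = {
    "gene1": ("up", "up", "up"),
    "gene2": ("down", "down", "down"),
    "gene3": ("up", "up", "up"),
}
for a, b in combinations(sorted(gene_trends), 2):
    print(a, b, relation(gene_trends[a], gene_trends[b]))
```

Because every trend other than the all-unchanged case differs from its own mirror image, the trends pair off exactly, which is why the number of antagonistic pairs is half the number of trends.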

3. Advantageous effects

The exhaustive analysis method for the time series biological omics big data provided by the invention has the following specific beneficial effects:

(1) The method includes trend analysis, which comprehensively sorts out the global or local trends of every single study object in the feature data and provides an effective scheme for screening trends with specifically required features and for finding the feature data that correspond to a biological process.

(2) The method can split the time series into segments, giving a segmentation analysis. Local analysis can be performed in a time period of interest, or the periods before and after a cut point can be compared; the local analysis can itself include trend analysis and interaction analysis. This provides an approach for finding suitable time-series data features locally.

(3) The method includes interaction analysis, which can screen out interactions that may exist between paired study objects, or between two groups of study objects, in time-series data. This helps screen for possible physical interactions between genes, chromosome fragments or proteins on the basis of the data and provides an analysis scheme for research on the interactions of biological macromolecules.

Drawings

FIG. 1 shows the trends of time-series variation of the feature data in the present invention.

With 4 time nodes (FIG. 1A-N) or 5 time nodes (FIG. 1O), the natural logarithm of the ratio of each time node to the adjacent preceding time node is compared with a user-defined α threshold, giving the diagrams of possible cases and of possible segmentation points.

Reference labels in the figure: A: the case in which the values at the four time nodes 0 h (hours), 12 h, 24 h and 48 h never change; B-N: the 26 possible trends for the four time nodes (numbered, for example, P2a), comprising 13 symmetric trend pairs (one pair per panel, for example FIG. 1B), which can be used for "antagonistic" relationship analysis of pairs in the interaction analysis; O: the 6 possible segments for 5 time nodes.

Detailed Description

For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings which illustrate examples of the invention.

Examples

For time-series transcriptome data with four time nodes (0 h, 12 h, 24 h and 48 h), this embodiment describes in detail the 3 analysis schemes of the method, namely trend analysis, segmentation analysis and interaction analysis, as follows:

(1) Obtaining time series transcriptome data

The time-series transcriptome data in this embodiment can be obtained from the GEO public data repository. When screening data, each time point should be described unambiguously and should have enough biological replicates, and the number of time nodes should be no fewer than two; this example selects RNA-seq data with four time nodes (0 h, 12 h, 24 h and 48 h).

(2) Data pre-processing

For the transcriptome feature data in this embodiment, missing values are first filled in, probe names are merged, and the data are normalized. The transcriptome values at each time node are then averaged, the ratio between successive adjacent time nodes is calculated, and its base-10 logarithm is taken. If the gene name is X and the time nodes are i and i+1, the log ratio is written X_{(i+1)/i}; these values are collectively referred to as α. Further, ±0.5 is set as the threshold for judging fluctuation in the transcription level of each gene.
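The averaging and ratio step of this embodiment can be sketched as follows. This is an illustrative sketch under assumptions: the input layout (`rows` as pairs of gene name and replicate values per time node), the sample values, and the choice to merge rows sharing a gene name by their mean are all hypothetical, and missing-value filling and normalization are assumed to have been done already.

```python
import math
from collections import defaultdict
from statistics import mean

nodes = ["0h", "12h", "24h", "48h"]

# Hypothetical input: two probe rows for gene X, each with two replicates per node.
rows = [
    ("X", {"0h": [9.0, 11.0], "12h": [95.0, 105.0], "24h": [100.0, 100.0], "48h": [1.0, 1.0]}),
    ("X", {"0h": [12.0, 8.0], "12h": [110.0, 90.0], "24h": [100.0, 100.0], "48h": [1.0, 1.0]}),
]

# Average the replicates of each row at every time node.
profiles = defaultdict(list)
for gene, replicates in rows:
    profiles[gene].append([mean(replicates[node]) for node in nodes])

# Merge rows with the same gene name by their mean, then take log10 ratios.
alpha = {}
for gene, probe_means in profiles.items():
    merged = [mean(column) for column in zip(*probe_means)]
    alpha[gene] = [math.log10(later / earlier)
                   for earlier, later in zip(merged, merged[1:])]

print([round(a, 3) for a in alpha["X"]])  # [1.0, 0.0, -2.0]
```

With the ±0.5 threshold, the three values for gene X would be read as rising, unchanged and falling.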

(3) Exhaustive analysis method

The first is trend analysis. For a single gene X, X_{(i+1)/i} is compared with ±0.5 and the fluctuation between time nodes is plotted accordingly. X_{(i+1)/i} > 0.5 and X_{(i+1)/i} < -0.5 are recorded as rising and falling respectively, while values between -0.5 and 0.5 are recorded as unchanged. Traversing every gene in the data in this way gives the variation types of all genes for 4 time nodes; the total number of types is 3 to the power of 3, namely 27. In the trend analysis, the total number of trends is 27 - 1 = 26.

Consider the monotonically increasing and monotonically decreasing trends. In FIG. 1K, P11a is the monotonically increasing trend and P11b is the monotonically decreasing trend. Monotonic trend analysis identifies genes that keep rising or keep falling at every time node. Compared with the conventional approach of averaging the expression levels over all points and then judging whether gene expression is up-regulated or down-regulated, this method is more fine-grained.

As another example, P8a and P8b in FIG. 1H represent gene groups of the rising-then-falling and falling-then-rising types respectively; their expression levels change markedly between time nodes. If the differences between time points are ignored, as in the conventional analysis mode, these two trends may not show any significant fluctuation.
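The point can be illustrated with a small numeric sketch. The profile below is hypothetical, and comparing only the first and last time points is used here as an assumed stand-in for an analysis that ignores the intermediate nodes; it is not a description of any specific conventional tool.

```python
import math

THRESHOLD = 0.5

def label(alpha):
    """Label a log ratio as 'up', 'down' or 'unchanged' against +/-THRESHOLD."""
    if alpha > THRESHOLD:
        return "up"
    if alpha < -THRESHOLD:
        return "down"
    return "unchanged"

# Hypothetical rise-then-fall profile at 0h, 12h, 24h and 48h.
values = [10.0, 100.0, 100.0, 10.0]

stepwise = [label(math.log10(later / earlier))
            for earlier, later in zip(values, values[1:])]
endpoint = label(math.log10(values[-1] / values[0]))

print(stepwise)  # ['up', 'unchanged', 'down']
print(endpoint)  # unchanged
```

The stepwise labels expose the rise and the fall, whereas the endpoint comparison alone reports no change.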

The second is segmentation analysis. According to the possible segment types in FIG. 1O, for 5 time nodes (defined here as 0 h, 12 h, 24 h, 48 h and 72 h) there are 3 cut points and 3 × 2 = 6 segments, namely: the 0h-12h and 12h-72h segments, the 0h-24h and 24h-72h segments, and the 0h-48h and 48h-72h segments.

For P13a and P13b in FIG. 1M, the 0h-12h and 12h-48h segments show clearly opposite monotonicity: P13a increases monotonically over 0h-12h and decreases monotonically over 12h-48h, and P13b does the opposite. Likewise, P12a and P12b in FIG. 1L have opposite monotonic trends in the 0h-24h and 24h-48h segments. Compared with a global analysis, local segmentation analysis helps uncover more significant features within local time periods.
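Checking monotonicity on either side of a cut point can be sketched as follows. This is a minimal sketch: the two trends are hypothetical examples of a symmetric pair and are not taken from FIG. 1, the function names are assumptions, and "unchanged" steps are ignored so that monotonicity is treated as non-strict.

```python
def monotonicity(steps):
    """Summarize a run of step labels; 'unchanged' steps are ignored (non-strict)."""
    kinds = set(steps) - {"unchanged"}
    if not kinds:
        return "flat"
    if kinds == {"up"}:
        return "increasing"
    if kinds == {"down"}:
        return "decreasing"
    return "mixed"

def segment_monotonicity(steps, nodes, cut):
    """Split the step labels at the cut node and summarize each segment."""
    k = nodes.index(cut)  # steps[:k] lie before the cut, steps[k:] after it
    return monotonicity(steps[:k]), monotonicity(steps[k:])

nodes = ["0h", "12h", "24h", "48h"]
# Hypothetical symmetric trend pair, one label per pair of adjacent nodes.
trend_a = ("up", "down", "down")
trend_b = ("down", "up", "up")

print(segment_monotonicity(trend_a, nodes, "12h"))  # ('increasing', 'decreasing')
print(segment_monotonicity(trend_b, nodes, "12h"))  # ('decreasing', 'increasing')
```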

The third is interaction analysis. Building on the results of the trend analysis and segmentation analysis, one or more genes of interest are selected and evaluated for interactions. In a time series the analysis can be global or local: FIG. 1K-N show trends with marked global fluctuation, for which interactions can be global, while FIG. 1B-J show trends with marked local fluctuation.

In FIG. 1K, P11a and P11b may have significant "mutually inhibiting" antagonistic interactions, whereas within the P11a group or within the P11b group, "mutually promoting" synergistic interactions can be screened for. These interactions may hold for a pair of genes or for two groups of genes. In this embodiment, four gene pairs show clear interactions during infection of the host insect, the greater wax moth, by Beauveria bassiana: BBA_05021 with BBA_08187 and BBA_02297 with BBA_00032 show antagonistic interaction, and BBA_05635 with BBA_00807 and BBA_02196 with BBA_07954 show synergistic interaction.

Claims

1. An exhaustive analysis method for time-series biological omics big data, characterized in that the feature data to be analyzed comprise quantifiable data such as transcriptome, epigenome or proteome data, and the study object can be a chromosome segment, gene, protein, non-coding RNA or the like. When the number n of time-series nodes of the omics big data is equal to or larger than 3, the exhaustive analysis of the feature data comprises 3 types: analysis of a single study object from the global time angle (trend analysis), analysis of a single study object from the local time angle (segmentation analysis), and comparative analysis between paired study objects (interaction analysis).

An exhaustive analysis method for time-series biological omics big data, characterized in that data preprocessing is required before the feature data are analyzed. Specifically, after normalization, for every time node except the first, the feature value is divided by the value at the preceding time node and the ratio is log-transformed. The logarithm is chosen according to the nature of the data: the natural logarithm, or base 10 or base 2. The result is defined as α. The α threshold can be set according to the amplitude of the feature data, either as a single value, such as α = 0, or as a pair of opposite numbers, such as α = ±0.2.

An exhaustive analysis method for time-series biological omics big data, characterized in that the total number of variation types that the feature data can take follows a regular pattern and satisfies an exponential equation. Under the above processing of the feature data, the total number Z of possible variation types for feature data with n nodes satisfies formula 1:

Z(n) = 3^{n-1}    (formula 1)

It follows that when the time-series data have n = 4 nodes, the total number of possible variation types of the feature data is 27.

2. An exhaustive analysis method for time-series biological omics big data, characterized in that the total number and the types of trend changes of the feature data are determinate. The total number of trend changes follows a regular pattern and is linearly related to the total number of variation types. The trend changes of the feature data satisfy formula 2:

T(n) = Z(n) - 1    (formula 2)

As can be seen from formula 2, when n = 4, the total number of possible trend changes of the time-series data is 26; these trend changes are shown in FIG. 1B-N.

3. An exhaustive analysis method for time-series biological omics big data, characterized in that the number of cut points and the number of segment types in the segmentation analysis of the feature data are determinate. When the number of nodes is n, there are n - 2 cut points, as shown in FIG. 1O, and the total number of segment types satisfies formula 3:

D(n) = 2(n-2)    (formula 3)

The possible segment types for the segmentation analysis are obtained from formula 3, and one or more segment types, or pairs of segment types, can be selected as required for the other two types of analysis described in the present application.

4. An exhaustive analysis method for time-series biological omics big data, characterized in that the number of pairs of "antagonistic" interaction types in the interaction analysis of the feature data follows a regular pattern and the interaction analysis has biological significance. When the number of nodes is n, the total number of pairs of "antagonistic" interaction types satisfies formula 4:

P(n) = T(n)/2    (formula 4)

where T(n) is the total number of trend changes obtained from formula 2. In the interaction analysis, when the feature data of study objects follow symmetric or identical trends, the study objects are presumed to be possibly "antagonistic" or "synergistic", respectively, in a biological sense.

5. An exhaustive analysis method for time-series biological omics big data, characterized in that, among the trend analysis, segmentation analysis and interaction analysis of the time-series data, the segmentation analysis allows trend analysis and interaction analysis to be performed within a smaller time range; the interaction analysis is based on the trend analysis, and the segmentation also affects its result.

6. An exhaustive analysis method for time-series biological omics big data, characterized in that all possible cases of the trend analysis, segmentation analysis and interaction analysis are traversed for the time-series data so that the possible features of the data are exhausted, and, after one or more kinds of time-series feature data have been analyzed, the biological meaning of the significant features is studied.

7. An exhaustive analysis method for time-series biological omics big data, characterized in that when the analysis method is applied to time-series transcriptome data from the process in which Beauveria bassiana infects the host insect, the greater wax moth, four gene pairs are found to have clear interaction relationships, specifically: BBA_05021 with BBA_08187 and BBA_02297 with BBA_00032 show antagonistic interaction, and BBA_05635 with BBA_00807 and BBA_02196 with BBA_07954 show synergistic interaction.