CN112200270B

CN112200270B - Data partition filling method for correcting high-throughput omics data loss

Info

Publication number: CN112200270B
Application number: CN202011285428.7A
Authority: CN
Inventors: 刘骁; 冀树伸
Original assignee: Jin Fu Kang Biotechnology Shanghai Ltd By Share Ltd
Current assignee: Jin Fu Kang Biotechnology Shanghai Ltd By Share Ltd
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2022-12-20
Anticipated expiration: 2040-11-17
Also published as: CN112200270A

Abstract

The invention discloses a data partition filling method for correcting high-throughput omics data loss, which comprises the following steps of: calculating partition critical values Blow and Bup according to the grouping condition and the data detection distribution condition of the high-throughput omics data expression matrix to realize the partition of data; sorting the data according to the missing amount from more to less, and dividing the data into three partitions of real missing, unstable missing and technical missing according to the partition critical point; and filling the data of the three partitions by using corresponding filling algorithms respectively. The invention can ensure that the data filling is closer to reality, on one hand, the negative influence of data distortion on data grouping can be reduced, and on the other hand, the problem of excessive data remodeling caused by using a single filling algorithm is avoided; experiments show that the method has strong data restoration robustness, and compared with other methods, the data grouping result filled by the method is closest to a real result, and the effectiveness of the method is proved.

Description

Data partition filling method for correcting high-throughput omics data loss

Technical Field

The invention belongs to the field of biological information data analysis, and particularly relates to a data partition filling method for correcting high-throughput omics data loss.

Background

High-throughput omics technology emerged after the year 2000, has gradually become one of the most important means of studying the molecular world, and is widely applied in life science research fields such as genomics, transcriptomics and proteomics. However, due to the high sensitivity of high-throughput detection instruments, and the random fluctuation and time-dependent nature of biomolecules, some biomolecules are often not detected, i.e., their detection value is zero or close to zero. When a machine learning model is trained on a data set containing many missing values, the missing values can greatly degrade the performance of the model and can lead to misinterpretation of biological significance. How to recover the missing data and restore its real expression as far as possible is an important challenge in omics data analysis.

At present, filling algorithms for high-throughput omics data all use a single fixed mode; common filling algorithms include mean filling, median filling and KNN filling. In practical experiments, however, molecular detection values go missing for several reasons, the common ones being: 1) real missing: the molecule does not exist in the sample; 2) unstable missing: the expression of the molecule is unstable, so it is detected in some samples and not in others; 3) technical missing: owing to instability of the detection instrument, a molecule that is detected in most samples has an empty detection value in some samples. In this context, a single fixed mode cannot meet the computational requirements, whereas filling adaptively with a combination of algorithms suited to the different cases can effectively avoid filling distortion.

Disclosure of Invention

The invention aims to overcome the problems in the prior art by providing a data partition filling method for correcting high-throughput omics data loss. Distribution models are established for the data loss caused by different factors in high-throughput omics experimental big data, and each model is filled using a different algorithm. The loss caused by unstable molecular expression is the difficult case for a filling algorithm; here the data are filled without overfitting by means of a Bayesian algorithm based on the overall missing probability and the intra-group missing probability of the data.

In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:

a data partition filling method for correcting high-throughput omics data loss, the method comprising the following steps:

step one: calculating partition critical values Blow and Bup according to the grouping condition and the data detection distribution condition of the high-throughput omics data expression matrix to realize the partition of data;

step two: sorting the data according to the missing amount from more to less, and dividing the data into three partitions of real missing, unstable missing and technical missing according to the partition critical values;

step three: filling the data of the three partitions by using corresponding filling algorithms respectively.
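The partitioning in step two can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `assign_partition`, the partition labels and the toy detection rates are hypothetical, the thresholds 0.15 and 0.5 are borrowed from the worked example later in the description, and treating a detection rate exactly equal to a threshold as "unstable" is an assumption, since the text only specifies the strict inequalities.

```python
def assign_partition(detection_rate, b_low, b_up):
    """Map a molecule's within-group detection rate to one of three partitions."""
    if detection_rate < b_low:
        return "real_missing"        # rarely detected: treated as genuinely absent
    if detection_rate > b_up:
        return "technical_missing"   # usually detected: gaps attributed to the instrument
    return "unstable_missing"        # in between (boundary handling is an assumption)

# Hypothetical detection rates of four molecules in one group.
rates = {"m1": 0.05, "m2": 0.30, "m3": 0.90, "m4": 0.50}

# Ascending detection rate puts the molecules with the most missing values first.
ordered = sorted(rates, key=rates.get)
partitions = {m: assign_partition(rates[m], b_low=0.15, b_up=0.5) for m in ordered}
print(partitions)
```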

Further, the specific steps of the first step are as follows:

(1) Calculating the detection rate of each molecule in each group of high throughput omics data expression matrices: the detection rate of the molecules in the group i = the number of samples of which the detection value is not 0/the total number of samples of the group i;

(2) Calculating partition critical values Blow and Bup: for each group of the high-throughput omics data expression matrix, dividing all molecules detected in the samples of the group into three clusters according to their detected expression quantity by using a k-means algorithm, and calculating the median of the detection rates of the molecules contained in each cluster, wherein the minimum and the maximum of these medians are the partition critical values Blow and Bup.
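The two sub-steps above can be sketched in plain Python. Several things here are assumptions rather than part of the text: the input layout (one dict per group, mapping molecule to a list of per-sample values with 0 meaning undetected), the use of the mean of the non-zero values as the "detected expression quantity", and the small hand-written one-dimensional k-means with a deterministic start, which stands in for whatever k-means implementation is actually used. All function names are hypothetical.

```python
from statistics import median

def detection_rates(group):
    """group: molecule -> list of detection values for the samples of one group."""
    return {m: sum(v != 0 for v in vals) / len(vals) for m, vals in group.items()}

def mean_nonzero(vals):
    """Assumed expression measure: mean of the detected (non-zero) values."""
    hits = [v for v in vals if v != 0]
    return sum(hits) / len(hits) if hits else 0.0

def kmeans_1d(values, k=3, iters=100):
    """Plain Lloyd's k-means on scalars; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted values.
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        new = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centres[j])
        if new == centres:
            break
        centres = new
    return labels

def partition_thresholds(group):
    """Return (Blow, Bup): min and max of the per-cluster median detection rates."""
    rates = detection_rates(group)
    molecules = list(group)
    labels = kmeans_1d([mean_nonzero(group[m]) for m in molecules], k=3)
    medians = [
        median(rates[m] for m, lab in zip(molecules, labels) if lab == j)
        for j in set(labels)
    ]
    return min(medians), max(medians)

# Hypothetical group of four samples and six molecules.
toy = {
    "a": [0, 0, 0, 1.0], "b": [0, 0, 2.0, 0],
    "c": [10.0, 0, 12.0, 0], "d": [0, 9.0, 11.0, 0],
    "e": [100.0, 110.0, 0, 90.0], "f": [95.0, 105.0, 100.0, 100.0],
}
b_low, b_up = partition_thresholds(toy)
```

On real data the clusters are unlikely to be this cleanly separated, so the initialisation and the expression measure would need to be chosen with care.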

Further, the third step comprises the following specific steps:

(1) Filling of real missing values: molecules whose detection rate is less than the minimum critical value are not filled;

(2) Filling of unstable missing values:

for missing values caused by unstable expression of the molecule itself, the number of samples to be filled is first predicted using a Bayesian algorithm and the filling is then carried out. First, the potential missing rate missp of the molecule in the group is calculated using the formula:

missp = PA × PBA / (PBA × PA + 0.05 × (1 − PA)),

where PBA is the intra-group missing rate of the molecule in the data set and PA is the overall missing rate of the molecule in the data set. Next, the number IN of samples to be filled for the molecule in the group is calculated using the formula:

IN = min(Mj/2, (1 − missp) × Mi),

where Mi is the number of samples in the group in which the molecule is not detected, and Mj is the number of samples in the group in which it is detected. Finally, IN samples are selected at random from the samples in the group whose detection value is 0, and these are filled with the non-zero minimum value in the group;

(3) Filling of technical missing values: for molecules whose detection rate is greater than the maximum critical value, missing values are filled with the median of the detection values of the molecule in the group.
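The three filling rules can be sketched as a single function operating on one molecule within one group. This is an illustration under stated assumptions, not the patented implementation: the function names, the partition labels and the toy values are hypothetical; truncating IN to an integer is an assumption, because the text does not specify how a fractional IN is rounded; and the fixed random seed is only there to make the sketch repeatable.

```python
import random
from statistics import median

def potential_missing_rate(pa, pba, p_false=0.05):
    """missp = PA*PBA / (PBA*PA + 0.05*(1 - PA)), following the formula in the text."""
    return pa * pba / (pba * pa + p_false * (1 - pa))

def fill_count(missp, n_missing, n_detected):
    """IN = min(Mj/2, (1 - missp) * Mi); truncation to int is an assumption."""
    return int(min(n_detected / 2, (1 - missp) * n_missing))

def fill_group(values, partition, pa, rng=None):
    """Return a filled copy of one molecule's values in one group (0 = missing)."""
    rng = rng or random.Random(0)
    detected = [v for v in values if v != 0]
    missing_idx = [i for i, v in enumerate(values) if v == 0]
    out = list(values)
    if partition == "real_missing" or not detected or not missing_idx:
        return out  # real missing values are left unfilled
    if partition == "technical_missing":
        fill = median(detected)  # median of the detected values in the group
        for i in missing_idx:
            out[i] = fill
        return out
    # Unstable missing: fill only IN randomly chosen zeros with the non-zero minimum.
    pba = len(missing_idx) / len(values)  # intra-group missing rate
    n = fill_count(potential_missing_rate(pa, pba), len(missing_idx), len(detected))
    for i in rng.sample(missing_idx, n):
        out[i] = min(detected)
    return out

# Hypothetical molecule: 4 of 10 samples undetected, overall missing rate 0.1.
values = [0, 0, 0, 0, 4.0, 6.0, 5.0, 8.0, 7.0, 9.0]
filled = fill_group(values, "unstable_missing", pa=0.1)
print(filled)
```

Note that with this formula a higher intra-group missing rate raises missp and therefore lowers the number of samples filled, and that IN never exceeds half the number of detected samples.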

The invention has the beneficial effects that:

according to the technical characteristics of high-throughput omics detection, a regression algorithm is used for establishing a data loss model, data are partitioned according to three conditions of real loss, unstable loss and technical loss, and then data filling calculation is performed by respectively using a minimum value algorithm, a Bayes algorithm and a median algorithm; therefore, the data filling is closer to reality, on one hand, the negative influence of data distortion on data grouping can be reduced, and on the other hand, the problem of excessive data remodeling caused by using a single filling algorithm is avoided; experiments show that the method has strong data restoration robustness, and compared with other methods, the data grouping result filled by the method is closest to a real result, and the effectiveness of the method is proved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a graph of the correlation regression fit trend of the detection rate and expression level in each of the grouped samples according to the present invention;

FIG. 3 is a graph showing a comparison of the number of proteins before filling the sample in the present invention;

FIG. 4 is a graph showing a comparison of the number of proteins in a sample filled in the present invention;

FIG. 5 is a diagram illustrating the clustering of samples before data padding according to the present invention;

FIG. 6 is a diagram illustrating the clustering of the samples after data padding according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, a data partition filling method for correcting high-throughput omics data loss comprises the following steps:

step one: calculating partition critical values Blow and Bup according to the grouping condition and the data detection distribution condition of the high-throughput omics data expression matrix, thereby realizing the partition of data, wherein the specific steps are as follows:

(2) Calculating partition critical values Blow and Bup: for each group of the high-throughput omics data expression matrix, dividing all molecules detected in the samples of the group into three clusters according to their detected expression quantity by using a k-means algorithm, and calculating the median of the detection rates of the molecules contained in each cluster, wherein the minimum median and the maximum median are the partition critical values Blow and Bup;

step two: sorting the data according to the missing amount from more to less, and dividing the data into three partitions of real missing, unstable missing and technical missing according to a partition critical value;

step three: filling the data of the three partitions by using corresponding filling algorithms respectively, and specifically comprising the following steps of:

(2) Filling of unstable missing values:

for missing values caused by unstable expression of the molecule itself, the number of samples to be filled is first predicted using a Bayesian algorithm and the filling is then carried out. First, the potential missing rate missp of the molecule in the group is calculated using the formula:

missp = PA × PBA / (PBA × PA + 0.05 × (1 − PA)),

where PBA is the intra-group missing rate of the molecule in the data set and PA is the overall missing rate of the molecule in the data set. Next, the number IN of samples to be filled for the molecule in the group is calculated using the formula:

IN = min(Mj/2, (1 − missp) × Mi),

where Mi is the number of samples in the group in which the molecule is not detected, and Mj is the number of samples in the group in which it is detected. Finally, IN samples are selected at random from the samples in the group whose detection value is 0, and these are filled with the non-zero minimum value in the group;

(3) Filling of technical missing values: for molecules whose detection rate is greater than the maximum critical value, missing values are filled with the median of the detection values of the molecule in the group.
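The missp formula has the same shape as Bayes' theorem, which may explain the reference to a Bayesian algorithm. The correspondence below is one possible reading and is not stated in the text: PA plays the role of the prior P(A), PBA the role of the likelihood P(B | A), and the constant 0.05 the role of P(B | not A). The numerical values are hypothetical, and the text does not say how a fractional IN is rounded.

```latex
\[
\mathit{missp}
  = \frac{P_A \, P_{BA}}{P_{BA}\, P_A + 0.05\,(1 - P_A)},
\qquad
P(A \mid B)
  = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,\bigl(1 - P(A)\bigr)}.
\]
% Hypothetical values: P_A = 0.1, P_{BA} = 0.4, M_i = 4, M_j = 6
\[
\mathit{missp}
  = \frac{0.1 \times 0.4}{0.4 \times 0.1 + 0.05 \times 0.9}
  = \frac{0.04}{0.085} \approx 0.471,
\qquad
\mathit{IN}
  = \min\!\left(\tfrac{6}{2},\ (1 - 0.471) \times 4\right)
  = \min(3,\ 2.12) \approx 2.12 .
\]
```

Under these hypothetical values about two of the four undetected samples would be filled, and the cap Mj/2 = 3 is not reached.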

Taking the proteome data of the blood sample of the liver cancer patient as an example, g1 to g7 represent 7 different disease states and stages in clinic respectively:

1. The clinical samples are subjected to proteomics experiments on a mass spectrometer; the signal value of each detected protein in each sample is recorded in the data matrix produced by the mass spectrometry analysis, and entries with no detected signal are recorded as 0, i.e. as missing values.

2. Calculating the correlation between the detection rate and the expression quantity of each protein in each grouped sample, and performing regression fitting using locally weighted polynomial regression;

FIG. 2 shows the regression fitting trend graph of the correlation between the detection rate and the expression quantity in each grouped sample, in which black dots represent the individual detected proteins, the abscissa is the missing rate of each protein, the ordinate is the expression value of each protein, line A is the fitted expression curve, and lines B and C are the two partition values Blow and Bup. Overall, the detection rate and the expression quantity of the proteins are positively correlated, but the fitted curve shows three sections: proteins in the section with a low detection rate are hardly expressed, which is the low-expression section; in the middle section the fitted curve rises very quickly, at about 45 degrees, which is the transition section; and once the detection rate exceeds the upper critical value the expression quantity rises steadily, which is the stable section. These three sections correspond to real missing, unstable missing and technical missing, respectively.

Missing values for proteins in the low-expression section are attributed to an expression quantity too low for the sensitivity of the mass spectrometer, so the protein cannot be detected. Missing values for proteins in the stable section are attributed to insufficient accuracy of the mass spectrometer, so that an expressed protein goes undetected. Missing values lying between these two regimes, namely those of proteins in the intermediate transition section, are attributed to unstable detection caused by unstable expression of the protein itself.

3. Dividing the data into 3 clusters using the k-means algorithm and calculating the median of the detection rates of the molecules contained in each cluster; the minimum and maximum medians, namely the partition critical values Blow and Bup, are 0.15 and 0.5 respectively;

4. Partitioning: sorting the data according to the missing amount, and dividing the data into three partitions of real missing, unstable missing and technical missing according to the two missing rate partition values of 0.15 and 0.5;

5. Filling the data of the different partitions using different filling algorithms to obtain the filled, recovered data matrix;

5.1 Filling of real missing values: molecules whose detection rate is less than the minimum critical value are not filled;

5.2 Filling of unstable missing values:

for missing values caused by unstable expression of the molecule itself, the number of samples to be filled is first predicted using a Bayesian algorithm and the filling is then carried out. First, the potential missing rate missp of the molecule in the group is calculated using the formula missp = PA × PBA / (PBA × PA + 0.05 × (1 − PA)), where PBA is the intra-group missing rate of the molecule in the data set and PA is the overall missing rate of the molecule in the data set. Next, the number IN of samples to be filled for the molecule in the group is calculated using the formula IN = min(Mj/2, (1 − missp) × Mi), where Mi is the number of samples in the group in which the molecule is not detected, and Mj is the number of samples in the group in which it is detected. Finally, IN samples are selected at random from the samples in the group whose detection value is 0, and these are filled with the non-zero minimum value in the group;

5.3 Filling of technical missing values: for molecules whose detection rate is greater than the maximum critical value, missing values are filled with the median of the detection values of the molecule in the group;

FIGS. 3 and 4 compare the number of proteins before and after filling. After filling, the differences in the number of detected proteins among the samples of each group become smaller and the variation between groups is relatively stable. This agrees with clinical logic: samples of the same group belong to the same clinical stage, so large differences in the number of expressed proteins should not occur;

FIGS. 5 and 6 show the sample clustering maps before and after filling. Before data filling, an approximate clustering tendency of the samples by group can be seen, but samples of different groups are interspersed, which is a clustering error. After data filling, samples belonging to the same group gather on the same branch, and the order in which the branches are associated accords with the clinical disease progression logic of the groups, so that the clinical meaning of each group can be accurately explained.
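The comparison of detected-protein counts before and after filling can be checked with a few lines of code. The sketch below is only an illustration: the function names and the data layout (sample to protein to value, with 0 meaning missing) are assumptions, and the two toy matrices are made-up values, not data from the liver cancer example.

```python
def detected_counts(matrix):
    """matrix: sample -> {protein: value}; returns sample -> number of non-zero proteins."""
    return {s: sum(v != 0 for v in row.values()) for s, row in matrix.items()}

def count_spread(counts, samples):
    """Largest minus smallest detected-protein count among the given samples."""
    vals = [counts[s] for s in samples]
    return max(vals) - min(vals)

# Hypothetical group of three samples, before and after filling (0 = missing).
before = {
    "s1": {"p1": 5.0, "p2": 0, "p3": 2.0},
    "s2": {"p1": 4.0, "p2": 3.0, "p3": 2.5},
    "s3": {"p1": 0, "p2": 0, "p3": 1.5},
}
after = {
    "s1": {"p1": 5.0, "p2": 3.0, "p3": 2.0},
    "s2": {"p1": 4.0, "p2": 3.0, "p3": 2.5},
    "s3": {"p1": 4.5, "p2": 0, "p3": 1.5},
}
group = ["s1", "s2", "s3"]
spread_before = count_spread(detected_counts(before), group)
spread_after = count_spread(detected_counts(after), group)
print(spread_before, spread_after)
```

A smaller within-group spread after filling is the pattern the text describes for samples of the same clinical stage.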

In the description herein, references to the description of "one embodiment," "an example," "a specific example," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed.

Claims

1. A data partition filling method for correcting high-throughput omics data loss, characterized by: the method comprises the following steps:

step three: filling the data of the three partitions by using corresponding filling algorithms respectively;

the specific steps of the first step are as follows:

(1) Calculating the detection rate of each molecule in each group of high throughput omics data expression matrices: the detection rate of the molecule in the group i = number of samples whose detection value is not 0/total number of samples of the group i;

(2) Calculating partition critical values Blow and Bup: for each group of the high-throughput omics data expression matrix, dividing all molecules detected in the samples of the group into three clusters according to their detected expression quantity by using a k-means algorithm, and calculating the median of the detection rates of the molecules contained in each cluster, wherein the minimum and the maximum of these medians are the partition critical values Blow and Bup.

2. The data partition filling method for correcting high throughput omics data loss as defined in claim 1, wherein: the third step comprises the following specific steps:

(2) Filling of unstable missing values:

for missing values caused by unstable expression of the molecule itself, the number of samples to be filled is first predicted using a Bayesian algorithm and the filling is then carried out: first, the potential missing rate missp of the molecule in the group is calculated using the formula missp = PA × PBA / (PBA × PA + 0.05 × (1 − PA)), where PBA is the intra-group missing rate of the molecule in the data set and PA is the overall missing rate of the molecule in the data set; next, the number IN of samples to be filled for the molecule in the group is calculated using the formula IN = min(Mj/2, (1 − missp) × Mi), where Mi is the number of samples in the group in which the molecule is not detected, and Mj is the number of samples in the group in which it is detected; finally, IN samples are selected at random from the samples in the group whose detection value is 0, and these are filled with the non-zero minimum value in the group;

(3) Filling of technical missing values: for molecules whose detection rate is greater than the maximum critical value, missing values are filled with the median of the detection values of the molecule in the group.