CN110097920B

CN110097920B - Metabonomics data missing value filling method based on neighbor stability

Info

Publication number: CN110097920B
Application number: CN201910284004.XA
Authority: CN
Inventors: 罗霄; 李超; 林晓惠
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-04-10
Filing date: 2019-04-10
Publication date: 2022-09-20
Anticipated expiration: 2039-04-10
Also published as: CN110097920A

Abstract

The invention provides a neighbor stability-based method for filling missing values in metabonomics data, and belongs to the technical field of metabonomics data analysis. The core of the method is to measure how stable the content of the corresponding metabolite is across the k nearest neighbor samples of a sample containing a missing metabolite, and then, based on the stable nearest neighbor samples, to fill different types of missing values with different strategies. The method fills metabonomics data containing missing values well, which is significant for subsequent data analysis, metabolic marker selection and the like.

Description

Metabonomics data missing value filling method based on neighbor stability

Technical Field

The invention belongs to the technical field of metabonomics data analysis and relates to a neighbor stability-based method for filling missing values in metabonomics data. The method takes into account the missing type of the metabolite missing values, the similarity between samples, and the stability of the neighbor samples.

Background

Metabolomics searches for metabolites associated with physiological and pathological changes by systematically studying, both qualitatively and quantitatively, the metabolites in organisms. Methods for the qualitative and quantitative determination of metabolites include mass spectrometry and nuclear magnetic resonance spectroscopy. Metabolomics data obtained by mass spectrometry generally contain many missing values. These missing values arise mainly in two ways. First, random errors introduced during data acquisition or instrument operation cause the content of certain metabolites in a sample to go undetected; this type is called random missing. Second, the content of a metabolite in a sample is below the detection limit of the mass spectrometer and is therefore not detected; this type is called non-random missing. For example, the concentration of bile acid metabolites in humans varies widely, and because of the instrument's detection limit, bile acid metabolites may be missing values in many samples of the resulting metabolomics data. However, conventional data analysis methods are only suitable for complete data matrices without missing values. If the metabolites or samples containing missing values are simply deleted, much valuable information is lost. Filling missing data with a simple and efficient method is therefore an important task in metabolomics data analysis, and is significant for subsequent data analysis, metabolic marker selection and the like.

Some methods for handling missing values in metabolomics data fill the missing value of a metabolite with zero, with the minimum content of that metabolite, with half of the minimum, or with the median. These methods are simple but tend to distort subsequent data analysis. The k-nearest-neighbor missing value filling algorithm is a common method for handling missing values in metabolomics data. It assumes that the more similar two samples are, the smaller the deviation between their metabolite contents. If the content of metabolite m of sample s is missing, the k-nearest-neighbor algorithm finds the k samples nearest to s according to a similarity measure (a neighbor whose content of metabolite m is itself missing is replaced by the next-nearest neighbor), and then fills the missing content of metabolite m of sample s with a weighted average of the content of metabolite m in those k nearest neighbor samples. The k-nearest-neighbor algorithm handles random missing data in metabolomics data fairly well, but its filling performance is still not ideal.

The present invention provides a neighbor stability-based method for filling missing values in metabonomics data. The method determines the k nearest neighbor samples of a sample containing a missing metabolite according to the Euclidean distances between samples, evaluates the stability of those nearest neighbor samples, and, based on the stable nearest neighbor samples, fills different types of missing values with corresponding strategies.

Disclosure of Invention

The object of the present invention is to fill in missing values in metabolomics data. The core of the method is to measure how stable the content of the corresponding metabolite is across the k nearest neighbor samples of a sample containing a missing metabolite, and then, based on the stable nearest neighbor samples, to fill different types of missing values with different strategies.

In order to achieve the above object, the technical solution adopted by the present invention is as follows:

A metabonomics data missing value filling method based on neighbor stability comprises the following steps:

Metabolic components in a biological sample are detected by mass spectrometry to obtain spectral data of the metabolic components. The spectral data are processed with preprocessing operations such as peak identification, peak matching and normalization to determine the content of the metabolites in the sample, yielding the metabonomics data.

Let $n$ denote the number of samples in the metabonomics data and $p$ the number of metabolites in each sample. $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the vector of the contents of the $p$ metabolites in the $i$-th sample, $1 \le i \le n$. If the content of metabolite $m$ in sample $x_i$ is missing (that is, $x_{im}$ is a missing value), $1 \le m \le p$, the missing value $x_{im}$ is filled by the following steps:

(1) Calculate the Euclidean distance $d(x_i, x_j)$ between sample $x_i$ and sample $x_j$ ($1 \le i \ne j \le n$). The formula is as follows:

wherein $o_{il}$ indicates whether the content of the $l$-th metabolite ($1 \le l \le p$) of sample $x_i$ is missing: $o_{il} = 0$ when the content of the $l$-th metabolite of sample $x_i$ is missing, and $o_{il} = 1$ otherwise.

$\sum_{l=1}^{p} o_{il} o_{jl}$ represents the number of metabolites whose content is missing in neither sample $x_i$ nor sample $x_j$. The smaller the distance $d(x_i, x_j)$, the higher the similarity between $x_i$ and $x_j$. The $k$ samples most similar to sample $x_i$ according to the Euclidean distance constitute the sample set $S_k$;
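Step (1) can be sketched in Python as follows. Because equation (1) is not reproduced in this text, the exact distance is an assumption: the sketch takes the root of the mean squared difference over the metabolites observed in both samples. The names `masked_distance` and `k_nearest`, the use of `None` for a missing value, and the toy data are all hypothetical.

```python
import math

def masked_distance(xi, xj):
    """Distance over metabolites observed in both samples (None = missing).

    Assumed form: sqrt(sum of squared differences / number of co-observed
    metabolites). The normalization is an assumption, not taken from the text.
    """
    pairs = [(a, b) for a, b in zip(xi, xj) if a is not None and b is not None]
    if not pairs:
        return math.inf  # no shared metabolites: treat as maximally dissimilar
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

def k_nearest(data, i, k):
    """Indices of the k samples closest to sample i, nearest first."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: masked_distance(data[i], data[j]))
    return others[:k]

# Hypothetical toy data: 4 samples, 3 metabolites.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 3.0],
    [1.0, None, 5.0],
]
print(k_nearest(data, 0, 2))
```

Only positions where both indicator values equal 1 contribute to the sum, which mirrors the role of $o_{il}$ in the definitions above.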

(2) Determine the missing type of the metabolite.

Calculate the Pearson correlation coefficients between metabolite $m$ and the other metabolites. The metabolite $aux_m$ with the strongest correlation with $m$ is taken as the reference metabolite of $m$. The missing type of metabolite $m$ is determined from the content distribution of the reference metabolite $aux_m$, as follows:

Let $S_{miss} = \{x_j \mid x_{jm} \text{ is a missing value},\ 1 \le j \le n\}$ denote the set of samples in the metabonomics data for which metabolite $m$ is a missing value. Let $S_{obs} = \{x_j \mid x_{jm} \text{ is not a missing value},\ 1 \le j \le n\}$ denote the set of samples for which metabolite $m$ is not a missing value. Calculate the average content of the reference metabolite $aux_m$ over $S_{miss}$ and over $S_{obs}$, denoted $\mu_{miss}$ and $\mu_{obs}$ respectively. When metabolite $m$ is positively correlated with $aux_m$ and $\mu_{miss} < \mu_{obs}$, the missing type of $m$ is non-random missing, and the method proceeds to step (3); otherwise the missing type of $m$ is random missing, and the method proceeds to step (4). When metabolite $m$ is negatively correlated with $aux_m$ and $\mu_{miss} > \mu_{obs}$, the missing type of $m$ is non-random missing, and the method proceeds to step (3); otherwise the missing type of $m$ is random missing, and the method proceeds to step (4).
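The decision rule of step (2) can be sketched as below. This is a minimal illustration, not a definitive implementation: the function names, the use of `None` for missing values, computing the correlation only over samples where both metabolites are observed, and skipping missing reference values when averaging are all assumptions. The sketch uses the strict inequalities of the description and does not guard against constant columns or an empty $S_{miss}$.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation over positions where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

def missing_type(data, m):
    """Return 'non-random' or 'random' for metabolite column m."""
    def col(l):
        return [row[l] for row in data]

    # Reference metabolite aux_m: strongest absolute correlation with m.
    corr = {l: pearson(col(m), col(l)) for l in range(len(data[0])) if l != m}
    aux = max(corr, key=lambda l: abs(corr[l]))

    # Mean of the reference metabolite over S_miss and S_obs.
    mu_miss = mean([r[aux] for r in data if r[m] is None and r[aux] is not None])
    mu_obs = mean([r[aux] for r in data if r[m] is not None and r[aux] is not None])

    if corr[aux] > 0:
        return "non-random" if mu_miss < mu_obs else "random"
    return "non-random" if mu_miss > mu_obs else "random"

# Hypothetical data: column 0 is metabolite m, column 1 its reference.
low_when_missing = [[None, 1.0], [2.0, 2.0], [3.0, 3.0],
                    [4.0, 4.5], [None, 1.5], [5.0, 5.0]]
print(missing_type(low_when_missing, 0))
```

The intuition is that when the reference metabolite points to low levels of $m$ in exactly the samples where $m$ is missing, the values were probably below the detection limit.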

(3) Handling of the non-random missing type.

When the content of metabolite $m$ is missing for a sample in $S_k$, the missing content of $m$ of that sample is temporarily filled with the minimum content of metabolite $m$ over all samples in the metabonomics data. This step takes into account that the metabonomics data contain non-random missing data. Non-random missing values occur because the metabolite content is below the detection limit of the instrument and is therefore not detected. Temporarily filling the missing $m$ values of the neighbor samples with the minimum content of metabolite $m$ better matches the characteristics of non-random missing data.

(4) Handling of the random missing type.

When the content of metabolite $m$ is missing for a sample in $S_k$, the missing content of $m$ of that sample is temporarily filled with the average content of metabolite $m$ over the samples in $S_k$ whose content of $m$ is not missing. When the content of metabolite $m$ is missing for all samples in $S_k$, the missing contents are instead temporarily filled with the minimum content of metabolite $m$ over all remaining samples in the metabonomics data.
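The two temporary-fill strategies of steps (3) and (4) can be sketched together. The function name `temp_fill`, its arguments, and the use of `None` for missing values are hypothetical. The neighbor values are those of the embodiment described later in this document, while the full-column values for samples outside the neighbor set are made-up placeholders.

```python
def temp_fill(neighbor_values, all_values, missing_type):
    """Temporarily fill missing (None) contents of metabolite m among the neighbors.

    neighbor_values: contents of m for the samples in S_k
    all_values:      contents of m for every sample in the data
    missing_type:    'non-random' or 'random'
    """
    observed_all = [v for v in all_values if v is not None]
    observed_nb = [v for v in neighbor_values if v is not None]
    if missing_type == "non-random" or not observed_nb:
        # Non-random type, or no neighbor has an observed value:
        # fall back to the smallest observed content of m.
        fill = min(observed_all)
    else:
        # Random type: mean over the neighbors whose content is observed.
        fill = sum(observed_nb) / len(observed_nb)
    return [fill if v is None else v for v in neighbor_values]

# m3 of x3, x2, x7, x8, x9, x10 as in the embodiment (x9 is missing).
neighbors_m3 = [3, 9, 5, 7, None, 6]
# m3 of x1..x10; the values for x4, x5, x6 are hypothetical placeholders.
all_m3 = [None, 9, 3, 4, 8, 2, 5, 7, None, 6]

print(temp_fill(neighbors_m3, all_m3, "random"))
print(temp_fill(neighbors_m3, all_m3, "non-random"))
```

In the random case only the neighbor values matter, so the placeholder values do not affect the first result.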

(5) Determine the stable neighbor samples.

The stable neighbor samples in $S_k$ are determined from the degree of fluctuation of the content of metabolite $m$ across the samples in $S_k$. Calculate the mean $\mu$ and the standard deviation $\sigma$ of the content of metabolite $m$ over the samples in $S_k$. If the content of metabolite $m$ of a sample in $S_k$ lies outside the range $[\mu - \sigma, \mu + \sigma]$, that sample is removed from $S_k$, yielding the stable neighbor sample set $S'_k$. Because metabolite contents should vary little among neighboring samples, removing the samples outside $[\mu - \sigma, \mu + \sigma]$ reduces the influence of outliers and makes the computed fill value more stable and reliable.
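A minimal sketch of the stability filter in step (5) follows. The function name is hypothetical, and the use of the sample standard deviation (denominator n − 1) is an assumption; it is chosen because it reproduces σ = 2 for the values {3, 9, 5, 7, 6, 6} used in the embodiment later in this document.

```python
import statistics

def stable_neighbors(values):
    """Indices of neighbors whose content of m lies in [mu - sigma, mu + sigma].

    values: contents of metabolite m for the samples in S_k, after temporary
    filling, so no value is missing. Needs at least two values.
    """
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    return [idx for idx, v in enumerate(values) if mu - sigma <= v <= mu + sigma]

values = [3, 9, 5, 7, 6, 6]  # m3 of x3, x2, x7, x8, x9, x10 in the embodiment
print(stable_neighbors(values))
```

The returned indices identify the samples kept in $S'_k$; here the first two neighbors fall outside the interval [4, 8] and are dropped.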

(6) Calculate the weighted average of the content of metabolite $m$ over the samples in $S'_k$, and fill the missing content of metabolite $m$ of sample $x_i$ with the value $x_{im}$ calculated using equation (3). The formulas are as follows:

$$w(x_i, s_j) = \frac{1 / d(x_i, s_j)}{\sum_{l=1}^{k'} 1 / d(x_i, s_l)} \qquad (2)$$

$$x_{im} = \sum_{l=1}^{k'} w(x_i, s_l)\, s_{lm} \qquad (3)$$

wherein $k' = |S'_k|$ is the number of samples in the sample set $S'_k$; $s_j$, $s_l$ ($1 \le j, l \le k'$) are samples in $S'_k$; $w(x_i, s_j)$ is the weight of sample $s_j$ when calculating $x_{im}$; $d(x_i, s_j)$ is the Euclidean distance between sample $x_i$ and $s_j$ calculated by equation (1); and $s_{lm}$ is the content of metabolite $m$ of sample $s_l$. The contents of $m$ of the different neighbor samples are given different weights according to the distance between each neighbor sample and sample $x_i$. The smaller the distance between a sample in $S'_k$ and sample $x_i$, the larger the weight of its content of metabolite $m$ and the greater its contribution to the calculated $x_{im}$.
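Step (6) can be sketched as below, assuming the weights are inverse distances normalized to sum to one. That form is an inference: it reproduces, to two decimals, the weights 0.29, 0.25, 0.25, 0.21 and the fill value 5.95 stated in the embodiment later in this document. The function names are hypothetical, and a zero distance is not handled.

```python
def inverse_distance_weights(distances):
    """Weights proportional to 1/d, normalized to sum to one (assumed form)."""
    inv = [1.0 / d for d in distances]
    total = sum(inv)
    return [v / total for v in inv]

def weighted_fill(distances, contents):
    """Weighted average of the neighbors' contents of metabolite m."""
    weights = inverse_distance_weights(distances)
    return sum(w * c for w, c in zip(weights, contents))

# Stable neighbors x7, x8, x9, x10 of x1 in the embodiment.
distances = [2.29, 2.71, 2.74, 3.16]
contents = [5, 7, 6, 6]

print([round(w, 2) for w in inverse_distance_weights(distances)])
print(round(weighted_fill(distances, contents), 2))
```

Closer neighbors receive larger weights, so the nearest stable neighbor contributes most to the fill value.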

The invention has the beneficial effects that:

The method fills missing data in metabonomics data. It takes the missing type of each metabolite's missing values into account and fills the missing values with a different strategy for each type; it also screens the neighbor samples and filters out the unstable ones. The method fills metabonomics data containing missing values well, which is significant for subsequent data analysis, metabolic marker selection and the like.

Detailed Description

An embodiment of the method on simulated data is described below in conjunction with the technical solution. The simulated data are used only to illustrate the invention for ease of understanding, and do not limit it.

Table 1 shows the simulated data. $x_i$ denotes the $i$-th sample, and the data contain 10 samples; $m_1$ to $m_5$ denote the 5 metabolites in the data, and NaN denotes a missing value.

Table 1: Simulated data

The data in Table 1 contain 4 missing values, namely $x_{13}$, $x_{52}$, $x_{84}$ and $x_{93}$. The procedure is described below using $x_{13}$ as an example.

(1) Calculating the distance $d$ between sample $x_1$ and the other samples using equation (1) yields: $d(x_1, x_2) = 1.94$, $d(x_1, x_3) = 1.73$, $d(x_1, x_4) = 3.39$, $d(x_1, x_5) = 3.46$, $d(x_1, x_6) = 4.12$, $d(x_1, x_7) = 2.29$, $d(x_1, x_8) = 2.71$, $d(x_1, x_9) = 2.74$, $d(x_1, x_{10}) = 3.16$. With $k = 6$, the set of the 6 samples most similar to sample $x_1$ is $S_k = \{x_3, x_2, x_7, x_8, x_9, x_{10}\}$.

(2) Determine the missing type of metabolite $m_3$. Calculate the Pearson correlation coefficients between $m_3$ and $m_1$, $m_2$, $m_4$, $m_5$. The calculation shows that $m_4$ has the strongest correlation with $m_3$ and that the correlation is positive, so $m_4$ is selected as the reference metabolite of $m_3$. The set of samples in which metabolite $m_3$ is a missing value is $S_{miss} = \{x_1, x_9\}$, and the set of samples in which $m_3$ is not missing is $S_{obs} = \{x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_{10}\}$. The mean of the reference metabolite $m_4$ over $S_{miss}$ is $\mu_{miss} = 7$, and its mean over $S_{obs}$ is $\mu_{obs} = 4.86$. Since $\mu_{miss} \ge \mu_{obs}$, the missing type is random missing, and the procedure goes to step (4).

(3) Among the 6 nearest neighbor samples of $x_1$, metabolite $m_3$ of sample $x_9$ is a missing value, so $x_{93}$ is temporarily filled with 6, the average of metabolite $m_3$ over $x_3$, $x_2$, $x_7$, $x_8$ and $x_{10}$.

(4) The $m_3$ values of the samples in $S_k$ are {3, 9, 5, 7, 6, 6}; their mean is $\mu = 6$ and their standard deviation is $\sigma = 2$, so the stability interval is [4, 8]. The values $x_{33}$ and $x_{23}$ lie outside the stability interval, so samples $x_3$ and $x_2$ are removed from $S_k$, giving $S'_k = \{x_7, x_8, x_9, x_{10}\}$.
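The stated values of the mean and standard deviation can be checked by hand. The value 2 for the standard deviation matches the sample form with denominator $n - 1$; which convention the method intends is not stated, so this is an inference from the numbers:

$$\mu = \frac{3 + 9 + 5 + 7 + 6 + 6}{6} = 6, \qquad \sigma = \sqrt{\frac{(-3)^2 + 3^2 + (-1)^2 + 1^2 + 0^2 + 0^2}{6 - 1}} = \sqrt{\frac{20}{5}} = 2$$

With denominator 6 instead, the standard deviation would be about 1.83 and the interval about [4.17, 7.83], which retains the same four samples in this example.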

(5) Calculate the weights of the samples in $S'_k$. From equation (2), the weights of the samples in $S'_k$ are: $w(x_1, x_7) = 0.29$, $w(x_1, x_8) = 0.25$, $w(x_1, x_9) = 0.25$, $w(x_1, x_{10}) = 0.21$. Using equation (3), the weighted average is $x_{13} = w(x_1, x_7) \cdot x_{73} + w(x_1, x_8) \cdot x_{83} + w(x_1, x_9) \cdot x_{93} + w(x_1, x_{10}) \cdot x_{10,3} = 5.95$. The value 5.95 is then used as the estimated fill value for the missing value $x_{13}$.

The missing values $x_{52}$, $x_{84}$ and $x_{93}$ are each filled in the same way using steps (1) to (6).
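The arithmetic behind these numbers can be reproduced by hand, assuming that equation (2) normalizes the inverse distances so that the weights sum to one. That form is an assumption, although it matches the stated weights to two decimals:

$$\frac{1}{2.29} + \frac{1}{2.71} + \frac{1}{2.74} + \frac{1}{3.16} \approx 0.4367 + 0.3690 + 0.3650 + 0.3165 = 1.4871$$

$$w(x_1, x_7) \approx \frac{0.4367}{1.4871} \approx 0.2936, \quad w(x_1, x_8) \approx 0.2481, \quad w(x_1, x_9) \approx 0.2454, \quad w(x_1, x_{10}) \approx 0.2128$$

$$x_{13} \approx 0.2936 \cdot 5 + 0.2481 \cdot 7 + 0.2454 \cdot 6 + 0.2128 \cdot 6 \approx 5.95$$

Here $x_{93} = 6$ is the temporary fill value from the earlier step, not a measured content.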

Claims

1. A metabonomics data missing value filling method based on neighbor stability is characterized by comprising the following steps:

detecting metabolic components in a biological sample by mass spectrometry to obtain spectral data of the metabolic components, processing the spectral data with peak identification, peak matching and normalization preprocessing operations, determining the content of the metabolites in the sample, and obtaining the metabonomics data;

$n$ denotes the number of samples in the metabonomics data, $p$ denotes the number of metabolites in each sample, and $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the vector of the contents of the $p$ metabolites in the $i$-th sample, $1 \le i \le n$; if the content of metabolite $m$ in sample $x_i$ is missing, i.e. $x_{im}$ is a missing value, $1 \le m \le p$, the missing value $x_{im}$ is filled by the following steps:

(1) calculating the Euclidean distance $d(x_i, x_j)$ between sample $x_i$ and each other sample $x_j$, $1 \le i \ne j \le n$, with the formula as follows:

wherein $o_{il}$ indicates whether the content of the $l$-th metabolite of sample $x_i$ is missing, $1 \le l \le p$: $o_{il} = 0$ when the content of the $l$-th metabolite of sample $x_i$ is missing, and $o_{il} = 1$ otherwise;

$\sum_{l=1}^{p} o_{il} o_{jl}$ represents the number of metabolites whose content is missing in neither sample $x_i$ nor sample $x_j$; the smaller the distance $d(x_i, x_j)$, the higher the similarity between $x_i$ and $x_j$; the $k$ samples most similar to sample $x_i$ according to the Euclidean distance constitute the sample set $S_k$;

(2) determining the missing type of the metabolite

calculating the Pearson correlation coefficients between metabolite $m$ and the other metabolites; taking the metabolite $aux_m$ with the strongest correlation with $m$ as the reference metabolite of $m$; and determining the missing type of metabolite $m$ from the content distribution of the reference metabolite $aux_m$, as follows:

letting $S_{miss} = \{x_j \mid x_{jm} \text{ is a missing value},\ 1 \le j \le n\}$ denote the set of samples in the metabonomics data for which metabolite $m$ is a missing value; letting $S_{obs} = \{x_j \mid x_{jm} \text{ is not a missing value},\ 1 \le j \le n\}$ denote the set of samples for which metabolite $m$ is not a missing value; calculating the average content of the reference metabolite $aux_m$ over $S_{miss}$ and over $S_{obs}$, denoted $\mu_{miss}$ and $\mu_{obs}$ respectively; when metabolite $m$ is positively correlated with $aux_m$ and $\mu_{miss} \le \mu_{obs}$, the missing type of $m$ is non-random missing, and the method proceeds to step (3); when metabolite $m$ is positively correlated with $aux_m$ and $\mu_{miss} > \mu_{obs}$, the missing type of $m$ is random missing, and the method proceeds to step (4); when metabolite $m$ is negatively correlated with $aux_m$ and $\mu_{miss} > \mu_{obs}$, the missing type of $m$ is non-random missing, and the method proceeds to step (3); when metabolite $m$ is negatively correlated with $aux_m$ and $\mu_{miss} \le \mu_{obs}$, the missing type of $m$ is random missing, and the method proceeds to step (4);

(3) handling of the non-random missing type

when the content of metabolite $m$ is missing for a sample in $S_k$, temporarily filling the missing content of $m$ of that sample with the minimum content of metabolite $m$ over all samples in the metabonomics data; proceeding to step (5);

(4) handling of the random missing type

when the content of metabolite $m$ is missing for a sample in $S_k$, temporarily filling the missing content of $m$ of that sample with the average content of metabolite $m$ over the samples in $S_k$ whose content of $m$ is not missing; when the content of metabolite $m$ is missing for all samples in $S_k$, temporarily filling the missing contents with the minimum content of metabolite $m$ over all remaining samples in the metabonomics data; proceeding to step (5);

(5) determining the stable neighbor samples

determining the stable neighbor samples in $S_k$ from the degree of fluctuation of the content of metabolite $m$ across the samples in $S_k$; calculating the mean $\mu$ and the standard deviation $\sigma$ of the content of metabolite $m$ over the samples in $S_k$; when the content of metabolite $m$ of a sample in $S_k$ lies outside the range $[\mu - \sigma, \mu + \sigma]$, removing that sample from $S_k$ to obtain the stable neighbor sample set $S'_k$;

(6) calculating the weighted average of the content of metabolite $m$ over the samples in $S'_k$, and filling the missing content of metabolite $m$ of sample $x_i$ with the value $x_{im}$ calculated using equation (3), the formulas being as follows:

$$w(x_i, s_j) = \frac{1 / d(x_i, s_j)}{\sum_{l=1}^{k'} 1 / d(x_i, s_l)} \qquad (2)$$

$$x_{im} = \sum_{l=1}^{k'} w(x_i, s_l)\, s_{lm} \qquad (3)$$

wherein $k' = |S'_k|$ is the number of samples in the sample set $S'_k$; $s_j$, $s_l$ ($1 \le j, l \le k'$) are samples in $S'_k$; $w(x_i, s_j)$ is the weight of sample $s_j$ when calculating $x_{im}$; $d(x_i, s_j)$ is the Euclidean distance between sample $x_i$ and $s_j$ calculated by equation (1); $s_{lm}$ is the content of metabolite $m$ of sample $s_l$; the contents of $m$ of the different neighbor samples are given different weights according to the distance between each neighbor sample and sample $x_i$; the smaller the distance between a sample in $S'_k$ and sample $x_i$, the larger the weight of its content of metabolite $m$ and the greater its contribution to the calculated $x_{im}$.