WO2024051052A1 - Batch correction method, device, storage medium and electronic equipment for omics data

Info

Publication number
WO2024051052A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
batch
correction
analysis
omics
Prior art date
Application number
PCT/CN2022/143821
Other languages
English (en)
French (fr)
Inventor
成晓亮
郑和龙
周岳
张伟
Original Assignee
上海氨探生物科技有限公司
南京品生医疗科技有限公司
南京品生医学检验实验室有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海氨探生物科技有限公司, 南京品生医疗科技有限公司, 南京品生医学检验实验室有限公司
Publication of WO2024051052A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present invention relates to the field of biological analysis technology, and in particular to batch correction methods, devices, storage media and electronic equipment for omics data.
  • Proteomics, metabolomics and lipidomics based on mass spectrometry technology have become key methods for biological analysis.
  • the present invention provides batch correction methods, devices, storage media and electronic equipment for omics data to solve errors caused by batch detection.
  • a batch correction method for omics data includes:
  • a device for batch correction of omics data which device includes:
  • the data preprocessing module is used to obtain omics data of multiple batches of samples, preprocess the omics data, and obtain preprocessed data;
  • the batch correction module is used to perform batch correction processing on the preprocessed data to obtain corrected data;
  • the data analysis module is used to perform a preset type of analysis processing on the corrected data to obtain batch correction analysis results.
  • an electronic device includes:
  • a memory communicatively connected to at least one processor; wherein,
  • the memory stores a computer program that can be executed by at least one processor, and the computer program is executed by at least one processor, so that at least one processor can execute a batch correction method for omics data according to any embodiment of the present invention.
  • a computer-readable storage medium stores computer instructions.
  • when executed by a processor, the computer instructions implement the batch correction method for omics data of any embodiment of the present invention.
  • the technical solution of the embodiment of the present invention is to obtain the omics data of multiple batches of samples and preprocess the omics data to obtain the preprocessed data. Perform batch correction processing on the preprocessed data to obtain corrected data. Perform a preset type of analysis and processing on the correction data to obtain batch correction analysis results. The problem of errors caused by batch testing is solved, and the quality of omics data can be assessed more accurately and efficiently.
  • Figure 1 is a flow chart of a batch correction method for omics data provided by Embodiment 1 of the present invention
  • Figure 2 is a flow chart of a batch correction method for omics data provided in Embodiment 2 of the present invention.
  • Figure 3 is a flow chart of the batch correction method of omics data provided in Embodiment 3 of the present invention.
  • Figure 4 is a cumulative QC sample RSD box plot provided by Embodiment 3 of the present invention.
  • Figure 5 is a histogram of cumulative QC sample RSD percentage provided by Embodiment 3 of the present invention.
  • Figure 6 is a schematic structural diagram of a batch correction device for omics data provided in Embodiment 4 of the present invention.
  • FIG. 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present invention.
  • FIG. 1 is a flow chart of a batch correction method for omics data provided in Embodiment 1 of the present invention.
  • This embodiment is applicable to situations in which a large number of samples are processed by mass spectrometry, the omics data of each sample is obtained through analysis, and batch correction and analysis are performed on that data.
  • This method can be performed by a batch correction device for omics data.
  • the batch correction device for omics data can be implemented in the form of hardware and/or software.
  • the batch correction device for omics data can be configured in electronic equipment such as computers. As shown in Figure 1, the method includes:
  • mass spectrometry is an important analysis technology in research fields such as biological macromolecules, and can be used to analyze samples. Because the number of samples is large, the mass spectrometry analysis has to be performed in multiple batches. Omics data is the molecular expression information obtained through mass spectrometry analysis of the multiple batches of samples. Omics data can include genomics data, proteomics data, lipidomics data, metabolomics data, etc.
  • the sample information can be associated with the omics data to assist in the analysis and processing of the omics data.
  • the sample information may include but is not limited to the sample name, the group name corresponding to the sample, the sequence of mass spectrometry, batch information, etc.
  • the omics data can be determined based on the batch information, sample name, and group name corresponding to the sample (ie, the type of sample) in the sample information.
  • Preprocess the acquired omics data to obtain preprocessed data.
  • the preprocessing objects can include omics data and associated sample information.
  • Preprocessing can include one or more of data cleaning, data normalization, and missing value processing.
  • Data cleaning can be a process of re-examining and verifying the data. It can delete duplicate information and correct erroneous information to ensure that the data is complete, correct and tidy.
  • Data normalization limits the processed data to a certain range, so that the distribution of the detection data remains consistent and the adverse effects of singular sample data are eliminated. For example, median normalization, variance-stabilizing normalization, etc. can be used.
  • Missing value processing can include filling missing values and discarding omics data with missing values.
  • missing value filling fills in the missing data in the sample data to reduce the impact of missing values on the detection results. For example, default value filling, mean filling, mode filling, K-Nearest Neighbors (KNN) filling, interpolation filling, etc. can be used. Discarding omics data with missing values means discarding the omics data when the number of missing values exceeds a certain value and cannot be filled, so that omics data with missing values does not interfere with the overall omics data.
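  • The two missing-value options above (fill, or discard when too much is missing) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `handle_missing`, the use of mean filling and the 0.5 threshold are all assumptions chosen for the example.

```python
import math


def handle_missing(values, max_missing_fraction=0.5):
    """Return a mean-filled copy of one molecule's intensities, or None to discard it.

    `values` holds one intensity per sample; None or NaN marks a missing value.
    The 0.5 threshold is an illustrative assumption, not a value from the text.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    if not values:
        return None
    observed = [v for v in values if not is_missing(v)]
    missing_fraction = 1 - len(observed) / len(values)
    if not observed or missing_fraction > max_missing_fraction:
        return None  # too many gaps to fill reliably: discard this molecule
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]
```

  • Mean filling is only one of the listed options; KNN or interpolation filling would replace the last two lines of the function.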
  • the preprocessed data obtained through preprocessing can be used for subsequent data correction and data analysis.
  • multiple batches of omics data are obtained through multiple batches of mass spectrometry processing.
  • errors may occur due to abiotic factors that affect the accuracy of the omics data. That is, there may be batch abnormalities in omics data obtained from different batches.
  • the abiotic factors may be external factors such as abnormalities in the mass spectrometry equipment or operator error.
  • Batch correction corrects batch differences caused by abiotic factors, eliminating batch effects, biases and systematic errors as far as possible so that the data reflects the biological state of the samples. Batch correction can be implemented with the ComBat method, the surrogate variable method, the mean centering method, the distance weighted discrimination method or other methods, and there is no limitation on this. Corrected data is obtained by performing batch correction on the preprocessed data; the corrected data is generally the omics data after batch differences have been eliminated.
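  • Of the methods listed, mean centering is the simplest to illustrate. The sketch below handles a single molecule and assumes that "mean centering" means shifting every batch so that its mean matches the overall mean; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def mean_center_by_batch(intensities, batches):
    """Shift each batch of one molecule's intensities onto the overall mean.

    `intensities[k]` is the value for sample k and `batches[k]` its batch label.
    """
    overall_mean = sum(intensities) / len(intensities)
    by_batch = defaultdict(list)
    for value, batch in zip(intensities, batches):
        by_batch[batch].append(value)
    batch_mean = {b: sum(v) / len(v) for b, v in by_batch.items()}
    # Remove the batch-specific offset, then restore the overall level.
    return [value - batch_mean[batch] + overall_mean
            for value, batch in zip(intensities, batches)]
```

  • Mean centering only removes additive offsets; methods such as ComBat also adjust the batch-specific scale.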
  • performing batch correction processing on the preprocessed data to obtain corrected data includes: for any type of omics data of any sample in any batch, determining the correction parameters based on the batch matrix of the omics data and the initial expression amount of the omics data; and determining the correction data based on the initial expression amount of the omics data and the correction parameters.
  • the L/S (location and scale) model parameters that represent batch effects are estimated by pooling information across molecules within each batch, which shrinks each batch effect parameter estimate toward the overall mean of the batch effect estimates (across molecules). Empirical Bayes estimators are then used to adjust the data for batch effects, providing a more robust adjustment for each molecule.
  • the initial expression level may be the molecular expression level in uncorrected omics data.
  • the batch matrix of omics data is formed by the batch information of the samples described in each omics data, and the correction parameters are used to correct the initial expression amount in the omics data to obtain corrected data.
  • the calculation formula of the correction data can be set in advance, and the correction data can be obtained by inputting the determined correction parameters and the initial expression amount of the omics data to be corrected into the above calculation formula.
  • determining the correction parameters based on the batch matrix of the omics data and the initial expression amount of the omics data includes: determining initial correction parameters based on the batch matrix of the omics data and the initial expression amount of the omics data; and determining the correction parameters based on the initial correction parameters and a preset distribution.
  • the initial correction parameters include the overall average expression amount of the omics data, the regression coefficient variable corresponding to the batch matrix, the additive batch effect parameter, and the multiplicative batch effect parameter.
  • the error term satisfies the standard normal distribution.
  • each initial correction parameter has a one-to-one correspondence with the correction parameter, and the corresponding correction parameter is determined based on the initial correction parameter.
  • because mass spectrometry instruments have different sensitivities to different molecules, the magnitude of molecular expression may differ between molecules (for example, with molecular weight), which would bias the empirical Bayes estimate of the prior distribution of the batch effect.
  • to avoid this, the molecular expression data are standardized so that the molecules have a similar overall mean and variance.
  • the correction parameters conform to the preset distribution.
  • the distributions of different correction parameters may be the same or different, and there is no limitation on this.
  • the least squares method is used to estimate the initial correction parameters to obtain the corresponding correction parameters. That is, the overall average expression amount, the regression coefficient variable corresponding to the batch matrix and the additive batch effect parameter among the initial correction parameters are each estimated by the least squares method, which yields the correction parameters corresponding to the overall average expression amount, the regression coefficient variable and the additive batch effect parameter.
  • the variance can be further estimated based on the estimated overall average expression amount, the regression coefficient variable corresponding to the batch matrix, and the additive batch effect parameter. For example, for each type of molecule, the variance can be determined based on the difference between the initial expression amount of each sample and the above correction parameter. It is assumed that the correction parameters satisfy the preset distribution and the prior distribution of the batch effect parameters satisfies the preset distribution.
  • the preset distribution can be a normal distribution.
  • determining the correction data based on the initial expression amount and the correction parameters of the omics data includes: determining a standardized value based on the initial expression amount and the correction parameters of the omics data; and determining the correction data based on the standardized value and the correction parameters.
  • after standardization, the omics data have a similar overall mean and variance.
  • the standardized values can be determined from the initial expression amount, the overall average expression estimate, the batch matrix, the regression coefficient estimate and the error estimate.
  • the correction data, that is, the omics data after batch correction, can be determined from the standardized values and the correction parameters. The correction data can therefore be determined from the initial expression amount, the overall average expression estimate, the batch matrix, the regression coefficient estimate, the additive batch effect parameter estimate and the multiplicative batch effect parameter estimate.
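  • The two steps above (standardize, then correct) can be sketched for a single value. The exact formula is not reproduced in this text, so the sketch assumes the location-and-scale adjustment commonly associated with ComBat; all parameter names (`alpha_hat`, `x_beta_hat`, `sigma_hat`, `gamma_star`, `delta_star`) are illustrative.

```python
def standardize(y, alpha_hat, x_beta_hat, sigma_hat):
    """Standardized value: remove the overall mean and covariate term, scale by the error estimate."""
    return (y - alpha_hat - x_beta_hat) / sigma_hat


def correct(y, alpha_hat, x_beta_hat, sigma_hat, gamma_star, delta_star):
    """Batch-corrected value, assuming a ComBat-style location/scale adjustment.

    gamma_star: additive batch effect estimate (on the standardized scale)
    delta_star: multiplicative batch effect estimate (a standard deviation, > 0)
    """
    z = standardize(y, alpha_hat, x_beta_hat, sigma_hat)
    # Remove the batch shift and scale, then map back to the original scale.
    return sigma_hat / delta_star * (z - gamma_star) + alpha_hat + x_beta_hat
```

  • With `gamma_star = 0` and `delta_star = 1` (no batch effect) the function returns the initial expression amount unchanged, which is a quick sanity check on any implementation.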
  • the corrected omics data is analyzed to realize the analysis of the sample.
  • Omics data is high-dimensional data and needs to be analyzed from multiple dimensions.
  • the preset type of analysis processing includes but is not limited to: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, molecular trend analysis, sample correlation analysis and test sample repeatability analysis.
  • the processing rules for various types of analysis and processing are stored in advance, the corresponding processing rules are called according to the needs of analysis and processing, and the corrected omics data is analyzed and processed based on the called processing rules.
  • the type of analysis processing may be set in advance, for example, a type identifier of the analysis processing may be input, and the corresponding processing rule may be called based on the type identifier.
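  • One way to read "processing rules stored in advance and called by type identifier" is a lookup table of functions. The sketch below is hypothetical: the identifiers, the two toy rules and the data layout (sample name mapped to a list of intensities) are assumptions made for illustration.

```python
import statistics


def intensity_distribution(data):
    """Per-sample (min, median, max) of the intensities."""
    return {name: (min(v), statistics.median(v), max(v)) for name, v in data.items()}


def sample_means(data):
    """Per-sample mean intensity."""
    return {name: statistics.fmean(v) for name, v in data.items()}


# Processing rules registered in advance, keyed by a type identifier.
PROCESSING_RULES = {
    "intensity_distribution": intensity_distribution,
    "sample_means": sample_means,
}


def run_analysis(type_id, data):
    """Look up the processing rule for `type_id` and apply it to the corrected data."""
    try:
        rule = PROCESSING_RULES[type_id]
    except KeyError:
        raise ValueError(f"unknown analysis type: {type_id!r}") from None
    return rule(data)
```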
  • the correction data of partial samples can be analyzed and processed, or the correction data of all samples can be analyzed and processed, which can be determined according to the needs of analysis and processing, and is not limited to this.
  • the partial samples may be the quality control (QC) samples in each batch.
  • dimensionality reduction analysis can use principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) or uniform manifold approximation and projection (UMAP).
  • Projecting the data onto two component axes visualizes sample proximity. Coloring the samples by technical or biological factors, or highlighting duplicates, helps explain what drives sample proximity, and makes it possible to evaluate clustering by biological and technical factors or to examine the similarity of duplicates. When PCA/t-SNE/UMAP no longer shows clustering by batch, the similarity between samples is no longer driven by technical factors. As a further example, discriminant analysis is used to check the clustering distribution of samples under supervised modeling, so that the extracted characteristic variables summarize the information of the original variables well and have strong explanatory power for the dependent variable. The above results are represented by a scatter plot, in which each point represents a sample and the color represents the group of the sample.
  • for molecular trend analysis, 50 molecules can be randomly selected and their expression intensity in all samples displayed in a scatter plot. Two colors are used, one for QC samples and one for target samples, and a curve is fitted to each of the two types of samples to characterize the stability of the molecule in the QC samples.
  • sample correlation analysis calculates the Pearson or Spearman correlation coefficient between samples.
  • This analysis can compare sample repeatability, especially the correlation within a group; a high correlation between QC samples in particular indicates good data quality.
  • a higher correlation between samples from the same batch than between samples from unrelated batches is a clear sign of bias and may reflect the presence of batch effects. The results are visualized using heat maps and violin plots respectively.
  • the heat map uses a color gradient according to the size of the correlation coefficient, and can also be used to evaluate how far a given sample deviates from the other samples; for large sample cohorts, however, the heat map is not convenient for displaying and viewing the data.
  • in that case a violin plot is used to visualize the correlation results. The color indicates the group; the larger the value, the greater the correlation within the group.
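  • The within-group correlation described above can be sketched without any plotting. The helper names and the dictionary layout are assumptions; only the Pearson coefficient is shown, and the values returned per group are what a violin plot would display.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length intensity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def within_group_correlations(profiles, groups):
    """Pairwise correlations between samples of the same group.

    profiles: sample name -> list of molecule intensities (same molecule order)
    groups:   sample name -> group name (for example "QC")
    Returns group name -> list of Pearson coefficients.
    """
    result = {}
    for a, b in combinations(sorted(profiles), 2):
        if groups[a] == groups[b]:
            result.setdefault(groups[a], []).append(pearson(profiles[a], profiles[b]))
    return result
```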
  • the technical solution of this embodiment provides a batch correction method for omics data: omics data of multiple batches of samples is obtained and preprocessed to obtain preprocessed data, batch correction processing is performed on the preprocessed data to obtain correction data, and a preset type of analysis processing is performed on the correction data to obtain batch correction analysis results.
  • FIG. 2 is a flow chart of a batch correction method for omics data provided in Embodiment 2 of the present invention. This embodiment is refined based on the above embodiment. As shown in Figure 2, the method includes:
  • S220 Perform a preset type of analysis processing on the preprocessed data to obtain analysis results without batch correction.
  • a preset type of analysis processing can be performed on the correction data, and a preset type of analysis processing can be performed on the preprocessed data.
  • Omics data is high-dimensional data and needs to be analyzed from multiple dimensions.
  • Preset types of analysis processing can include but are not limited to: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and test sample repeatability analysis.
  • the analysis results without batch correction can be transmitted to a display device for display.
  • the display device can be a computer screen, an electronic display screen, etc., and the batch correction analysis results can be displayed at the same time.
  • the display mode of the display device varies with the preset type, and can be image display, chart display, digital display, etc., so that the analysis results without batch correction and the analysis results after batch correction are displayed intuitively.
  • the preprocessing includes one or more of data cleaning, data normalization and missing value processing; accordingly, performing a preset type of analysis processing on the preprocessed data to obtain the analysis results without batch correction includes: performing a preset type of analysis on the preprocessed data obtained after each preprocessing step to obtain at least one analysis result without batch correction.
  • the preprocessing can be any one of data cleaning, data normalization and missing value processing, a combination of any two of these steps, or all three steps.
  • the preset type may refer to the preset type in the above embodiment, and may be sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and test sample repeatability analysis. Perform a preset type of analysis on each preprocessed data.
  • when the preprocessing is a single step among data cleaning, data normalization and missing value processing, one analysis result without batch correction can be obtained.
  • when the preprocessing consists of two or more steps, a preset type of analysis processing can be performed after each step to obtain two or more analysis results without batch correction.
  • the analysis results without batch correction and the analysis results after batch correction can be shown on the display device, so that the operating user can view both sets of results.
  • by comparing the analysis results without batch correction with the analysis results after batch correction, it can be determined whether there are batch abnormalities in the omics data obtained during mass spectrometry processing; for example, the two sets of results can be compared visually on the basis of the displayed images, charts, numbers, etc.
  • compare the unbatch-corrected analysis results with the batch-corrected analysis results to determine whether there are batch quality abnormalities in the omics data. For example, the analysis results of the same preset type may be compared.
  • each batch of samples includes a test sample (for example, it may be a QC sample); the omics data of multiple batches of samples includes the omics data of the test samples in each batch.
  • the method also includes: sequentially determining the omics data groups for analysis according to the batch sequence of the test samples, performing a preset type of analysis processing on each omics data group, and obtaining the analysis results of each omics data group; based on The analysis results of each omics data group determine whether there is a batch quality abnormality in the omics data.
  • the test sample can be a qualified sample or a mixed sample of samples from each batch.
  • the test samples in different batches are the same, that is, the omics data obtained by mass spectrometry processing of the test samples are theoretically the same. Insert the test samples into each batch separately to ensure that the omics data of multiple batches of samples include the omics data of the test samples in each batch.
  • only the omics data of the test samples in each batch are analyzed and processed, which reduces the amount of data to be analyzed.
  • a preset type of analysis is performed on the omics data of the test samples in each batch.
  • the omics data of the test samples are theoretically the same, so the preset type of analysis should in theory give the same result for every batch.
  • if the analysis results of the batches are consistent, the batch quality can be determined to be normal.
  • if the analysis result of a batch deviates from those of the other batches, the batch quality can be determined to be abnormal.
  • determining whether there is an abnormality in batch quality in the omics data based on the analysis results of each omics data group includes: determining whether there are abnormal analysis results in the analysis results of each omics data group. If so, the abnormal test sample is determined based on the omics data group with abnormal analysis results, and the batch in which the abnormal test sample is located is determined as the abnormal batch.
  • the analysis method of the test sample may include cumulative test sample relative standard deviation (RSD) analysis, cumulative test sample relative standard deviation percentage analysis, intensity analysis of each test sample, and molecular stability analysis.
  • for example, in cumulative test sample relative standard deviation (RSD) analysis, the order of the test samples can follow the batch order; multiple omics data groups are determined as test samples are added, and the relative standard deviation of each omics data group is calculated in turn.
  • the first omics data group can be the omics data of the test samples in the first two batches, the second omics data group can be the omics data of the test samples in the first three batches, and so on. Relative standard deviation analysis is performed on each omics data group.
  • the results can be visualized with box plots.
  • the first box plot can represent the statistics of the relative standard deviation of the omics data of the test samples in the first two batches, the second box plot those of the first three batches, and so on until the relative standard deviation over all test samples has been calculated. As the number of test samples increases, the drift or dispersion of the test samples can be read from the box plots.
  • if the relative standard deviation values in the box plots meet the preset threshold, the batch quality can be determined to be normal.
  • if the relative standard deviation no longer meets the preset threshold once the (N+1)-th test sample is added, the (N+1)-th test sample can be determined to be abnormal, and on that basis the (N+1)-th batch is determined to be an abnormal batch.
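  • The cumulative RSD check can be sketched as below. Several details are assumptions rather than statements from the text: the RSD uses the sample standard deviation, the decision statistic is the median RSD across molecules, and the 30% threshold is only a placeholder for the "preset threshold".

```python
import statistics


def rsd(values):
    """Relative standard deviation in percent (sample standard deviation / mean)."""
    return statistics.stdev(values) / statistics.fmean(values) * 100


def cumulative_rsd(qc_intensities):
    """Per-molecule RSD over the first k QC samples, for k = 2..N.

    qc_intensities: list of QC samples in batch order, each a list of
    molecule intensities in the same molecule order.
    """
    results = []
    for k in range(2, len(qc_intensities) + 1):
        subset = qc_intensities[:k]
        results.append([rsd(list(column)) for column in zip(*subset)])
    return results


def first_abnormal_qc(qc_intensities, threshold=30.0):
    """1-based index of the first QC sample whose inclusion pushes the
    median RSD above `threshold`, or None if no sample does."""
    for k, per_molecule in enumerate(cumulative_rsd(qc_intensities), start=2):
        if statistics.median(per_molecule) > threshold:
            return k
    return None
```

  • Each inner list returned by `cumulative_rsd` corresponds to one box in the box plot; `first_abnormal_qc` identifies the QC sample, and hence the batch, at which the box first exceeds the threshold.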
  • the technical solution of this embodiment builds on the above embodiment by additionally performing a preset type of analysis processing on the preprocessed data to obtain analysis results without batch correction.
  • the analysis results without batch correction and the analysis results after batch correction are shown on the display device, and the batch quality of the omics data can be determined by comparing them.
  • batch abnormalities are determined by including the omics data of the test samples in the omics data of the multiple batches of samples, performing a preset type of analysis on the omics data of the test samples, identifying abnormal test samples from the analysis results, and determining the batch containing an abnormal test sample to be an abnormal batch.
  • FIG. 3 is a flow chart of the batch correction method for omics data provided in Embodiment 3 of the present invention. Based on the above embodiments, Embodiment 3 of the present invention also provides a preferred example of a batch correction method for omics data. The method includes: raw data input and raw data cleaning, omics data normalization and missing value processing, batch correction, and data analysis and evaluation.
  • the input requires two files: the sample information file and a text file constructed by the user (the format can be csv, txt or excel).
  • the sample information file includes the following columns: sample name (ID), group name corresponding to the sample (Type), mass spectrometry injection order (order) and batch information (batch). In the batch column, samples belonging to the same batch are represented by the same number or letter.
  • the schematic diagram of the sample information file is shown in Table 1 below.
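  • Reading a sample information file with the four columns named above could look like the following sketch. It handles only the csv case, takes the file content as a string, and the function name and error handling are assumptions; the example rows in the usage note are invented for illustration and are not Table 1.

```python
import csv
import io

# Column names taken from the description of the sample information file.
REQUIRED_COLUMNS = ("ID", "Type", "order", "batch")


def read_sample_info(text):
    """Parse csv text into a list of row dicts sorted by injection order."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Convert the injection order to an integer so that sorting is numeric.
    rows = [dict(row, order=int(row["order"])) for row in reader]
    rows.sort(key=lambda r: r["order"])
    return rows
```

  • For instance, `read_sample_info("ID,Type,order,batch\nS2,case,2,1\nQC1,QC,1,1\n")` returns the QC row first because it has the lower injection order.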
  • the software organizes the data into the corresponding format.
  • according to the sample name (ID) in the sample information file, it extracts and retains the necessary information, such as the expression intensity of each protein or metabolite detected in each sample.
  • this information is saved as a text file of raw expression intensities, without any calibration, normalization or correction against the values in other samples, and is available for subsequent analysis by the data analysis and evaluation module.
  • Missing values are generally caused by instrument collection. Many analysis methods do not allow data to contain missing values, which will have a great impact on the selection of data methods. Too many missing values cannot accurately represent data information. However, simply discarding missing values, or directly filling missing values using inappropriate methods, will cause a large amount of useful information to be lost or non-biological differences to lead to erroneous conclusions. The missing value processing process will try to eliminate the impact of missing values on the results.
  • the quantitative expression values in the text file are first normalized. A variety of normalization methods are provided for the user to choose from, such as median normalization, where
  • $x_i$ is the quantitative value of a molecule in the sample,
  • $x$ is the sequence of quantitative values of all molecules in the sample, and
  • $\mathrm{norm}(x_i)$ is the value after median normalization;
  • and quantile normalization, $\mathrm{norm}(x_i) = \dfrac{x_i - \mathrm{median}(x)}{Q_3(x) - Q_1(x)}$ (the normalized value of a molecule in the sample equals its quantitative expression value minus the median of the quantitative expression values of the molecules in the sample, divided by the difference between the upper quartile $Q_3$ and the lower quartile $Q_1$ of all molecules in the sample). Variance-stabilizing normalization is used by default.
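  • The quartile-based normalization described in words above can be sketched as follows. How the quartiles are interpolated is not specified in the text, so the "inclusive" method of Python's `statistics.quantiles` is an assumption, as is the function name.

```python
import statistics


def quartile_scale(x):
    """Scale one sample's intensities as (x_i - median(x)) / (Q3(x) - Q1(x))."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    spread = q3 - q1
    if spread == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    return [(value - median) / spread for value in x]
```

  • After this transformation every sample has median 0 and interquartile range 1, which makes intensity distributions comparable across samples.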
  • $Y_{ijg}$ represents the expression value of molecule $g$ from sample $j$ in batch $i$
  • $\alpha_g$ is the overall average expression level of molecule $g$
  • $X$ is the batch matrix built from the batch column information, and $\beta_g$ is the regression coefficient variable corresponding to $X$; $\gamma_{ig}$ represents the additive batch effect of molecule $g$ in batch $i$, and $\delta_{ig}$ represents the multiplicative batch effect of molecule $g$ in batch $i$.
  • the expression value $Y_{ijg}$ of molecule $g$ of each sample in each batch, the overall average expression amount $\alpha_g$ of molecule $g$ and the batch matrix $X$ are known; the additive batch effect $\gamma_{ig}$, the multiplicative batch effect $\delta_{ig}$, the error term $\varepsilon_{ijg}$ and the regression coefficient variable $\beta_g$ are to be estimated.
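  • Equation (1) itself is not reproduced in this text. The definitions above match the location-and-scale model usually associated with ComBat, so, as an assumption rather than a quotation from the patent, it would read as follows, with $\alpha_g$ the overall average expression, $X\beta_g$ the covariate term, $\gamma_{ig}$ the additive batch effect, $\delta_{ig}$ the multiplicative batch effect and $\varepsilon_{ijg}$ the error term:

```latex
Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg} \tag{1}
```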
  • the ComBat algorithm is based on the above model and extends it with the empirical Bayes method: the L/S model parameters that represent the batch effect are estimated by pooling information across molecules within each batch, which shrinks each batch effect parameter estimate toward the overall mean (across molecules) of the batch effect estimates.
  • Empirical Bayes estimators are then used to adjust the data for batch effects, providing a more robust adjustment for each molecule.
  • the standardized data should satisfy $Z_{ijg} \sim N(\gamma_{ig}, \delta_{ig}^2)$ (the scale parameter here does not have the same meaning as the error term in (1)). If the normal distribution parameters $\gamma_{ig}$ and $\delta_{ig}^2$ of $Z_{ijg}$ satisfy $\gamma_{ig} \sim N(\gamma_i, \tau_i^2)$ and $\delta_{ig}^2 \sim \text{Inverse Gamma}(\lambda_i, \theta_i)$, parametric empirical Bayes is used. If the normal distribution parameters of $Z_{ijg}$ do not meet these conditions, a more flexible prior distribution is needed, and non-parametric empirical Bayes can be used instead. In either case the batch effect estimates $\gamma^{*}_{ig}$ and $\delta^{2*}_{ig}$ are calculated, and finally the adjusted molecular expression data $Y^{*}_{ijg}$ is obtained.
  • the first step is to estimate the empirical hyperpriors, that is, to derive estimates of the prior hyperparameters $\gamma_i$, $\tau_i^2$, $\lambda_i$ and $\theta_i$
  • the sample mean of molecule $g$ in batch $i$ is denoted $\hat{\gamma}_{ig}$; the estimates of $\gamma_i$ and $\tau_i^2$
  • can therefore be expressed as:
  • likewise, the sample variance $\hat{\delta}_{ig}^2$ of molecule $g$ in batch $i$ can be obtained, and its mean and variance across molecules can be calculated. Setting these equal to the theoretical moments (mean and variance) of the inverse gamma distribution, estimates of $\lambda_i$ and $\theta_i$ can be derived:
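  • The moment-matching step can be sketched in code. For an inverse gamma distribution with shape $\lambda$ and scale $\theta$, the mean is $\theta/(\lambda-1)$ and the variance is $\theta^2/((\lambda-1)^2(\lambda-2))$; solving these for $\lambda$ and $\theta$ gives the expressions used below. The function names are illustrative, and using the sample variance across molecules for $\tau_i^2$ is an assumption.

```python
import statistics


def normal_hyperparameters(gamma_hats):
    """Estimate the prior mean and variance of the additive batch effect
    from the per-molecule sample means of one batch."""
    return statistics.fmean(gamma_hats), statistics.variance(gamma_hats)


def inverse_gamma_hyperparameters(delta2_hats):
    """Estimate (lambda, theta) by matching the mean and variance of the
    per-molecule sample variances to the inverse gamma moments."""
    v_bar = statistics.fmean(delta2_hats)      # mean of the sample variances
    s2 = statistics.variance(delta2_hats)      # variance of the sample variances
    lam = (v_bar ** 2 + 2 * s2) / s2           # from variance = mean^2 / (lam - 2)
    theta = (v_bar ** 3 + v_bar * s2) / s2     # from mean = theta / (lam - 1)
    return lam, theta
```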
  • The second step is parametric batch-effect correction, applying Bayes' theorem to find the conditional (posterior) distribution of $\gamma_{ig}$.
  • The posterior distribution should satisfy:
  • The above expression can be identified as the kernel of a normal distribution, whose expected value is:
  • For the conditional posterior distribution of the quantity shown in Figure PCTCN2022143821-appb-000036, given $\gamma_{ig}$ and the $\text{Inverse Gamma}(\lambda_i, \theta_i)$ prior, the distribution should satisfy:
  • $Y^{*}_{ijg}$ is the final corrected data.
  • In the non-parametric case, $Z_{ijg}$ still follows the normal distribution $Z_{ijg} \sim N(\gamma_{ig}, \delta^2_{ig})$, and the derivation is similar to the previous one (Figures PCTCN2022143821-appb-000053 and PCTCN2022143821-appb-000054). The batch effect parameters $\gamma_{ig}$ and $\delta^2_{ig}$ are estimated by finding estimates of their posterior expected values, $E[\gamma_{ig}]$ and the expectation shown in Figure PCTCN2022143821-appb-000055.
  • The posterior expectation of $\gamma_{ig}$ can be expressed as:
  • The same approach can be used to calculate the posterior expectation of $\delta^2_{ig}$, which is used for the non-parametric classical Bayesian adjustment and can be expressed as
  • Data analysis and evaluation fall into two categories: analysis of the QC samples only, and evaluation of all samples (including the QC samples). Different analysis methods are used for each category.
  • A QC sample is generally inserted after every 10 to 20 samples. Analyzing the data quality of the QC samples reflects the stability of the whole data acquisition, which is the purpose of this quality control step. The analysis evaluates the bias of the raw data and whether normalization and/or batch effect correction improved the data. If the similarity between samples is no longer driven by technical factors and within-group repeatability is high, the bias is considered to be eliminated.
  • Cumulative QC sample RSD analysis: the order of the QC samples is determined from the mass spectrometry injection order (order) in the sample information file, and the molecular expression data of the QC samples are extracted from the quantitative data by sample name (ID). As QC samples are added one at a time, the RSD of each molecule is recalculated, with $\mathrm{RSD} = s / \bar{x}$, where $n$ is the number of samples, $x_i$ is the expression intensity of the molecule in the $i$-th sample, $\bar{x}$ is the average expression intensity of the molecule over all $n$ samples, and $s$ is the standard deviation of the $x_i$. The results are visualized with box plots: the abscissa is the cumulative number of QC samples and the ordinate is the RSD value.
  • The first box plot in the figure summarizes the RSD of molecular expression in the first two QC samples in injection order.
  • The second box plot summarizes the RSD of molecular expression in the first three QC samples.
  • The statistics are calculated in this way until the RSD over all QC samples has been computed. The plot shows how stable the QC samples remain as their number increases and reveals drift or dispersion of the QC mass spectrum signals. Because QC samples are inserted throughout data acquisition, and samples with RSD values below 0.3 are generally considered to have good detection stability, the plot also reflects the stability of the instrument over the entire run.
  • Figure 4 is a cumulative QC sample RSD box plot provided by Embodiment 3 of the present invention.
  • Intensity distribution analysis of each QC sample shows the expression intensity of the molecules in the QC samples. Because expression values span a large range, the abscissa is the log2-transformed quantitative value. The variance and outliers of the QC samples are evaluated to reflect overall stability.
  • Principal component analysis (PCA)
  • Expression intensity distribution analysis represents the intensity distribution of each sample with a box plot and plots the mean or median sample intensity in injection order, which allows signal drift or discrete deviations during measurement to be estimated.
  • Correlation analysis calculates the Pearson/Spearman correlation coefficient between samples. This analysis compares sample repeatability, in particular the correlation within a group. A high correlation among QC samples indicates good data quality.
  • A higher correlation between samples from the same batch than between samples from unrelated batches is a clear sign of bias and may reflect batch effects; within-batch and between-batch correlations can therefore be assessed separately.
  • The results are visualized with a heat map and a violin plot. The heat map uses a color gradient according to the size of the correlation coefficient and can also show how far a given sample deviates from the others. Because a large cohort contains many samples, a heat map is inconvenient for displaying and inspecting the data, so a violin plot is also used to show the correlation between samples within each group clearly. Color indicates the group, and larger values indicate stronger within-group correlation.
  • FIG. 6 is a schematic structural diagram of a device for batch correction of omics data provided in Embodiment 4 of the present invention. As shown in Figure 6, the device includes:
  • the data preprocessing module 610 is used to obtain omics data of multiple batches of samples, preprocess the omics data, and obtain preprocessed data;
  • the batch correction module 620 is used to perform batch correction processing on the preprocessed data to obtain correction data;
  • the data analysis module 630 is used to perform a preset type of analysis processing on the correction data to obtain batch correction analysis results.
  • The technical solution of this embodiment provides a batch correction device for omics data that obtains omics data of multiple batches of samples and preprocesses the omics data to obtain preprocessed data; performs batch correction processing on the preprocessed data to obtain correction data; and performs a preset type of analysis processing on the correction data to obtain batch correction analysis results.
  • Batch correction of omics data acquired in batches is thereby realized, so that data quality can be evaluated more accurately and efficiently.
  • the batch correction module 620 is specifically used for:
  • Correction data is determined based on the initial expression amount and correction parameters of the omics data.
  • the batch correction module 620 is specifically used for:
  • Correction parameters are determined based on the initial correction parameters and the preset distribution.
  • the batch correction module 620 is specifically used for:
  • Correction data is determined based on the normalized values and the correction parameters.
  • the data analysis module 630 is specifically used for:
  • the data analysis module 630 is specifically used for:
  • the analysis results without batch correction are compared with the batch-corrected analysis results to determine whether there are batch quality abnormalities in the omics data.
  • the data analysis module 630 is specifically used for:
  • Each preprocessed preprocessed data is subjected to a preset type of analysis processing to obtain at least one analysis result without batch correction.
  • the data analysis module 630 is specifically used for:
  • the omics data groups for analysis are sequentially determined according to the batch sequence of the test samples, and a preset type of analysis processing is performed on each omics data group to obtain the analysis results of each omics data group;
  • the data analysis module 630 is specifically used for:
  • the abnormal test sample is determined based on the omics data group with abnormal analysis results, and the batch in which the abnormal test sample is located is determined as the abnormal batch.
  • the data analysis module 630 is specifically used for:
  • the preset type of analysis processing includes: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and mixed sample repeatability analysis.
  • the device for batch correction of omics data provided by embodiments of the present invention can execute a batch correction method of omics data provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present invention.
  • Electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the invention described and/or claimed herein.
  • The electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, wherein the memory stores a computer program that can be executed by the at least one processor.
  • The processor 11 can perform various appropriate actions and processing based on a computer program stored in the read-only memory (ROM) 12 or a computer program loaded from the storage unit 18 into the random access memory (RAM) 13.
  • The RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to bus 14 .
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem or a wireless communication transceiver.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller or microcontroller.
  • the processor 11 performs various methods and processes described above, such as a batch correction method for omics data.
  • a method for batch correction of omics data can be implemented as a computer program, which is tangibly included in a computer-readable storage medium, such as the storage unit 18 .
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform a batch correction method of omics data in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
  • a computer program for implementing a batch correction method for omics data of the present invention can be written using any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • a computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • Embodiment 6 of the present invention also provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions.
  • the computer instructions are used to cause the processor to execute a batch correction method for omics data.
  • The method includes:
  • obtaining omics data of multiple batches of samples, and preprocessing the omics data to obtain preprocessed data;
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host. It is a host product in the cloud computing service system that overcomes the difficult management and weak business scalability of traditional physical hosts and VPS services.


Abstract

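The present invention discloses a batch correction method, device, storage medium and electronic device for omics data. The batch correction method for omics data includes: obtaining omics data of multiple batches of samples, and preprocessing the omics data to obtain preprocessed data; performing batch correction processing on the preprocessed data to obtain correction data; and performing a preset type of analysis processing on the correction data to obtain batch correction analysis results. The method effectively reduces the errors of batch detection and allows data quality to be evaluated more accurately and efficiently.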

Description

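Batch correction method, device, storage medium and electronic device for omics data
This application claims priority to the Chinese patent application No. 202211097799.1, filed with the China National Intellectual Property Administration on September 8, 2022 and entitled "Batch correction method, device, storage medium and electronic device for omics data", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of biological analysis technology, and in particular to a batch correction method, device, storage medium and electronic device for omics data.
Background
Proteomics, metabolomics and lipidomics based on mass spectrometry have become key methods for biological analysis.
However, when a large number of biological samples are measured by mass spectrometry, the samples are divided into multiple batches for detection, which inevitably introduces various errors.
Summary of the Invention
The present invention provides a batch correction method, device, storage medium and electronic device for omics data, in order to address the errors caused by batch detection.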
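According to one aspect of the present invention, a batch correction method for omics data is provided. The method includes:
obtaining omics data of multiple batches of samples, and preprocessing the omics data to obtain preprocessed data;
performing batch correction processing on the preprocessed data to obtain correction data;
performing a preset type of analysis processing on the correction data to obtain batch correction analysis results.
According to another aspect of the present invention, a batch correction device for omics data is provided. The device includes:
a data preprocessing module, used to obtain omics data of multiple batches of samples and preprocess the omics data to obtain preprocessed data;
a batch correction module, used to perform batch correction processing on the preprocessed data to obtain correction data;
a data analysis module, used to perform a preset type of analysis processing on the correction data to obtain batch correction analysis results.
According to another aspect of the present invention, an electronic device is provided. The electronic device includes:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the batch correction method for omics data of any embodiment of the present invention.
According to another aspect of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores computer instructions which, when executed by a processor, implement the batch correction method for omics data of any embodiment of the present invention.
In the technical solution of the embodiments of the present invention, omics data of multiple batches of samples are obtained and preprocessed to obtain preprocessed data. Batch correction processing is performed on the preprocessed data to obtain correction data. A preset type of analysis processing is performed on the correction data to obtain batch correction analysis results. This addresses the errors caused by batch detection, so that the quality of omics data can be evaluated more accurately and efficiently.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present invention, nor to limit the scope of the present invention. Other features of the present invention will become easy to understand from the following description.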
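Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flowchart of a batch correction method for omics data provided in Embodiment 1 of the present invention;
Figure 2 is a flowchart of a batch correction method for omics data provided in Embodiment 2 of the present invention;
Figure 3 is a flowchart of the batch correction method for omics data provided in Embodiment 3 of the present invention;
Figure 4 is a cumulative QC sample RSD box plot provided in Embodiment 3 of the present invention;
Figure 5 is a cumulative QC sample RSD percentage bar chart provided in Embodiment 3 of the present invention;
Figure 6 is a schematic structural diagram of a batch correction device for omics data provided in Embodiment 4 of the present invention;
Figure 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present invention.
Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and not necessarily to describe a specific order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments described here can be practiced in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.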
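Embodiment 1
Figure 1 is a flowchart of a batch correction method for omics data provided in Embodiment 1 of the present invention. This embodiment applies to the case in which a large number of samples are processed by mass spectrometry, the omics data of each sample are obtained by analysis, and batch correction and analysis are then performed. The method can be executed by a batch correction device for omics data, which can be implemented in hardware and/or software and can be configured in an electronic device such as a computer. As shown in Figure 1, the method includes:
S110. Obtain omics data of multiple batches of samples, and preprocess the omics data to obtain preprocessed data.
In this embodiment, mass spectrometry is an important analytical technique in research fields such as biological macromolecules and can be used to analyze the sample data. Because the amount of sample data is large, the mass spectrometry analysis has to be carried out in multiple batches. Omics data can be molecular expression information obtained by mass spectrometry analysis of multiple batches of samples, and can include genomics data, proteomics data, lipidomics data, metabolomics data and the like.
The omics data of each sample obtained by mass spectrometry are acquired, for example by receiving imported omics data of the samples of each batch. The sample information of each sample is acquired at the same time and can be associated with the omics data to support their analysis. Sample information can include, but is not limited to, the sample name, the group name of the sample, the mass spectrometry run order and the batch information. The omics data can be determined from the batch information, the sample name and the group name of the sample (that is, the sample type) in the sample information.
The acquired omics data are preprocessed to obtain preprocessed data. The objects of preprocessing can include the omics data and the associated sample information. Preprocessing can include one or more of data cleaning, data normalization and missing value processing. Data cleaning re-examines and verifies the data: duplicate information can be deleted and erroneous information corrected, so that the data are correct and tidy. Data normalization restricts the processed data to a certain range so that the distributions of the measured data remain consistent and the adverse effects of singular sample data are eliminated; for example, median normalization or variance stabilization normalization can be used. Missing value processing can include filling missing values and discarding omics data that contain missing values. Filling replaces the missing entries in the sample data to reduce their influence on the detection results, for example with a default value, the mean, the mode, the K-nearest-neighbors (knn) method or interpolation. Omics data can be discarded when the number of missing values exceeds a certain level and filling is not possible, which prevents such data from interfering with the overall omics data. The preprocessed data are used for the subsequent data correction and data analysis.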
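S120. Perform batch correction processing on the preprocessed data to obtain correction data.
In this embodiment, multiple batches of omics data are obtained through multiple batches of mass spectrometry processing. During the processing of different batches, errors caused by non-biological factors may affect the accuracy of the omics data; that is, the omics data obtained in different batches may contain batch abnormalities. Non-biological factors can be external factors such as abnormal mass spectrometry equipment or operator error. Batch correction corrects the batch differences caused by non-biological factors, removing batch influence, bias and systematic error as far as possible so that the data represent their underlying biological state. Batch correction can be implemented with the ComBat method, the surrogate variable method, the mean-centering method, the distance-weighted discrimination method or other methods, which is not limited here. Batch correction of the preprocessed data yields correction data, which are generally the omics data after the batch differences have been removed.
In some embodiments, performing batch correction processing on the preprocessed data to obtain correction data includes: for any type of omics data of any sample in any batch, determining correction parameters based on the batch matrix of the omics data and the initial expression amount of the omics data; and determining correction data based on the initial expression amount of the omics data and the correction parameters.
In this embodiment, expression information is pooled across the molecules in each batch to estimate the L/S model parameters that represent the batch effect, so that each batch-effect parameter estimate is shrunk toward the overall mean of the batch-effect estimates (across molecules). The classical Bayesian estimates are used to adjust the data for batch effects, providing a more robust adjustment for each molecule.
The initial expression amount can be the molecular expression amount in the uncorrected omics data. The batch matrix of the omics data is formed from the batch information of the samples to which the omics data belong, and the correction parameters are used to correct the initial expression amount in the omics data to obtain correction data. In some embodiments, a calculation formula for the correction data can be set in advance, and the correction data are obtained by entering the determined correction parameters and the initial expression amount of the omics data to be corrected into this formula.
For any type of omics data of any sample in any batch, the corresponding correction parameters are determined.
In some embodiments, determining correction parameters based on the batch matrix of the omics data and the initial expression amount of the omics data includes: determining initial correction parameters based on the batch matrix and the initial expression amount of the omics data; and determining correction parameters based on the initial correction parameters and the preset distribution. The initial correction parameters include the overall average expression amount of the omics data, the regression coefficient variable corresponding to the batch matrix, the additive batch effect parameter and the multiplicative batch effect parameter, and the error term follows a standard normal distribution.
Correspondingly, each initial correction parameter corresponds one-to-one to a correction parameter, and the corresponding correction parameter is determined from the initial correction parameter. Because mass spectrometers differ in sensitivity to different molecules, expression amounts may differ between molecules, which biases the classical Bayesian estimate of the prior distribution of the batches. The molecular expression data are therefore standardized so that the molecules have similar overall means and variances.
The correction parameters follow the preset distribution. Different correction parameters may have the same or different distributions, which is not limited here. The corresponding correction parameters are obtained by estimation from the initial correction parameters with the least squares method. That is, for any of the overall average expression amount, the regression coefficient variable corresponding to the batch matrix and the additive batch effect parameter among the initial correction parameters, the corresponding correction parameter can be obtained by least squares estimation.
The variance can be further estimated from the estimated overall average expression amount, regression coefficient variable corresponding to the batch matrix, and additive batch effect parameter. For example, for each type of molecule, the variance can be determined from the differences between the initial expression amount of each sample and the above correction parameters. The correction parameters are assumed to follow the preset distribution, and the prior distribution of the batch effect parameters follows the preset distribution, which can be a normal distribution.
On the basis of the above embodiments, determining correction data based on the initial expression amount of the omics data and the correction parameters includes: determining standardized values based on the initial expression amount of the omics data and the correction parameters; and determining correction data based on the standardized values and the correction parameters.
Standardizing the omics data gives them similar overall means and variances, and the values after standardization serve as the standardized values. The standardized values can be determined from the initial expression amount, the estimated overall average expression amount, the batch matrix, the estimated regression coefficients and the estimated error. The correction data, which are the omics data after batch correction, can be determined from the standardized values and the correction parameters. The correction data can therefore be determined from the initial expression amount, the estimated overall average expression amount, the batch matrix, the estimated regression coefficients, the estimated additive batch effect parameter and the estimated multiplicative batch effect parameter.
In this embodiment, correcting the omics data removes the batch differences in them, avoids the influence of non-biological factors, improves the accuracy of the omics data and further improves the accuracy of omics data analysis.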
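S130. Perform a preset type of analysis processing on the correction data to obtain batch correction analysis results.
In this embodiment, the corrected omics data are analyzed in order to analyze the samples. Omics data are high-dimensional and need to be analyzed along several dimensions. Optionally, the preset type of analysis processing includes, but is not limited to: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, molecular trend analysis, sample correlation analysis and test sample repeatability analysis. It should be noted that processing rules for each type of analysis are stored in advance; the rule that matches the analysis requirement is called, and the corrected omics data are analyzed according to it. Specifically, the type of analysis can be set in advance, for example by entering a type identifier, and the corresponding processing rule is called based on that identifier.
In some embodiments, the analysis can be performed on the correction data of a subset of the samples or on the correction data of all samples, depending on the analysis requirement, which is not limited here. In some embodiments, the subset can be the quality control (QC) samples in each batch.
For example, sample intensity distribution analysis represents the intensity distribution of each sample with a box plot and plots the mean or median sample intensity in injection order, which allows signal drift or discrete deviations during measurement to be estimated. Dimensionality reduction analysis is used to view how samples cluster without supervision under linear principal component analysis (PCA) or nonlinear t-distributed stochastic neighbor embedding (TSNE) / uniform manifold approximation and projection (UMAP). PCA is a technique for identifying the main directions of variation, called principal components. Projecting the data onto two component axes visualizes sample proximity. Additionally coloring the samples by technical or biological factors, or highlighting replicates, helps explain what drives that proximity. This makes it possible to assess clustering by biological and technical factors or to check the similarity of replicates; when the similarity between samples is no longer driven by technical factors, PCA/TSNE/UMAP shows no clustering by batch. Discriminant analysis is used to view the clustering distribution of samples under supervised modeling, so that the extracted feature variables summarize the information in the original variables well and have strong explanatory power for the dependent variable. These results are shown as scatter plots in which each point is a sample and color indicates the group of the sample. For molecular trend analysis, 50 molecules can be selected at random and their expression intensities across all samples displayed as a scatter plot in two colors, one for QC samples and one for target samples, with a curve fitted to each of the two sample types to characterize the stability of the molecule in the QC samples.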
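The PCA check for batch clustering described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes a samples-by-molecules NumPy matrix, computes the principal components through a singular value decomposition, and uses simulated data in which the second batch is shifted on every molecule. The function name `pca_scores` and the simulated values are hypothetical.

```python
import numpy as np

def pca_scores(x, n_components=2):
    """Project samples (rows) onto the leading principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)                # center each molecule
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)        # variance fraction per component
    return scores, explained[:n_components]

# Simulated example (assumption): 10 samples, 30 molecules, batch 1 shifted by +3.
rng = np.random.default_rng(5)
batches = np.array([0] * 5 + [1] * 5)
x = rng.normal(0.0, 1.0, size=(10, 30))
x[batches == 1] += 3.0

scores, explained = pca_scores(x)
pc1 = scores[:, 0]
# If PC1 separates the two batches, similarity is still driven by a technical factor.
separated = (pc1[batches == 0].max() < pc1[batches == 1].min()
             or pc1[batches == 0].min() > pc1[batches == 1].max())
print("PC1 separates batches:", separated)
```

After a successful batch correction, the same check would be expected to show the two batches overlapping on the leading components.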
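Sample correlation analysis, for example, calculates the Pearson/Spearman correlation coefficient between samples. This analysis compares sample repeatability, in particular the correlation within a group; a high correlation among QC samples indicates good data quality. When correlations within and between batches are evaluated, a higher correlation between samples from the same batch than between samples from unrelated batches is a clear sign of bias and can reflect batch influence. The results are visualized with a heat map and a violin plot. The heat map uses a color gradient according to the size of the correlation coefficient and can also show how far a given sample deviates from the others. Because a large cohort contains many samples, a heat map is inconvenient for displaying and inspecting the data, so a violin plot is also used to show the correlation between samples within each group clearly. Color indicates the group, and larger values indicate stronger within-group correlation.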
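The within-batch versus between-batch comparison can be sketched with NumPy alone. The example below is an illustration under assumptions: it uses the Pearson coefficient only (a Spearman version would rank-transform each sample first), and the function name `batch_correlation_summary` and the simulated batch offsets are hypothetical.

```python
import numpy as np

def batch_correlation_summary(x, batches):
    """Return (mean within-batch r, mean between-batch r) over sample pairs."""
    x = np.asarray(x, dtype=float)
    batches = np.asarray(batches)
    r = np.corrcoef(x)                                    # sample-by-sample Pearson matrix
    same = batches[:, None] == batches[None, :]
    upper = np.triu(np.ones(r.shape, dtype=bool), k=1)    # count each pair once
    within = r[same & upper].mean()
    between = r[~same & upper].mean()
    return within, between

# Simulated example (assumption): a shared profile plus a per-batch offset per molecule.
rng = np.random.default_rng(1)
profile = rng.normal(10.0, 1.0, size=40)
offsets = {1: rng.normal(0.0, 1.0, size=40), 2: rng.normal(0.0, 1.0, size=40)}
batches = np.array([1, 1, 1, 2, 2, 2])
x = np.array([profile + offsets[b] + rng.normal(0.0, 0.2, size=40) for b in batches])

within, between = batch_correlation_summary(x, batches)
print(f"within-batch r = {within:.2f}, between-batch r = {between:.2f}")
```

A within-batch value clearly above the between-batch value is the sign of bias that the paragraph above describes.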
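The technical solution of this embodiment provides a batch correction method for omics data: omics data of multiple batches of samples are obtained and preprocessed to obtain preprocessed data, batch correction processing is performed on the preprocessed data to obtain correction data, and a preset type of analysis processing is performed on the correction data to obtain batch correction analysis results. Batch correction of omics data acquired in multiple batches addresses the errors caused by batch detection, so that data quality can be evaluated more accurately and efficiently.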
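The three-stage flow (preprocess, correct, analyze) can be sketched end to end. This is a simplified illustration under stated assumptions: preprocessing is reduced to per-sample median centering, the correction step uses batch mean-centering (one of the alternative methods the text names) instead of ComBat, and the analysis is a single hypothetical metric, the largest gap between batch means. All function names are illustrative.

```python
import numpy as np

def preprocess(x):
    """Median-center each sample (row); a stand-in for the preprocessing step."""
    return x - np.median(x, axis=1, keepdims=True)

def mean_center_by_batch(x, batches):
    """Remove each batch's per-molecule mean and restore the overall mean."""
    corrected = x.copy()
    grand = x.mean(axis=0)
    for b in np.unique(batches):
        rows = batches == b
        corrected[rows] = x[rows] - x[rows].mean(axis=0) + grand
    return corrected

def batch_mean_gap(x, batches):
    """Largest absolute difference between batch means over all molecules."""
    means = np.array([x[batches == b].mean(axis=0) for b in np.unique(batches)])
    return float((means.max(axis=0) - means.min(axis=0)).max())

def run_pipeline(raw, batches):
    pre = preprocess(np.asarray(raw, dtype=float))
    before = batch_mean_gap(pre, batches)          # analysis without batch correction
    corrected = mean_center_by_batch(pre, batches)
    after = batch_mean_gap(corrected, batches)     # analysis after batch correction
    return corrected, before, after

# Simulated example (assumption): batch 2 is shifted on the first three molecules.
rng = np.random.default_rng(11)
batches = np.array([1, 1, 1, 2, 2, 2])
raw = rng.normal(10.0, 1.0, size=(6, 8))
raw[batches == 2, :3] += 4.0

corrected, before, after = run_pipeline(raw, batches)
print(f"batch mean gap before = {before:.2f}, after = {after:.2e}")
```

Reporting the same metric before and after correction mirrors the comparison of uncorrected and corrected analysis results used to judge batch quality.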
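Embodiment 2
Figure 2 is a flowchart of a batch correction method for omics data provided in Embodiment 2 of the present invention. This embodiment refines the above embodiment. As shown in Figure 2, the method includes:
S210. Obtain omics data of multiple batches of samples, and preprocess the omics data to obtain preprocessed data.
S220. Perform a preset type of analysis processing on the preprocessed data to obtain analysis results without batch correction.
The preset type of analysis processing is applied to the preprocessed data in the same way as it is applied to the correction data in the above embodiment. Omics data are high-dimensional and need to be analyzed along several dimensions. The preset type of analysis processing can include, but is not limited to: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and test sample repeatability analysis. Performing the preset type of analysis on the preprocessed data makes it possible to evaluate their data quality; the results are the analysis results without batch correction.
S230. Perform batch correction processing on the preprocessed data to obtain correction data.
S240. Perform a preset type of analysis processing on the correction data to obtain batch correction analysis results.
S250. Display the analysis results without batch correction and the batch correction analysis results on a display device.
In this embodiment, the analysis results without batch correction can be sent to a display device, such as a computer screen or an electronic display, and the batch correction analysis results are displayed at the same time. The form of display depends on the preset type and can be an image, a chart, numbers and so on. The results without batch correction and the results after batch correction can thus be presented intuitively.
The preprocessing includes one or more of data cleaning, data normalization and missing value processing. Correspondingly, performing a preset type of analysis processing on the preprocessed data to obtain analysis results without batch correction includes: performing the preset type of analysis processing on the preprocessed data after each preprocessing item, to obtain at least one analysis result without batch correction.
In this embodiment, preprocessing can be any one of data cleaning, data normalization and missing value processing, any combination of two of these steps, or all three steps. The preset types can follow those of the above embodiment: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and test sample repeatability analysis. The preset type of analysis is performed on the data after each preprocessing item. When preprocessing consists of one step, one analysis result without batch correction is obtained; when it consists of two or more steps, the preset type of analysis can be performed after each step, giving two or more analysis results without batch correction.
In the above embodiment, the display device shows the analysis results without batch correction and after batch correction so that the operator can view and compare them and determine whether the omics data obtained from the mass spectrometry process contain batch abnormalities; the images, charts and numbers on the display interface allow the two sets of results to be compared intuitively. Optionally, the analysis results without batch correction are compared with the batch correction analysis results to determine whether the omics data have batch quality abnormalities. For example, analysis results of the same preset type can be compared, and the presence of a batch quality abnormality is determined from the comparison. For example, a batch quality abnormality can be determined when the difference between the analysis results of any type exceeds a preset error range. As an example, when the difference between the results without batch correction and the results after batch correction satisfies a preset threshold, the batch quality can be judged normal; when the difference does not satisfy the preset threshold, a batch quality abnormality can be determined.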
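In some embodiments, each batch of samples includes a test sample (for example, a QC sample), and the omics data of the multiple batches of samples include the omics data of the test sample in each batch. The method further includes: determining, in the batch order of the test samples, the omics data groups to be analyzed in turn, and performing a preset type of analysis processing on each omics data group to obtain the analysis result of each group; and determining, based on the analysis results of the omics data groups, whether the omics data have batch quality abnormalities.
The test sample can be a qualified sample or a mixture of the samples of each batch. The test samples in different batches are identical, so the omics data obtained from them by mass spectrometry are in theory identical. A test sample is inserted into each batch so that the omics data of the multiple batches of samples include the omics data of the test sample in each batch. Only the omics data of the test samples are analyzed here, which reduces the amount of data to be processed. The omics data of the test samples are analyzed with the preset type of analysis in batch order; because these data are in theory identical, the analysis results of the batches should in theory be identical. When the error between the analysis result of the test sample in each batch and those of the other batches satisfies a preset threshold, the batch quality can be judged normal. When the error for a certain batch does not satisfy the preset threshold, the batch quality can be judged abnormal.
Optionally, determining whether the omics data have batch quality abnormalities based on the analysis results of the omics data groups includes: determining whether the analysis results of the omics data groups contain an abnormal analysis result; and if so, determining the abnormal test sample based on the omics data group with the abnormal analysis result, and determining the batch containing the abnormal test sample as the abnormal batch.
In this embodiment, the analysis methods for the test samples can include one or more of cumulative test sample relative standard deviation (RSD) analysis, cumulative test sample RSD percentage analysis, intensity analysis of each test sample, and molecular stability analysis. Taking cumulative test sample RSD analysis as an example, the order of the test samples can follow the batch order; as test samples are added, multiple omics data groups are determined, and the RSD of each group is calculated in turn. For example, the first omics data group can be the omics data of the test samples in the first two batches, the second group can be the omics data of the test samples in the first three batches, and so on. RSD analysis is performed on each omics data group, and the RSD values of the groups indicate whether a batch abnormality exists and which batch is abnormal. For example, the results can be visualized with box plots: the first box plot summarizes the RSD of the omics data in the test samples of the first two batches, the second summarizes the RSD for the first three batches, and so on until the RSD over all test samples has been calculated. As test samples are added, the box plots reveal drift or dispersion of the test samples. When the RSD values of the box plots satisfy the preset threshold, the batch quality can be judged normal. When the RSD value of the N-th box plot does not satisfy the preset threshold, the (N+1)-th test sample can be judged abnormal, and the (N+1)-th batch is determined to be the abnormal batch accordingly.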
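The cumulative RSD analysis can be sketched as below. This is an illustration under assumptions, not the patent's code: RSD is taken as the sample standard deviation (with `ddof=1`) divided by the mean, a group is flagged when its median RSD across molecules exceeds 0.3 (the text cites 0.3 as a common stability level, but the median-based rule is hypothetical), and the QC intensities are simulated.

```python
import numpy as np

def cumulative_qc_rsd(qc, start=2):
    """Per-molecule RSD over the first k QC samples, for k = start..n.

    qc: array of shape (n_qc_samples, n_molecules), rows in injection order.
    """
    qc = np.asarray(qc, dtype=float)
    groups = []
    for k in range(start, qc.shape[0] + 1):
        head = qc[:k]
        groups.append(head.std(axis=0, ddof=1) / head.mean(axis=0))
    return groups

def first_abnormal_group(groups, threshold=0.3):
    """Index of the first group whose median RSD exceeds the threshold, else None."""
    for idx, rsd in enumerate(groups):
        if np.median(rsd) > threshold:
            return idx
    return None

# Simulated example (assumption): four stable QC injections, then one drifted injection.
rng = np.random.default_rng(0)
stable = rng.normal(100.0, 5.0, size=(4, 50))
drifted = rng.normal(300.0, 5.0, size=(1, 50))
qc = np.vstack([stable, drifted])

groups = cumulative_qc_rsd(qc)
flagged = first_abnormal_group(groups)
print("first abnormal group index:", flagged)
```

Because group index 0 covers the first two QC samples, a flagged index of N-1 corresponds to the N-th box plot, which points at the (N+1)-th QC sample and its batch, in line with the rule stated above.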
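The technical solution of this embodiment adds to the above embodiment a preset type of analysis processing of the preprocessed data, which yields analysis results without batch correction. The results without batch correction and the results after batch correction are shown on a display device, and comparing them makes it possible to determine the batch quality of the omics data. Batch abnormalities are determined by including the omics data of test samples in the omics data of the multiple batches of samples: the preset type of analysis is performed on the omics data of the test samples, the abnormal test sample is determined from the analysis results, and the batch containing the abnormal test sample is determined to be the abnormal batch.
Embodiment 3
Figure 3 is a flowchart of the batch correction method for omics data provided in Embodiment 3 of the present invention. On the basis of the above embodiments, Embodiment 3 provides a preferred example of the batch correction method for omics data. The method includes: raw data input and raw data cleaning, omics data normalization and missing value processing, batch correction, and data analysis and evaluation.
Raw data input and raw data cleaning. Two input files are required. The sample information file is a text file constructed by the user (the format can be csv, txt or excel). It contains the following columns: sample name (ID), group name of the sample (Type), mass spectrometry injection order (order) and batch information (batch), where samples belonging to the same batch carry the same number or letter in the batch column. An example of the sample information file is shown in Table 1 below.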
ID     Type   order   batch
XX1    C1     1       1
QC1    QC     2       1
XX2    C1     3       1
XX3    C1     4       1
XX4    C1     5       1
QC2    C1     6       1
YY1    C2     7       2
YY2    C2     8       2
YY3    C2     9       2
YY4    C2     10      2
QC2    QC     11      2
Table 1
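The second input is the mass spectrometry output data. The software arranges the data format according to the omics type and, using the sample names (ID) in the sample information file, extracts and keeps the necessary information, such as the expression intensity of each protein or metabolite detected in each sample. The result is saved as a text file of raw expression intensities, with no calibration, standardization or correction applied relative to the values in other samples, for analysis by the subsequent data analysis and evaluation module. The module checks whether the sample intensity distributions are consistent, the correlation between samples, and the repeatability of the QC samples (that is, the test samples). If indicators such as intensity or sample correlation differ, it can be checked whether the intensity shows a batch-specific deviation or decreases gradually with the sample detection order. Comparing the global quantitative characteristics of the samples helps in choosing a normalization method and in identifying technical factors that need further control, and shows very clearly whether a large sample cohort is affected by batch.
Omics data normalization and missing value processing. Missing values are generally caused during instrument acquisition. Many analysis methods do not allow missing values, so missing values strongly affect the choice of method, and too many missing values prevent the data from being characterized accurately. However, simply discarding missing values, or filling them directly with an unsuitable method, loses a large amount of useful information or creates non-biological differences that lead to wrong conclusions. The missing value processing workflow removes the influence of missing values on the results as far as possible.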
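After the data have been processed by the first module (the raw data input and raw data cleaning module), the quantitative expression text file is first normalized. Several normalization methods are offered for the user to choose from, such as median normalization (
Figure PCTCN2022143821-appb-000001
Figure PCTCN2022143821-appb-000002
where x_i is the quantitative value of a molecule in the sample, x is the sequence of quantitative values of all molecules in the sample, and norm(x_i) is the value after median normalization) and quantile normalization (
Figure PCTCN2022143821-appb-000003
the normalized value in a sample equals the quantitative expression value of the molecule minus the median of the quantitative expression values in the sample, divided by the difference between the upper and lower quartiles of all molecules in the sample); the default is variance stabilization normalization (vsn). Missing values are then processed: based on the missing rate of each molecule across all samples, molecules whose missing rate exceeds the threshold are removed (the threshold is defined by the user in the range 0 to 1, where 1 keeps all molecules regardless of missing values and 0 removes every molecule that contains a missing value). The remaining missing values are then filled (with the K-nearest-neighbors method "knn" or multiple imputation), giving normalized quantitative results without missing values for the subsequent batch correction module and data analysis and evaluation module.
Normalization and missing value processing of the raw data effectively reduce the noise introduced into the data set by instrument instability and improve biological interpretability. Normalization keeps the distributions of the measured quantitative values consistent so that samples can be compared, eliminates the adverse effects of singular sample data, reduces the dispersion of the samples, and removes the influence of data skew on molecular expression.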
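The normalization and missing value steps can be sketched as below. This is a minimal illustration under assumptions: the scaling follows the verbal description above (value minus the sample median, divided by the interquartile range), the missing-rate filter keeps molecules whose missing fraction does not exceed the user threshold, and mean filling is swapped in for the knn or multiple-imputation filling named in the text to keep the example dependency-free. Function names and the toy matrix are hypothetical.

```python
import numpy as np

def robust_scale_sample(values):
    """(x - median) / (Q3 - Q1) for one sample, ignoring missing values (NaN)."""
    med = np.nanmedian(values)
    q1, q3 = np.nanpercentile(values, [25, 75])
    return (values - med) / (q3 - q1)

def filter_and_impute(x, max_missing_rate=0.5):
    """Drop molecules (columns) above the missing-rate threshold, mean-fill the rest."""
    x = np.asarray(x, dtype=float)
    missing_rate = np.isnan(x).mean(axis=0)
    keep = missing_rate <= max_missing_rate      # 0 keeps only complete molecules, 1 keeps all
    kept = x[:, keep].copy()
    col_mean = np.nanmean(kept, axis=0)
    rows, cols = np.where(np.isnan(kept))
    kept[rows, cols] = col_mean[cols]            # simple mean filling (assumption)
    return kept, keep

# Toy matrix (assumption): 4 samples (rows) by 4 molecules (columns).
x = np.array([[1.0, np.nan, 3.0, 10.0],
              [2.0, np.nan, np.nan, 20.0],
              [3.0, 6.0, 5.0, 30.0],
              [4.0, np.nan, 7.0, 40.0]])

normalized = np.apply_along_axis(robust_scale_sample, 1, x)   # normalize each sample first
filled, keep = filter_and_impute(normalized, max_missing_rate=0.5)
print(keep)
print(filled.shape)
```

With a threshold of 1 a molecule that is missing in every sample would be kept and could not be mean-filled, so a practical implementation would need an extra guard for that case.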
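Batch correction. Based on the batch information defined in the batch column of the input sample information file, the quantitative information obtained after normalization and missing value processing is read, and batch correction is performed with the ComBat method. The model assumes a location and scale (L/S) adjustment: a model is assumed for the location (mean) and/or scale (variance) of the data within each batch, and the batches are then adjusted to meet the specification of the assumed model. L/S batch adjustment therefore assumes that batch effects can be modeled out by standardizing the means and variances across batches. The L/S model is defined as follows:
$$Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg}$$
$Y_{ijg}$ is the expression value of molecule $g$ in sample $j$ of batch $i$; $\alpha_g$ is the overall average expression of molecule $g$; $X$ is a matrix built from the batch information of the samples, which comes from the batch column of the sample information file; $\beta_g$ is the regression coefficient variable corresponding to $X$; $\gamma_{ig}$ is the additive batch effect of batch $i$ on molecule $g$; the error term $\varepsilon_{ijg}$ follows a normal distribution with expected value 0 and variance $\delta^2_g$; and $\delta_{ig}$ is the multiplicative batch effect of molecule $g$ in batch $i$. Given the expression value $Y_{ijg}$ of molecule $g$ for each sample in each batch, the overall average expression $\alpha_g$ of molecule $g$ and the batch matrix $X$, the additive batch effect $\gamma_{ig}$, the multiplicative batch effect $\delta_{ig}$, the error term $\varepsilon_{ijg}$ and the regression coefficient variable $\beta_g$ can be determined by regression.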
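To make the roles of the additive and multiplicative terms concrete, the L/S model can be simulated. The sketch below is illustrative only: it drops the covariate term $X\beta_g$, fixes the error standard deviation at 0.5, and all parameter values and the function name `simulate_ls` are assumptions chosen for the example.

```python
import numpy as np

def simulate_ls(batch_sizes, gamma_means, delta_values, n_molecules=200, seed=42):
    """Draw Y_ijg = alpha_g + gamma_ig + delta_ig * eps_ijg (no covariate term)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(10.0, 1.0, size=n_molecules)       # overall mean per molecule
    sigma = 0.5                                           # error standard deviation
    blocks, labels = [], []
    for i, n_i in enumerate(batch_sizes):
        gamma_i = rng.normal(gamma_means[i], 0.3, size=n_molecules)  # additive effect
        eps = rng.normal(0.0, sigma, size=(n_i, n_molecules))
        blocks.append(alpha + gamma_i + delta_values[i] * eps)       # multiplicative effect
        labels += [i] * n_i
    return np.vstack(blocks), np.array(labels)

y, batches = simulate_ls([6, 6], gamma_means=[0.0, 1.5], delta_values=[1.0, 2.0])

# The additive effect shows up as a mean shift, the multiplicative effect as a wider spread.
shift = y[batches == 1].mean() - y[batches == 0].mean()
spread_ratio = (y[batches == 1].std(axis=0, ddof=1).mean()
                / y[batches == 0].std(axis=0, ddof=1).mean())
print(f"mean shift = {shift:.2f}, spread ratio = {spread_ratio:.2f}")
```

Data generated this way are a convenient test input for any correction routine, because the true batch effects are known.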
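The molecular expression after batch correction is:
Figure PCTCN2022143821-appb-000004
Figure PCTCN2022143821-appb-000005
are the parameter estimates of $\alpha_g$, $\beta_g$, $\gamma_{ig}$ and $\delta_{ig}$.
The ComBat algorithm builds on the above model and extends it with the classical Bayesian method: expression information is pooled across the molecules in each batch to estimate the L/S model parameters that represent the batch effect, so that each batch-effect parameter estimate is shrunk toward the overall mean of the batch-effect estimates (across molecules). The classical Bayesian estimates are then used to adjust the data for batch effects, providing a more robust adjustment for each molecule.
The method is described in the following three steps: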
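First, the molecular expression data are standardized so that the molecules have similar overall means and variances. For example, suppose there are $m$ batches, with $i = 1, \ldots, m$ indexing the batch, $g = 1, \ldots, G$ indexing the molecule, and $j$ indexing sample $j$ in batch $i$. The least squares method is used to estimate the theoretical values $\alpha_g$ and $\beta_g$, giving the approximations
Figure PCTCN2022143821-appb-000006
The specific method is as follows:
Assuming that the error term $\gamma_{ig}$ follows a normal distribution, at the molecule level the sum of the error terms over all molecules satisfies:
Figure PCTCN2022143821-appb-000007
From this the estimate of the variance
Figure PCTCN2022143821-appb-000008
is obtained ($N$ is the number of samples), and finally the standardized molecular expression value $Z_{ijg}$ is:
Figure PCTCN2022143821-appb-000009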
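The standardization step can be sketched as follows. Because the formula images are not reproduced in this text, the sketch follows the usual ComBat standardization as an assumption: a least squares fit on batch indicators, an overall mean formed as the sample-size-weighted average of the batch coefficients (so that the weighted additive effects sum to zero), a pooled residual variance divided by $N$, and no biological covariates. The function name `standardize` is hypothetical.

```python
import numpy as np

def standardize(y, batches):
    """Return Z = (Y - alpha_hat) / sigma_hat with least squares estimates."""
    y = np.asarray(y, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    design = (batches[:, None] == labels[None, :]).astype(float)   # N x m batch indicators
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)              # m x G batch means
    weights = design.sum(axis=0) / len(batches)                    # n_i / N
    alpha_hat = weights @ coef                                     # overall mean per molecule
    residual = y - design @ coef
    sigma_hat = np.sqrt((residual ** 2).mean(axis=0))              # pooled sd, divisor N
    z = (y - alpha_hat) / sigma_hat
    return z, alpha_hat, sigma_hat

# Simulated example (assumption): two unbalanced batches, second batch shifted by +2.
rng = np.random.default_rng(3)
batches = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y = rng.normal(10.0, 1.0, size=(9, 12))
y[batches == 1] += 2.0

z, alpha_hat, sigma_hat = standardize(y, batches)
print(np.round(z.mean(axis=0)[:3], 6))
```

After this step every molecule has mean zero and unit pooled within-batch variance, so the remaining batch means and variances of $Z$ are comparable across molecules.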
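The standardized data should satisfy $Z_{ijg} \sim N(\gamma_{ig}, \delta^2_{ig})$ (this $\gamma_{ig}$ does not have the same meaning as the error term in (1)). If the normal distribution parameters $\gamma_{ig}$, $\delta^2_{ig}$ of $Z_{ijg}$ satisfy
Figure PCTCN2022143821-appb-000010
and $\delta^2_{ig} \sim \text{Inverse Gamma}(\lambda_i, \theta_i)$, parametric empirical Bayes is used. If the normal distribution parameters of $Z_{ijg}$ do not satisfy these conditions, a more flexible prior distribution is needed, and non-parametric empirical Bayes can be used. The batch effect estimates $\gamma^{*}_{ig}$ and $\delta^{2*}_{ig}$ are calculated in this way, and finally the adjusted molecular expression data $Y^{*}_{ijg}$ is obtained.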
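When parametric empirical Bayes is used, the derivation is as follows:
Step 1 performs empirical hyperprior estimation to derive estimates of $\theta_i$ and $\lambda_i$.
Figure PCTCN2022143821-appb-000011
Figure PCTCN2022143821-appb-000012
and $\delta_{ig}^2 \sim \text{Inverse Gamma}(\lambda_i, \theta_i)$ contain the hyperparameters $\gamma_i$,
Figure PCTCN2022143821-appb-000013
$\lambda_i$ and $\theta_i$, which are obtained by the method of moments.
The sample mean of molecule $g$ in batch $i$ is
Figure PCTCN2022143821-appb-000014
so the estimates of $\gamma_i$ and
Figure PCTCN2022143821-appb-000015
can be expressed as:
Figure PCTCN2022143821-appb-000016
Figure PCTCN2022143821-appb-000017
Likewise, from
Figure PCTCN2022143821-appb-000018
the sample variance of molecule $g$ in batch $i$ is obtained:
Figure PCTCN2022143821-appb-000019
From this one can compute the mean of
Figure PCTCN2022143821-appb-000020
namely
Figure PCTCN2022143821-appb-000021
and the variance of
Figure PCTCN2022143821-appb-000022
namely
Figure PCTCN2022143821-appb-000023
Figure PCTCN2022143821-appb-000024
are set equal to the theoretical moments of the inverse gamma distribution, that is, the mean
Figure PCTCN2022143821-appb-000025
and the variance
Figure PCTCN2022143821-appb-000026
from which the estimates of $\theta_i$ and $\lambda_i$ follow:
Figure PCTCN2022143821-appb-000027
Figure PCTCN2022143821-appb-000028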
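The method-of-moments step can be sketched as follows. The formulas in the patent are only available as images, so this sketch assumes the standard inverse gamma moments, mean $\theta/(\lambda-1)$ and variance $\theta^2/((\lambda-1)^2(\lambda-2))$, and solves them for $\lambda_i$ and $\theta_i$. The function name `hyperparameters` and its argument names are hypothetical; the inputs are the per-molecule estimates within one batch.

```python
from statistics import mean, variance

def hyperparameters(gamma_hat, delta2_hat):
    """Method-of-moments hyperparameter estimates for ONE batch.

    gamma_hat:  per-molecule additive effect estimates in the batch
    delta2_hat: per-molecule variance estimates in the batch
    """
    gamma_bar = mean(gamma_hat)       # prior mean of gamma_ig
    tau2_bar = variance(gamma_hat)    # prior variance of gamma_ig
    v_bar = mean(delta2_hat)          # mean of the variance estimates
    s2_bar = variance(delta2_hat)     # variance of the variance estimates
    # Match v_bar = theta / (lam - 1) and
    # s2_bar = theta^2 / ((lam - 1)^2 (lam - 2)), then solve:
    lam = v_bar ** 2 / s2_bar + 2
    theta = v_bar * (lam - 1)
    return gamma_bar, tau2_bar, lam, theta
```

Substituting the returned `lam` and `theta` back into the inverse gamma mean and variance reproduces `v_bar` and `s2_bar`, which is a convenient sanity check.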
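Step 2 performs the parametric batch-effect correction. Bayes' theorem is applied to find the conditional (posterior) distribution of $\gamma_{ig}$
Figure PCTCN2022143821-appb-000029
The posterior distribution should satisfy:
Figure PCTCN2022143821-appb-000030
This expression can be identified as the kernel of a normal distribution whose expected value is:
Figure PCTCN2022143821-appb-000031
Based on the previously obtained
Figure PCTCN2022143821-appb-000032
and
Figure PCTCN2022143821-appb-000033
one can estimate
Figure PCTCN2022143821-appb-000034
Figure PCTCN2022143821-appb-000035
For the conditional posterior distribution of
Figure PCTCN2022143821-appb-000036
given $\gamma_{ig}$ and the inverse gamma prior (Inverse Gamma$(\lambda_i, \theta_i)$), the following should hold:
Figure PCTCN2022143821-appb-000037
Figure PCTCN2022143821-appb-000038
This expression can be regarded as an inverse gamma distribution with the following expected value:
Figure PCTCN2022143821-appb-000039
Using the estimates obtained by the method of moments in Step 1,
Figure PCTCN2022143821-appb-000040
Figure PCTCN2022143821-appb-000041
this expected value can be written as:
Figure PCTCN2022143821-appb-000042
Because
Figure PCTCN2022143821-appb-000043
Figure PCTCN2022143821-appb-000044
have no closed-form solution, an iterative approach is used. A reasonable starting value of
Figure PCTCN2022143821-appb-000045
(for example
Figure PCTCN2022143821-appb-000046
) is first substituted to compute
Figure PCTCN2022143821-appb-000047
The resulting
Figure PCTCN2022143821-appb-000048
is then used to compute
Figure PCTCN2022143821-appb-000049
and the cycle is repeated until the values of
Figure PCTCN2022143821-appb-000050
Figure PCTCN2022143821-appb-000051
converge.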
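The iteration can be sketched for one molecule in one batch as below. Because the patent's formulas are only available as images, the update rules are an assumption: they are the posterior means of the standard ComBat parametric model, $\gamma^* = (n\bar{\tau}^2\hat{\gamma} + \delta^{2*}\bar{\gamma})/(n\bar{\tau}^2 + \delta^{2*})$ and $\delta^{2*} = (\bar{\theta} + \tfrac{1}{2}\sum_j (Z_j - \gamma^*)^2)/(n/2 + \bar{\lambda} - 1)$. The function name `eb_estimates`, the choice of the sample variance as starting value and the tolerance are also assumptions.

```python
def eb_estimates(z, gamma_bar, tau2_bar, lam, theta, tol=1e-8, max_iter=1000):
    """Iterate the two posterior-mean equations until both values converge.

    z: standardized values Z_ijg of one molecule in one batch (n >= 2)
    gamma_bar, tau2_bar, lam, theta: hyperparameter estimates for the batch
    """
    n = len(z)
    gamma_hat = sum(z) / n                                    # sample mean
    delta2 = sum((x - gamma_hat) ** 2 for x in z) / (n - 1)   # starting value
    gamma = gamma_hat
    for _ in range(max_iter):
        # posterior mean of gamma given the current delta^2
        gamma_new = (n * tau2_bar * gamma_hat + delta2 * gamma_bar) / (n * tau2_bar + delta2)
        # posterior mean of delta^2 given the new gamma
        ss = sum((x - gamma_new) ** 2 for x in z)
        delta2_new = (theta + 0.5 * ss) / (n / 2 + lam - 1)
        done = abs(gamma_new - gamma) < tol and abs(delta2_new - delta2) < tol
        gamma, delta2 = gamma_new, delta2_new
        if done:
            break
    return gamma, delta2
```

The returned additive effect always lies between the batch-wide prior mean and the molecule's own sample mean, which is the shrinkage behavior that makes the adjustment robust for small batches.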
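Using an approach similar to the basic L/S model, the final adjusted expression data $Y^*_{ijg}$ are obtained as follows:
Figure PCTCN2022143821-appb-000052
$Y^*_{ijg}$ is the final corrected data.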
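When nonparametric empirical Bayes is used, the derivation is as follows:
In this case $Z_{ijg}$ still follows the normal distribution $Z_{ijg} \sim N(\gamma_{ig}, \delta_{ig}^2)$. As in the derivation above, let
Figure PCTCN2022143821-appb-000053
Figure PCTCN2022143821-appb-000054
The batch-effect parameters $\gamma_{ig}$ and $\delta_{ig}^2$ are estimated by finding estimates of their posterior expectations, $E[\gamma_{ig}]$ and
Figure PCTCN2022143821-appb-000055
$Z_{ig}$ is the vector containing $Z_{ijg}$ for $j = 1, \dots, n_i$. Under the posterior distribution
Figure PCTCN2022143821-appb-000056
the posterior expectation of $\gamma_{ig}$ can be expressed as:
Figure PCTCN2022143821-appb-000057
Let
Figure PCTCN2022143821-appb-000058
be the unknown density function of the prior parameters $\gamma_{ig}$,
Figure PCTCN2022143821-appb-000059
and let
Figure PCTCN2022143821-appb-000060
be the probability density function of $Z_{ijg}$ given
Figure PCTCN2022143821-appb-000061
Figure PCTCN2022143821-appb-000062
By Bayes' theorem, $E[\gamma_{ig}]$ can be expressed as:
Figure PCTCN2022143821-appb-000063
where
Figure PCTCN2022143821-appb-000064
Monte Carlo integration is used. An empirically estimated pair
Figure PCTCN2022143821-appb-000065
is used to estimate the value of $C(Z_{ig})$ and the integral in equation (3); this pair can be drawn at random from
Figure PCTCN2022143821-appb-000066
In addition, let
Figure PCTCN2022143821-appb-000067
Figure PCTCN2022143821-appb-000068
for $g'' = 1, \dots, G$. The value of $C(Z_{ig})$ can then be estimated by
Figure PCTCN2022143821-appb-000069
Figure PCTCN2022143821-appb-000070
and the integral in equation (3) can be expressed as:
Figure PCTCN2022143821-appb-000071
The same approach can be used to compute the posterior expectation of
Figure PCTCN2022143821-appb-000072
The values used for the nonparametric empirical Bayes adjustment,
Figure PCTCN2022143821-appb-000073
Figure PCTCN2022143821-appb-000074
can then be expressed as
Figure PCTCN2022143821-appb-000075
As with parametric empirical Bayes, the final corrected data are:
Figure PCTCN2022143821-appb-000076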
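Data analysis and evaluation. The evaluation falls into two categories, each with its own set of analyses: analyses restricted to QC samples, and analyses of all samples (including QC samples).
Analysis of QC samples only. During proteomics, metabolomics and lipidomics data acquisition, a QC sample is typically inserted every 10 to 20 samples. Analyzing the quality of the QC data reflects the stability of the entire acquisition. The purpose of this quality-control step is to assess bias in the raw data and to evaluate whether normalization and/or batch-effect correction has improved the data. If similarity between samples is no longer driven by technical factors and within-group reproducibility is high, the bias is considered to have been removed.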
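1. Cumulative QC-sample RSD analysis: the order of the QC samples is fixed by the mass-spectrometry injection order (order) in the sample information file, the expression data of the QC samples are extracted from the quantitative data by sample name (ID), and the RSD of each molecule is computed successively as QC samples are added (
Figure PCTCN2022143821-appb-000077
where $n$ is the number of samples, $x_i$ is the expression intensity of the molecule in the $i$-th sample, and
Figure PCTCN2022143821-appb-000078
is the mean expression intensity of the molecule over all $n$ samples). The results are visualized as box plots, with the cumulative number of QC samples on the horizontal axis and the RSD on the vertical axis. The first box summarizes the RSD of molecule expression in the first two QC samples in injection order, the second box summarizes the first three QC samples, and so on until the RSD over all QC samples has been computed. The plot shows how QC stability changes as QC samples accumulate and reveals drift or dispersion of the QC mass-spectrometry signal. Because QC samples are interspersed throughout the acquisition, and an RSD below 0.3 is generally taken to indicate good detection stability, the plot also reflects instrument stability over the entire run. Figure 4 is the cumulative QC-sample RSD box plot provided by Embodiment 3 of the present invention.
2. Cumulative QC-sample RSD percentage analysis: building on the cumulative QC-sample RSD analysis above, the percentage of molecules with an RSD below 0.3 among all molecules is computed and visualized as a bar chart, with the cumulative number of QC samples on the horizontal axis and the percentage with RSD below 0.3 on the vertical axis. The number above each bar is the exact percentage. A larger percentage gives a taller bar and indicates better reproducibility. If the percentage is high and stable across all QC samples, it can be concluded that there is no batch effect over the whole run and that data quality is good. If batch effects are determined to be the cause of a poor result, a batch column with batch information can be added to the sample information file so that the batch correction module runs automatically. Figure 5 is the cumulative QC-sample RSD percentage bar chart provided by Embodiment 3 of the present invention.
3. Intensity distribution of each QC sample: shows the expression intensities of the molecules in the QC samples. Because the expression values span several orders of magnitude, the horizontal axis is the log2-transformed quantitative value. The plot is used to assess QC-sample variance and outliers and reflects overall stability.
4. Molecule stability analysis: 30 molecules are selected at random from all QC samples. The horizontal axis shows the QC sample names and the vertical axis the log2-transformed expression value. Different colors denote different molecules, each dot represents a molecule, and the same molecule is connected by a line across QC samples. This visualizes molecule intensities and characterizes the stability of the molecules identified in the QC samples; the profile of a single molecule can be used to assess batch-related bias in expression.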
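The first two QC analyses above can be sketched in a few lines. This is an illustration under assumptions, not the implementation described in the patent: the RSD is taken as the sample standard deviation (n − 1 denominator) divided by the mean, since the patent's formula is only available as an image, and the names `cumulative_qc_rsd` and `pct_stable` and the input layout are hypothetical.

```python
from statistics import mean, stdev

def cumulative_qc_rsd(qc):
    """qc: dict mapping molecule -> intensities in QC samples, in injection order.

    Returns {k: {molecule: RSD over the first k QC samples}} for k = 2..n.
    """
    n_qc = len(next(iter(qc.values())))
    return {
        k: {m: stdev(v[:k]) / mean(v[:k]) for m, v in qc.items()}
        for k in range(2, n_qc + 1)
    }

def pct_stable(rsd_by_molecule, threshold=0.3):
    """Percentage of molecules whose RSD is below the threshold."""
    values = list(rsd_by_molecule.values())
    return 100.0 * sum(r < threshold for r in values) / len(values)

qc = {
    "m1": [100.0, 100.0, 100.0],
    "m2": [90.0, 110.0, 100.0],
    "m3": [50.0, 150.0, 100.0],
}
rsd = cumulative_qc_rsd(qc)
percentages = {k: pct_stable(v) for k, v in rsd.items()}
```

Each entry of `rsd` supplies the data for one box of the box plot, and each entry of `percentages` supplies the height of one bar in the bar chart.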
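Analysis of all samples (including QC samples).
1. PCA/t-SNE/UMAP analysis shows how samples cluster without supervision, from a linear (PCA) or nonlinear (t-SNE/UMAP) perspective. Principal component analysis (PCA) is a technique that identifies the main directions of variation, called principal components. Projecting the data onto two component axes visualizes how close the samples are to one another. Additionally coloring the samples by technical or biological factors, or highlighting replicates, helps explain what drives sample proximity. The technique is especially convenient for assessing clustering by biological and technical factors or for checking replicate similarity. When similarity between samples is no longer driven by technical factors, PCA/t-SNE/UMAP shows no clustering by batch.
2. PLS-DA or OPLS-DA analysis shows how samples cluster under a supervised model. The extracted feature variables not only summarize the information in the original variables well but also have strong explanatory power for the dependent variable. The results can be shown as a scatter plot in which each point is a sample and the color indicates the sample's group.
3. Expression intensity distribution analysis: box plots show the intensity distribution of each sample, and the mean or median sample intensity is plotted in injection order, which allows signal drift or dispersion bias during measurement to be estimated.
4. Molecule trend analysis: 50 molecules can be selected at random and their expression intensities across all samples shown as a scatter plot in two colors, one for QC samples and one for target samples, with a curve fitted to each of the two sample classes. This characterizes how stable the molecules are in the QC samples.
5. Correlation analysis: Pearson/Spearman correlation coefficients between samples are computed. This analysis compares sample reproducibility, particularly within groups; a high correlation among QC samples indicates good data quality. When correlation within and between batches is assessed, samples from the same batch being more strongly correlated than samples from unrelated batches is a clear sign of bias and indicates a batch effect. The results can be visualized with both a heat map and a violin plot. The heat map uses a color gradient according to the size of the correlation coefficient and can also reveal how a given sample deviates from the others. Because large cohorts contain many samples, a heat map is inconvenient for displaying and inspecting the data. To show the correlation between samples within a group clearly, a violin plot is used as well, with color indicating the group and larger values indicating stronger within-group correlation.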
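The correlation coefficients in item 5 can be computed without external libraries, as in the sketch below. It assumes two samples are compared over the same ordered list of molecules with no missing values and that neither sample is constant; the function names are hypothetical. Spearman's coefficient is computed as the Pearson coefficient of the ranks, with tied values receiving their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to v[order[i]]
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Applying either function to every pair of samples yields the matrix shown in the heat map; restricting the pairs to samples of the same group yields the values summarized by the violin plot.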
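Embodiment 4
Figure 6 is a schematic structural diagram of a batch correction device for omics data provided by Embodiment 4 of the present invention. As shown in Figure 6, the device includes:
a data preprocessing module 610, configured to acquire omics data of multiple batches of samples and to preprocess the omics data to obtain preprocessed data;
a batch correction module 620, configured to perform batch correction on the preprocessed data to obtain corrected data; and
a data analysis module 630, configured to perform a preset type of analysis on the corrected data to obtain a batch correction analysis result.
The technical solution of this embodiment provides a batch correction device for omics data that acquires omics data of multiple batches of samples and preprocesses the omics data to obtain preprocessed data; performs batch correction on the preprocessed data to obtain corrected data; and performs a preset type of analysis on the corrected data to obtain a batch correction analysis result. This achieves batch correction for batch-wise omics data detection and allows data quality to be evaluated more accurately and efficiently.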
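The three-module structure can be pictured as a small pipeline object. The sketch below is purely illustrative: the class name `BatchCorrectionDevice`, its callable fields and the `run` method are hypothetical, and the concrete preprocessing, correction and analysis functions are supplied by the caller. Returning the analysis of both the preprocessed and the corrected data is an assumption borrowed from the optional features of this embodiment.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class BatchCorrectionDevice:
    preprocess: Callable[[Any], Any]     # role of data preprocessing module 610
    correct: Callable[[Any, Any], Any]   # role of batch correction module 620
    analyze: Callable[[Any], Any]        # role of data analysis module 630

    def run(self, raw: Any, batches: Any) -> Dict[str, Any]:
        pre = self.preprocess(raw)
        corrected = self.correct(pre, batches)
        # Analyze both versions so the effect of the correction can be compared.
        return {
            "uncorrected": self.analyze(pre),
            "corrected": self.analyze(corrected),
        }
```

Keeping the three stages as interchangeable callables means that, for example, a parametric or a nonparametric correction function can be plugged in without touching the rest of the pipeline.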
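Optionally, the batch correction module 620 is specifically configured to:
for any type of omics data of any sample in any batch, determine correction parameters based on the batch matrix of the omics data and the initial expression value of the omics data; and
determine the corrected data based on the initial expression value of the omics data and the correction parameters.
Optionally, the batch correction module 620 is specifically configured to:
determine initial correction parameters based on the batch matrix of the omics data and the initial expression value of the omics data; and
determine the correction parameters based on the initial correction parameters and the preset distribution.
Optionally, the batch correction module 620 is specifically configured to:
determine a standardized value based on the initial expression value of the omics data and the correction parameters; and
determine the corrected data based on the standardized value and the correction parameters.
Optionally, the data analysis module 630 is specifically configured to:
perform the preset type of analysis on the preprocessed data to obtain an analysis result without batch correction, and display the analysis result without batch correction and the batch correction analysis result on a display device.
Optionally, the data analysis module 630 is specifically configured to:
compare the analysis result without batch correction with the batch correction analysis result to determine whether the omics data has a batch quality anomaly.
Optionally, the data analysis module 630 is specifically configured to:
perform the preset type of analysis on the preprocessed data obtained after each preprocessing item, obtaining at least one analysis result without batch correction.
Optionally, the data analysis module 630 is specifically configured to:
determine, in sequence according to the batch order of the test samples, the omics data groups to be analyzed, and perform the preset type of analysis on each omics data group to obtain an analysis result for each omics data group; and
determine, based on the analysis results of the omics data groups, whether the omics data has a batch quality anomaly.
Optionally, the data analysis module 630 is specifically configured to:
determine whether any of the analysis results of the omics data groups is abnormal; and
if so, determine the abnormal test sample based on the omics data group with the abnormal analysis result, and determine the batch containing the abnormal test sample as an abnormal batch.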
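One possible reading of the abnormal-batch logic is sketched below. Everything specific in it is an assumption rather than the patent's procedure: the k-th data group is taken to be the first k test (QC) samples in order, the analysis result of a group is a single score such as the percentage of molecules with RSD below 0.3, a result is abnormal when the score falls below a hypothetical `min_score` of 80, and the abnormal test sample is the one whose addition first turns the result abnormal.

```python
def abnormal_batches(qc_samples, scores, min_score=80.0):
    """Flag batches whose test sample first makes a group result abnormal.

    qc_samples: list of (sample_id, batch) in batch/injection order
    scores:     scores[k] is the result for the group made of the first
                k + 2 test samples (groups start with two samples)
    """
    flagged = []
    previous_ok = True
    for k, score in enumerate(scores):
        ok = score >= min_score
        if previous_ok and not ok:
            # The sample added last to this group triggered the anomaly.
            _sample_id, batch = qc_samples[k + 1]
            if batch not in flagged:
                flagged.append(batch)
        previous_ok = ok
    return flagged

qc_samples = [("QC1", "b1"), ("QC2", "b1"), ("QC3", "b2"), ("QC4", "b3")]
flagged = abnormal_batches(qc_samples, [95.0, 70.0, 65.0])
```

Flagging only the transition from a normal to an abnormal result avoids marking every later batch just because a cumulative score stays low once it has dropped.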
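Optionally, in the data analysis module 630:
the preset type of analysis includes: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and pooled-sample reproducibility analysis.
The batch correction device for omics data provided by the embodiments of the present invention can execute the batch correction method for omics data provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing the method.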
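Embodiment 5
Figure 7 is a schematic structural diagram of an electronic device provided by Embodiment 5 of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (such as helmets, glasses and watches) and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the invention described and/or claimed herein.
As shown in Figure 7, the electronic device 10 includes at least one processor 11 and memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processing according to a computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to one another via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or optical disc; and a communication unit 19, such as a network card, modem or wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller or microcontroller. The processor 11 performs the methods and processing described above, for example the batch correction method for omics data.
In some embodiments, the batch correction method for omics data may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the batch correction method for omics data described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the batch correction method for omics data in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and it can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device and at least one output device.
The computer program for implementing the batch correction method for omics data of the present invention may be written in any combination of one or more programming languages. The computer program may be provided to a processor of a general-purpose computer, special-purpose computer or other programmable data processing apparatus, so that when executed by the processor it causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.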
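Embodiment 6
Embodiment 6 of the present invention further provides a computer-readable storage medium storing computer instructions for causing a processor to execute a batch correction method for omics data, the method comprising:
acquiring omics data of multiple batches of samples and preprocessing the omics data to obtain preprocessed data;
performing batch correction on the preprocessed data to obtain corrected data; and
performing a preset type of analysis on the corrected data to obtain a batch correction analysis result.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus or device. A computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described here may be implemented on an electronic device having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or trackball) through which the user can provide input to the electronic device. Other kinds of devices can also be used to provide interaction with a user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory or tactile feedback), and input from the user may be received in any form (including acoustic, speech or tactile input).
The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, a data server), or a middleware component (for example, an application server), or a front-end component (for example, a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described here), or any combination of such back-end, middleware or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network and the Internet.
A computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the difficult management and weak business scalability of traditional physical hosts and VPS services.
It should be understood that the various forms of flow shown above may be used with steps reordered, added or removed. For example, the steps described in the present invention may be executed in parallel, sequentially or in a different order, as long as the desired result of the technical solution of the present invention can be achieved; no limitation is imposed here.
The specific embodiments above do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.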

Claims (13)

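  1. A batch correction method for omics data, characterized by comprising:
    acquiring omics data of multiple batches of samples and preprocessing the omics data to obtain preprocessed data;
    performing batch correction on the preprocessed data to obtain corrected data; and
    performing a preset type of analysis on the corrected data to obtain a batch correction analysis result.
  2. The method according to claim 1, characterized in that performing batch correction on the preprocessed data to obtain corrected data comprises:
    for any type of omics data of any sample in any batch, determining correction parameters based on the batch matrix of the omics data and the initial expression value of the omics data; and
    determining the corrected data based on the initial expression value of the omics data and the correction parameters.
  3. The method according to claim 2, characterized in that the correction parameters conform to a preset distribution;
    and determining the correction parameters based on the batch matrix of the omics data and the initial expression value of the omics data comprises:
    determining initial correction parameters based on the batch matrix of the omics data and the initial expression value of the omics data; and
    determining the correction parameters based on the initial correction parameters and the preset distribution.
  4. The method according to claim 2, characterized in that determining the corrected data based on the initial expression value of the omics data and the correction parameters comprises:
    determining a standardized value based on the initial expression value of the omics data and the correction parameters; and
    determining the corrected data based on the standardized value and the correction parameters.
  5. The method according to claim 1, characterized in that the method further comprises:
    performing the preset type of analysis on the preprocessed data to obtain an analysis result without batch correction, and displaying the analysis result without batch correction and the batch correction analysis result on a display device.
  6. The method according to claim 5, characterized in that the method further comprises:
    comparing the analysis result without batch correction with the batch correction analysis result to determine whether the omics data has a batch quality anomaly.
  7. The method according to claim 5, characterized in that the preprocessing includes one or more of data cleaning, data normalization and missing-value handling;
    and performing the preset type of analysis on the preprocessed data to obtain an analysis result without batch correction comprises:
    performing the preset type of analysis on the preprocessed data obtained after each preprocessing item, obtaining at least one analysis result without batch correction.
  8. The method according to claim 1, characterized in that each batch of samples includes test samples, and the omics data of the multiple batches of samples includes the omics data of the test samples in each batch;
    the method further comprising:
    determining, in sequence according to the batch order of the test samples, the omics data groups to be analyzed, and performing the preset type of analysis on each omics data group to obtain an analysis result for each omics data group; and
    determining, based on the analysis results of the omics data groups, whether the omics data has a batch quality anomaly.
  9. The method according to claim 8, characterized in that determining, based on the analysis results of the omics data groups, whether the omics data has a batch quality anomaly comprises:
    determining whether any of the analysis results of the omics data groups is abnormal; and
    if so, determining the abnormal test sample based on the omics data group with the abnormal analysis result, and determining the batch containing the abnormal test sample as an abnormal batch.
  10. The method according to claim 1, characterized in that the preset type of analysis includes: sample intensity distribution analysis, dimensionality reduction analysis, discriminant analysis, sample correlation analysis and test sample reproducibility analysis.
  11. A batch correction device for omics data, characterized by comprising:
    a data preprocessing module, configured to acquire omics data of multiple batches of samples and to preprocess the omics data to obtain preprocessed data;
    a batch correction module, configured to perform batch correction on the preprocessed data to obtain corrected data; and
    a data analysis module, configured to perform a preset type of analysis on the corrected data to obtain a batch correction analysis result.
  12. An electronic device, characterized in that the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the batch correction method for omics data according to any one of claims 1-10.
  13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the batch correction method for omics data according to any one of claims 1-10.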
PCT/CN2022/143821 2022-09-08 2022-12-30 Batch correction method and apparatus for omics data, storage medium, and electronic device WO2024051052A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211097799.1 2022-09-08
CN202211097799.1A CN115359846A (zh) 2022-09-08 2022-09-08 Batch correction method and apparatus for omics data, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2024051052A1 true WO2024051052A1 (zh) 2024-03-14

Family

ID=84006395

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143821 WO2024051052A1 (zh) 2022-09-08 2022-12-30 Batch correction method and apparatus for omics data, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN115359846A (zh)
WO (1) WO2024051052A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117910886A (zh) * 2024-03-19 2024-04-19 宝鸡核力材料科技有限公司 Intelligent analysis method and system for smelting effect applied to titanium alloy smelting

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
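CN115359846A (zh) * 2022-09-08 2022-11-18 上海氨探生物科技有限公司 Batch correction method and apparatus for omics data, storage medium, and electronic device
WO2024108592A1 (zh) * 2022-11-25 2024-05-30 深圳先进技术研究院 Omics data processing method and apparatus, and computer device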

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130994A1 (en) * 2016-04-11 2019-05-02 Discerndx, Inc. Mass Spectrometric Data Analysis Workflow
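CN111796095A (zh) * 2019-04-09 2020-10-20 苏州扇贝生物科技有限公司 Proteome mass spectrometry data processing method and device
CN113588847A (zh) * 2021-09-26 2021-11-02 萱闱(北京)生物科技有限公司 Biological metabolomics data processing method, analysis method, device and application
CN114705766A (zh) * 2022-01-29 2022-07-05 中央民族大学 Large-scale omics data correction method and system based on IS combined with SVR
CN115359846A (zh) * 2022-09-08 2022-11-18 上海氨探生物科技有限公司 Batch correction method and apparatus for omics data, storage medium, and electronic device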

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
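CN117910886A (zh) * 2024-03-19 2024-04-19 宝鸡核力材料科技有限公司 Intelligent analysis method and system for smelting effect applied to titanium alloy smelting
CN117910886B (zh) * 2024-03-19 2024-05-28 宝鸡核力材料科技有限公司 Intelligent analysis method and system for smelting effect applied to titanium alloy smelting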

Also Published As

Publication number Publication date
CN115359846A (zh) 2022-11-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22958008

Country of ref document: EP

Kind code of ref document: A1