WO2015188395A1 - Big data oriented metabolome feature data analysis method and system thereof - Google Patents

Info

Publication number
WO2015188395A1
WO2015188395A1 PCT/CN2014/080283 CN2014080283W WO2015188395A1 WO 2015188395 A1 WO2015188395 A1 WO 2015188395A1 CN 2014080283 W CN2014080283 W CN 2014080283W WO 2015188395 A1 WO2015188395 A1 WO 2015188395A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
metabolome
vector
feature
sub
Prior art date
Application number
PCT/CN2014/080283
Other languages
French (fr)
Chinese (zh)
Inventor
周家锐
华韵之
纪震
朱泽轩
曾启明
Original Assignee
周家锐
华韵之
纪震
朱泽轩
曾启明
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 周家锐, 华韵之, 纪震, 朱泽轩, 曾启明
Publication of WO2015188395A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B99/00: Subject matter not provided for in other groups of this subclass

Definitions

  • The present invention relates to the field of bioinformatics, and in particular to a big-data-oriented method and system for analyzing metabolome feature data.
  • Metabolites are the small-molecule organic compounds that carry out metabolic processes in living organisms, and they contain a wealth of physiological state information. Metabolomics is the holistic, systematic study of metabolites and can effectively reveal the biochemical mechanisms behind metabolic phenomena. Compared with traditional research methods, metabolomics is thought to give a more comprehensive picture of the true state of a living organism. It has therefore received growing attention and is widely used in many research and practical fields.
  • The signal data obtained by collecting and detecting metabolites, called metabolome feature data, is the basic object of metabolomics research. It is usually analyzed with machine learning methods to mine the physiological state information it contains.
  • The prior art generally analyzes metabolome feature data with machine learning algorithms based on feature selection, which mainly comprise two parts. (1) Feature selection reduces the dimensionality of the input data to identify the important feature signals and their corresponding metabolites and to eliminate irrelevant noise, thereby improving the performance of the prediction algorithm.
  • Commonly used feature selection methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Minimum Redundancy Maximum Relevance (mRMR).
  • (2) A classification/regression algorithm performs predictive learning on the dimensionality-reduced data and estimates the physiological outcome that the input features may produce, in order to guide follow-up medical and scientific work.
  • Commonly used classification/regression algorithms include k-Nearest Neighbor (k-NN), Linear Regression, Logistic Regression, and Support Vector Machine (SVM).
  • k-NN: k-Nearest Neighbor
  • SVM: Support Vector Machine
  • Metabolome feature data is generally large in scale, high in feature dimension, and noisy, and the relationship between the feature signals and the target state is nonlinear.
  • The conventional methods above therefore often fail to obtain satisfactory learning results within a reasonable computing time.
  • Feature weighting is the generalized form of feature selection in which a weight can take any real value in the range [0, 1]. Compared with feature selection, feature weighting is better suited to the analysis of metabolome feature data. First, existing research shows that feature weighting improves prediction more than feature selection does, so the resulting system can estimate the target physiological state more accurately. Second, the weights are continuous values, which describe the specific correlation between each metabolite signal and the target state more accurately; this information is of great value for subsequent research. However, metabolome feature data is large in scale and high in dimension, and its feature weighting is a complex large-scale multimodal optimization problem that is difficult to handle with traditional mathematical methods. Its practical application is therefore severely limited.
  • The main drawbacks of existing machine learning algorithms for metabolome feature data are as follows. First, the weights in feature selection can take only the two discrete values {0, 1}, so they cannot describe differences in the importance of metabolite signals precisely. For example, if two metabolites both affect the target physiological state, but to different degrees, the weights of their corresponding signals should also differ: the metabolite signal with the greater influence should have the larger weight, and vice versa. Feature selection can assign only a weight of 0 or 1 and can hardly describe such differences, which leads to the loss of important biological information.
  • Second, setting the weights in feature weighting algorithms is difficult, and effective solutions are currently lacking. For feature weighting on big data in particular, existing algorithms can hardly process the problem effectively and can only solve it approximately, which seriously degrades analysis performance.
  • The object of the present invention is to provide a big-data-oriented metabolome feature data analysis method and system, aiming to solve the problem that current data analysis methods cannot analyze metabolome big data quickly and effectively.
  • A big-data-oriented method for analyzing metabolome feature data, comprising the following steps:
  • MapReduce (map and reduce) framework
  • $F_n = [f_{n,1}, f_{n,2}, \ldots, f_{n,D}]$ is the $n$-th feature vector
  • $N$ is the data set size
  • $D$ is the total dimension of the feature vectors
  • Step A is specifically:
  • A1. Read the iteration counter $k$ and check its value.
  • When $k = 0$, construct the $D$-dimensional weight vector $W_k$, with its values initialized to random values in the range [0, 1].
  • When $k > 0$, the output weights of the previous iteration are taken as the initial value of the current weight vector, i.e. $W_k = W_{k-1}$.
  • Before step A1, the method also includes: initialize the iteration counter $k = 0$.
  • Step B is specifically:
  • In the big-data-oriented metabolome feature data analysis method, the computational intelligence method comprises differential evolution, particle swarm optimization, or a memetic algorithm.
  • In the big-data-oriented metabolome feature data analysis method, computing the fitness function value of each optimizing individual in the evolutionary population $ps$ in step B3 is specifically:
  • The candidate solution vector $X_i$ is used as the sub-weight vector $W_{k,m}$.
  • The weighted sub-feature vector set $F_m^* = \{F_{m,1}^*, F_{m,2}^*, \ldots, F_{m,N}^*\}$ is used to train the machine learning classification/regression algorithm to obtain the prediction accuracy of the classification/regression algorithm;
  • A big-data-oriented metabolome feature data analysis system, comprising: a data segmentation module, configured to receive input metabolome feature data, divide it into a plurality of data blocks, and map the data blocks to the compute nodes in the MapReduce framework;
  • a heuristic weighting module, configured to use the computational intelligence method to optimize, simultaneously, the weights on the data blocks produced by the data segmentation module;
  • a weight fusion module, configured to merge the optimized weights of the data blocks into the weights of the overall metabolome feature data and output them.
  • The present invention provides a big-data-oriented metabolome feature data analysis method and system, designed for the characteristics of metabolome feature big data.
  • It is a parallel weighted analysis system based on the MapReduce framework. On the one hand, the system's data block processing mechanism reduces the difficulty of weighted analysis and effectively improves prediction accuracy.
  • On the other hand, the parallel structure means that the system can be deployed on multiple compute nodes (such as multiple computers) for simultaneous processing, which can significantly reduce the overall computing time.
  • The MapReduce framework can schedule, adjust, and balance the compute nodes to ensure system efficiency and stability.
  • The computational intelligence algorithm applied in this system can effectively solve complex large-scale optimization problems.
  • FIG. 3 is a schematic diagram showing the working principle of the big-data-oriented metabolome feature data analysis system of the present invention.
  • FIG. 4 is a schematic diagram of the data segmentation process performed in step S100 of FIG. 1.
  • FIG. 5 is a schematic diagram of the weight optimization process applied to the data blocks in step S200 of FIG. 1.
  • FIG. 6 is a schematic diagram of the reduce process applied to the optimized weights in step S300 of FIG. 1.
  • The present invention provides a big-data-oriented metabolome feature data analysis method and system. To make the objects, technical solutions, and effects of the present invention clearer, the invention is described in further detail below. The specific embodiments described herein merely illustrate the invention and are not intended to limit it.
  • A big-data-oriented metabolome feature data analysis method, as shown in FIG. 1, wherein the method comprises the following steps:
  • The present invention further provides a big-data-oriented metabolome feature data analysis system, as shown in FIG. 2, which includes:
  • a data segmentation module 100, configured to receive the input metabolome feature data, divide it into a plurality of data blocks, and map the data blocks to the compute nodes in the MapReduce framework.
  • a heuristic weighting module 200, configured to use the computational intelligence method to optimize, simultaneously, the weights on the data blocks produced by the data segmentation module.
  • a weight fusion module 300, configured to merge the optimized weights of the data blocks into the weights of the overall metabolome feature data and output them.
  • The working principle of the big-data-oriented metabolome feature data analysis system of the present invention is shown in FIG. 3:
  • The data segmentation module divides the data. After being input to the data segmentation module, the data is divided into data blocks $B_1$, $B_2$, ..., $B_M$.
  • The data blocks are mapped to the compute nodes in the MapReduce framework, that is, to the heuristic weighting modules.
  • The heuristic weighting modules optimize the weights.
  • The data block weights optimized by each heuristic weighting module are sent to the weight fusion module.
  • The weight fusion module reduces the optimized weights.
  • Determine whether the iteration is complete: if not, return to step S2; if so, execute step S6.
  • The data segmentation process in step S100 is shown in FIG. 4, and the specific steps are as follows:
  • The partitioned data block set $\mathbb{B}$ is mapped to the compute nodes in the MapReduce framework.
  • Common MapReduce frameworks include Hadoop and Nokia Disco.
  • Step S200 performs the weight optimization process on the data blocks as shown in FIG. 5:
  • The computational intelligence method is used to optimize the evolutionary population $ps$.
  • Common algorithms include Differential Evolution (DE), Particle Swarm Optimization (PSO), and the Memetic Algorithm (MA).
  • The candidate solution $X_{best}$ of the best individual in the population is taken as the best sub-weight vector obtained by the optimization.
  • Step (4) further includes:
  • The candidate solution vector $X_i$ is used as the sub-weight vector $W_{k,m}$.
  • B) $W_{k,m}$ is multiplied with each sub-feature vector $F_{m,n}$ in $F_m$ to weight it; if any weight $w_l$ in $W_{k,m}$ is less than the preset threshold, the corresponding metabolic feature signal on that dimension is deleted, achieving dimensionality reduction and finally forming the weighted sub-feature vector $F_{m,n}^*$.
  • The weighted sub-feature vector set $F_m^* = \{F_{m,1}^*, F_{m,2}^*, \ldots, F_{m,N}^*\}$ is used to train the machine learning classification/regression algorithm to obtain the prediction accuracy of the classification/regression algorithm.
  • Algorithms such as kernel-method-based support vector machines and the Extreme Learning Machine (ELM) are generally used.
  • The prediction accuracy of the classification/regression algorithm is taken as the fitness function value $f(X_i)$ of the current individual $X_i$.
  • For a classification algorithm, the accuracy is represented by the classification error rate; for a regression algorithm, by the Root Mean Square Error (RMSE).
  • Step S300 performs the reduce process on the optimized weights as shown in FIG. 6, specifically:
  • Update the iteration counter $k = k + 1$ and determine whether $k$ is less than the total number of iterations; if so, jump to sub-step (2) of step S100; if not, execute step (8).
  • The system of the invention has the following advantages:
  • The system is a parallel weighted analysis system based on the MapReduce framework, designed for the characteristics of metabolome feature big data.
  • Data block processing reduces the difficulty of weighted analysis and effectively improves prediction accuracy.
  • The parallel architecture means that the system can be deployed on multiple compute nodes (such as multiple computers) for simultaneous processing, significantly reducing the overall computing time.
  • The MapReduce framework can schedule, adjust, and balance the compute nodes to ensure system efficiency and stability.
  • Computational intelligence algorithms can effectively solve complex large-scale optimization problems. Introducing them into each heuristic weighting module to optimize the sub-weight vectors yields better analysis results.
  • The experimental data shows that the weighting design method based on computational intelligence achieves better prediction accuracy than other existing feature weighting and feature selection algorithms. A more effective estimate of the target physiological state can better guide subsequent biological and medical applications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

A big-data-oriented metabolome feature data analysis method and system, the method comprising: A. receiving input metabolome feature data, dividing it into a plurality of data blocks, and mapping the data blocks to respective compute nodes in a MapReduce framework; B. optimizing the weights on the data blocks using a computational intelligence method; C. merging the optimized weights of the data blocks into the weights of the overall metabolome feature data and outputting them. The data block processing mechanism of the system reduces the difficulty of weighted analysis and effectively improves prediction accuracy. In addition, the parallel structure allows the system to be deployed on a plurality of compute nodes, significantly reducing computing time while ensuring the efficiency and stability of the system. The computational intelligence algorithm used in the system can effectively solve complex large-scale optimization problems, providing better prediction accuracy and a more effective estimate of the target physiological state.

Description

A Big-Data-Oriented Metabolome Feature Data Analysis Method and System
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a big-data-oriented method and system for analyzing metabolome feature data.
Background
Metabolites are the small-molecule organic compounds that carry out metabolic processes in living organisms, and they contain a wealth of physiological state information. Metabolomics is the holistic, systematic study of metabolites and can effectively reveal the biochemical mechanisms behind metabolic phenomena. Compared with traditional research methods, metabolomics is thought to give a more comprehensive picture of the true state of a living organism. It has therefore received growing attention and is widely used in many research and practical fields.
The signal data obtained by collecting and detecting metabolites, called metabolome feature data, is the basic object of metabolomics research. It is usually analyzed with machine learning methods to mine the physiological state information it contains. The prior art generally analyzes metabolome feature data with machine learning algorithms based on feature selection, which mainly comprise two parts:
(1) Feature selection reduces the dimensionality of the input data to identify the important feature signals and their corresponding metabolites and to eliminate irrelevant noise, thereby improving the performance of the prediction algorithm. Commonly used feature selection methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Minimum Redundancy Maximum Relevance (mRMR) selection.
(2) A classification/regression algorithm performs predictive learning on the dimensionality-reduced data and estimates the physiological outcome that the input features may produce, in order to guide follow-up medical and scientific work. Commonly used classification/regression algorithms include k-Nearest Neighbor (k-NN), Linear Regression, Logistic Regression, and Support Vector Machine (SVM).
However, metabolome feature data is generally large in scale, high in feature dimension, and noisy, and the relationship between the feature signals and the target state is nonlinear. The conventional methods above therefore often fail to obtain satisfactory learning results within a reasonable computing time.
Feature weighting is the generalized form of feature selection in which a weight can take any real value in the range [0, 1]. Compared with feature selection, feature weighting is better suited to the analysis of metabolome feature data. First, existing research shows that feature weighting improves prediction more than feature selection does, so the resulting system can estimate the target physiological state more accurately. Second, the weights are continuous values, which describe the specific correlation between each metabolite signal and the target state more accurately; this information is of great value for subsequent research. However, metabolome feature data is large in scale and high in dimension, and its feature weighting is a complex large-scale multimodal optimization problem that is difficult to handle with traditional mathematical methods. Its practical application is therefore severely limited.
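The contrast between feature selection and feature weighting can be illustrated with a small sketch. It is only a toy illustration: the matrix `F`, the mask, and the weight values are made-up numbers, not data from the patent.

```python
import numpy as np

# Toy feature matrix: 3 samples (rows) x 4 metabolite signals (columns).
F = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 0.0, 1.0, 3.0],
              [0.5, 1.5, 2.5, 3.5]])

# Feature selection: each weight is either 0 (drop) or 1 (keep).
selection_mask = np.array([1.0, 0.0, 1.0, 1.0])

# Feature weighting: each weight is any real value in [0, 1].
weights = np.array([0.9, 0.0, 0.4, 0.7])

selected = F * selection_mask  # kept columns pass through unchanged
weighted = F * weights         # kept columns are scaled by their importance
```

Both schemes suppress the second signal, but only the weighted version records that the first signal matters more than the third.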
The main drawbacks of existing machine learning algorithms for metabolome feature data are as follows.
First, the weights in feature selection can take only the two discrete values {0, 1}, so they cannot describe differences in the importance of metabolite signals precisely. For example, if two metabolites both affect the target physiological state, but to different degrees, the weights of their corresponding signals should also differ: the metabolite signal with the greater influence should have the larger weight, and vice versa. Feature selection can assign only a weight of 0 or 1 and can hardly describe such differences, which leads to the loss of important biological information.
Second, setting the weights in feature weighting algorithms is difficult, and effective solutions are currently lacking. For feature weighting on big data in particular, existing algorithms can hardly process the problem effectively and can only solve it approximately, which seriously degrades analysis performance.
Third, existing machine learning techniques are designed mainly for small-scale data and do not consider the big-data case of metabolome features. As a result, when facing huge data sets, the performance of classification/regression algorithms drops significantly and computing time grows exponentially. In addition, existing algorithms have high computational complexity and architectures that are hard to parallelize, so metabolome big data cannot be analyzed effectively within a reasonable time.
Therefore, the prior art still needs to be improved and developed.
Summary of the Invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a big-data-oriented metabolome feature data analysis method and system, aiming to solve the problem that current data analysis methods cannot analyze metabolome big data quickly and effectively.
The technical solution of the present invention is as follows:
A big-data-oriented metabolome feature data analysis method, wherein the method comprises the following steps:
A. Receive the input metabolome feature data, divide it into a plurality of data blocks, and map the data blocks to the compute nodes in the MapReduce framework;
B. Use a computational intelligence method to optimize the weights on the data blocks simultaneously;
C. Merge the optimized weights of the data blocks into the weights of the overall metabolome feature data and output them.
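The merge in step C can be pictured as a scatter operation. The sketch below is a minimal illustration, assuming each block's result arrives as a pair of (index vector, sub-weight vector); the function name `merge_block_weights` and the 0-based indices are illustrative choices, not taken from the patent.

```python
import numpy as np

def merge_block_weights(pairs, total_dim):
    """Scatter each block's sub-weights back to their original dimensions."""
    merged = np.zeros(total_dim)  # start from an all-zero weight vector
    for index_vector, sub_weights in pairs:
        # Position l of a block belongs to original dimension index_vector[l].
        merged[np.asarray(index_vector)] = sub_weights
    return merged

# Two blocks covering dimensions {2, 0} and {3, 1} of a 4-dimensional problem.
pairs = [([2, 0], [0.8, 0.1]), ([3, 1], [0.5, 0.3])]
overall = merge_block_weights(pairs, total_dim=4)
```

Because the blocks partition the dimensions without overlap, the order in which the pairs are merged does not affect the result.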
In the big-data-oriented metabolome feature data analysis method, the metabolome feature data is represented as the metabolome feature data set $F = \{F_1, F_2, \ldots, F_N\}$, where $F_n = [f_{n,1}, f_{n,2}, \ldots, f_{n,D}]$ is the $n$-th feature vector, $N$ is the data set size, and $D$ is the total dimension of the feature vectors; the number of data blocks is $M$, each data block contains $L = D/M$ elements, and the total number of system iterations is set in advance.
In the big-data-oriented metabolome feature data analysis method, step A is specifically:
A1. Read the iteration counter $k$ and check its value. When $k = 0$, construct the $D$-dimensional weight vector $W_k$, with its values initialized to random values in the range [0, 1]. When $k > 0$, take the output weights of the previous iteration as the initial value of the current weight vector, i.e. $W_k = W_{k-1}$.
A2. Construct the data block set $\mathbb{B} = \{B_1 = \emptyset, B_2 = \emptyset, \ldots, B_M = \emptyset\}$ containing $M$ empty sets, and the index vector $\mathbb{D} = [1, 2, 3, \ldots, D]$ containing all index values, and initialize the data block counter $m = 0$.
A3. Construct the sub-index vector $I_m = \emptyset$, the sub-weight vector $W_{k,m} = \emptyset$, and the sub-feature vector set $F_m = \{F_{m,1}, F_{m,2}, \ldots, F_{m,N}\}$, where every sub-feature vector $F_{m,n} = \emptyset$, and initialize the in-block counter $l = 0$.
A4. Randomly select an index value $d$ from the index vector $\mathbb{D}$, add it to the sub-index vector $I_m$, and remove $d$ from $\mathbb{D}$. Add the weight $w_d$ of the weight vector $W_k$ on dimension $d$ to the sub-weight vector $W_{k,m}$. Take each feature vector $F_n$ of the metabolome feature data set $F$ in turn and add its feature signal value $f_{n,d}$ on dimension $d$ to the $n$-th sub-feature vector $F_{m,n}$ of $F_m$.
A5. Update the in-block counter $l = l + 1$ and determine whether $l$ is less than $L$; if so, jump to step A2; if not, execute step A6.
A6. Add the current data block as $B_m = \{I_m, W_{k,m}, F_m\}$ and update the data block counter $m = m + 1$. Determine whether $m$ is less than $M$; if so, jump to step A1; if not, execute step A7.
A7. Map the partitioned data block set $\mathbb{B}$ to the compute nodes in the MapReduce framework.
In the big-data-oriented metabolome feature data analysis method, before step A1 the method further includes: initialize the iteration counter $k = 0$.
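The partitioning in step A can be sketched as follows. This is a hedged reading of the loops (an inner loop that fills one block with $L$ randomly drawn indices, repeated for $M$ blocks), implemented with a single random permutation, which draws indices without replacement in the same way. The function name `split_into_blocks`, the 0-based indices, and the requirement that $D$ be divisible by $M$ are assumptions of this sketch.

```python
import numpy as np

def split_into_blocks(F, weights, num_blocks, rng):
    """Randomly partition the D feature dimensions into num_blocks blocks.

    F is an N x D matrix (one feature vector per row); weights has length D.
    Returns a list of (sub_index_vector, sub_weight_vector, sub_features).
    """
    n_samples, total_dim = F.shape
    if total_dim % num_blocks != 0:
        raise ValueError("this sketch assumes D is divisible by M")
    block_len = total_dim // num_blocks      # L = D / M
    shuffled = rng.permutation(total_dim)    # random draw without replacement
    blocks = []
    for m in range(num_blocks):
        idx = shuffled[m * block_len:(m + 1) * block_len]
        # Each block carries its indices, its slice of the weights,
        # and the matching columns of every feature vector.
        blocks.append((idx, weights[idx], F[:, idx]))
    return blocks

rng = np.random.default_rng(0)
F = rng.random((5, 6))   # N = 5 samples, D = 6 dimensions
w0 = rng.random(6)       # initial weights in [0, 1)
blocks = split_into_blocks(F, w0, num_blocks=3, rng=rng)
```

Each tuple in `blocks` plays the role of one data block $B_m$ and could be handed to a separate worker.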
In the big-data-oriented metabolome feature data analysis method, step B is specifically:
B1. For the data block $B_m = \{I_m, W_{k,m}, F_m\}$, construct the evolutionary population $ps$ of the computational intelligence method, in which the candidate solution of each optimizing individual is an $L$-dimensional vector $X_i$, where $i = 1, 2, \ldots, |ps|$, initialized to $X_i = W_{k,m}$;
B2. Set the maximum number of iterations of the computational intelligence method and initialize the iteration counter $g = 0$;
B3. Compute the fitness function value of each optimizing individual in the evolutionary population $ps$ and, according to these values, optimize the evolutionary population $ps$ with the computational intelligence method;
B4. Update the iteration counter $g = g + 1$ and determine whether $g$ is less than the maximum number of iterations; if so, jump to step B3; if not, execute step B5;
B5. Take the candidate solution $X_{best}$ of the best individual in the population as the best sub-weight vector obtained by the optimization, i.e. $W_{k,m} = X_{best} = \arg\min_{X_i \in ps} f(X_i)$;
B6. Form the key-value pair $P_m = \langle I_m, W_{k,m} \rangle$ from the sub-index vector $I_m$ and the sub-weight vector $W_{k,m}$, as the output of the map phase in the MapReduce framework.
In the big-data-oriented metabolome feature data analysis method, the computational intelligence method comprises differential evolution, particle swarm optimization, or a memetic algorithm.
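As one possible instance of the optimization loop in step B, the sketch below uses a basic differential evolution scheme (DE/rand/1/bin). It is not the patent's implementation: the function name, the population size, the scale and crossover rates, and the small random jitter added around the initial weights (so that difference vectors are non-zero) are all assumptions of this sketch.

```python
import numpy as np

def differential_evolution(fitness, init, pop_size=20, max_iter=50,
                           scale=0.5, cross_rate=0.9, seed=0):
    """Minimize fitness over [0, 1]^dim, starting near the vector init."""
    rng = np.random.default_rng(seed)
    dim = init.size
    # Assumption: jitter the population around init; keep init as individual 0.
    pop = np.clip(init + rng.normal(0.0, 0.1, (pop_size, dim)), 0.0, 1.0)
    pop[0] = init
    scores = np.array([fitness(x) for x in pop])
    for _ in range(max_iter):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = np.clip(a + scale * (b - c), 0.0, 1.0)
            cross = rng.random(dim) < cross_rate
            cross[rng.integers(dim)] = True  # take at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            trial_score = fitness(trial)
            if trial_score <= scores[i]:     # greedy one-to-one selection
                pop[i], scores[i] = trial, trial_score
    best = int(np.argmin(scores))
    return pop[best], scores[best]

# Toy fitness: squared distance to a made-up target weight vector.
target = np.array([0.2, 0.8, 0.5])
best_w, best_score = differential_evolution(
    lambda x: float(np.sum((x - target) ** 2)), init=np.full(3, 0.5))
```

Because selection is greedy and the initial vector stays in the population, the returned score is never worse than the fitness of the starting weights.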
In the big-data-oriented metabolome feature data analysis method, computing the fitness function value of each search individual in the evolutionary population $ps$ in step B3 specifically comprises:
B31. For the $i$-th input search individual, take its candidate solution vector $X_i$ as the sub-weight vector $W_m$;
B32. Multiply $W_m$ with each sub-feature vector $F_{m,n}$ in $\mathbb{F}_m$ to weight it; whenever a weight $w_l$ in $W_m$ is less than a preset threshold $\delta$, delete the corresponding metabolic feature signal $f_l$ in that dimension to reduce the dimensionality, finally forming the weighted sub-feature vector $F^*_{m,n}$:
$$F^*_{m,n} = F_{m,n} \otimes W_m = [\, f_l \cdot w_l \mid f_l \in F_{m,n},\ w_l \in W_m,\ w_l > \delta \,]$$
B33. Use the weighted sub-feature vector set $\mathbb{F}^*_m = \{F^*_{m,1}, F^*_{m,2}, \ldots, F^*_{m,N}\}$ to train a machine learning classification/regression algorithm and obtain the prediction accuracy of that algorithm;
B34. Take the prediction accuracy of the classification/regression algorithm as the fitness function value $f(X_i)$ of the current individual $X_i$.
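Steps B31 to B34 can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: the function names, the default threshold of 0.1, and the use of a leave-one-out 1-nearest-neighbour classifier in place of the learner are all hypothetical choices made here to keep the example self-contained. The fitness is returned as an error rate, so lower is better, which matches the arg min in step B5.

```python
def weight_and_prune(features, weights, threshold):
    """B32: weight every dimension, dropping those whose weight is below the threshold."""
    keep = [l for l, w in enumerate(weights) if not w < threshold]
    return [[row[l] * weights[l] for l in keep] for row in features]


def loo_1nn_error(rows, labels):
    """B33 stand-in: leave-one-out error rate of a 1-nearest-neighbour classifier."""
    if len(rows) < 2 or not rows[0]:
        return 1.0  # nothing to compare, or every dimension was pruned: worst case
    errors = 0
    for i, x in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, rows[j])),
        )
        errors += labels[nearest] != labels[i]
    return errors / len(rows)


def fitness(candidate, sub_features, labels, threshold=0.1):
    """B31 and B34: treat the candidate as the sub-weight vector and score it."""
    weighted = weight_and_prune(sub_features, candidate, threshold)
    return loo_1nn_error(weighted, labels)


# Toy block: dimension 0 separates the classes, dimension 1 is misleading.
toy_features = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.1], [1.1, -5.1]]
toy_labels = [0, 0, 1, 1]
good = fitness([1.0, 0.0], toy_features, toy_labels)
bad = fitness([0.0, 1.0], toy_features, toy_labels)
```

In this toy block, a candidate that keeps only the informative dimension scores a lower error than one that keeps only the misleading dimension, which is the signal the optimizer uses to rank individuals.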
In the big-data-oriented metabolome feature data analysis method, step C specifically comprises:
C1. Collect all $M$ output key-value pairs to form the key-value pair set $\mathbb{P} = \{P_1, P_2, \ldots, P_M\}$ and apply reduce processing to it;
C2. Construct an all-zero $D$-dimensional weight vector $W_k = [0, 0, \ldots, 0]$ and initialize the data block counter $m = 0$;
C3. Take the $m$-th key-value pair $P_m = \langle I_m, W_{k,m} \rangle$ from the set $\mathbb{P}$ and initialize the in-block counter $l = 0$;
C4. Add the weight in the $l$-th dimension of the sub-weight vector $W_{k,m}$ to the $d$-th dimension of the weight vector $W_k$, that is, $W_k = \{ w_d = W_{k,m}[l] \mid d = I_m[l] \},\ l = 1, 2, \ldots, L$;
C5. Update the in-block counter $l = l + 1$ and check whether $l$ is less than $L$; if so, return to step C4; otherwise, proceed to step C6;
C6. Update the data block counter $m = m + 1$ and check whether $m$ is less than $M$; if so, return to step C3; otherwise, proceed to step C7;
C7. Update the iteration counter $k = k + 1$ and check whether $k$ is less than $K$; if so, return to step A; otherwise, proceed to step C8;
C8. Weight the input metabolome feature data set $\mathbb{F}$ with the final weight vector $W_k$.
In the big-data-oriented metabolome feature data analysis method, the input metabolome feature data set $\mathbb{F}$ is weighted with the final weight vector $W_k$ and then used to train a machine learning algorithm to obtain the overall classification/regression prediction accuracy; the weight vector $W_k$ and the classification/regression prediction accuracy are output as the result.
A big-data-oriented metabolome feature data analysis system, wherein the system comprises:
a data segmentation module, configured to receive the input metabolome feature data, split it into multiple data blocks, and map the data blocks to the computing nodes of the MapReduce framework;
a heuristic weighting module, configured to optimize, simultaneously and using a computational intelligence method, the weights on the data blocks produced by the data segmentation module;
a weight fusion module, configured to merge the optimized weights of the data blocks into the weights of the whole metabolome feature data and output them.
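The reduce procedure of steps C1 to C8, which the weight fusion module carries out, can be sketched in Python as follows. This is a hedged illustration only: `merge_weights` and `apply_weights` are hypothetical names, indices are 0-based here rather than 1-based, and each key-value pair is modelled as a plain `(sub_index, sub_weights)` tuple.

```python
def merge_weights(pairs, dimension):
    """Scatter every block's optimized sub-weights back into one D-dimensional vector."""
    merged = [0.0] * dimension                    # C2: all-zero weight vector
    for sub_index, sub_weights in pairs:          # C3 and C6: loop over key-value pairs
        for d, w in zip(sub_index, sub_weights):  # C4 and C5: loop inside one block
            merged[d] = w                         # w_d = W_{k,m}[l] with d = I_m[l]
    return merged


def apply_weights(features, weights):
    """C8: weight every feature vector of the data set with the merged vector."""
    return [[f * w for f, w in zip(row, weights)] for row in features]


# Two blocks covering four of six dimensions.
pairs = [([3, 0], [0.75, 0.25]), ([1, 4], [0.5, 1.0])]
merged = merge_weights(pairs, 6)
weighted = apply_weights([[4.0, 2.0, 3.0, 8.0, 5.0, 6.0]], merged)
```

One consequence visible in this sketch: any dimension that no block covers keeps its initial zero weight. Under the steps as described, that can happen when the number of dimensions is not an exact multiple of the number of blocks.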
Beneficial effects: The present invention provides a big-data-oriented metabolome feature data analysis method and system. The system is a parallel weighting analysis system built on the MapReduce framework and designed for the characteristics of big metabolome feature data. On the one hand, the block-wise data processing mechanism reduces the difficulty of the weighting analysis and effectively improves prediction accuracy. On the other hand, the parallel structure means that the system can be deployed to multiple computing nodes (for example, multiple computers) for simultaneous processing, which significantly reduces the overall computation time. In addition, the MapReduce framework can schedule, regulate, and balance the computing nodes, ensuring the efficiency and stability of the system. Furthermore, the computational intelligence algorithms used by the system can effectively solve complex large-scale optimization problems, and introducing them into the heuristic weighting modules yields better analysis results. The prediction accuracy is better than that of other existing feature weighting and feature selection algorithms, allowing a more effective estimate of the target physiological state.
Brief description of the drawings
Fig. 1 is a flow chart of the big-data-oriented metabolome feature data analysis method of the present invention.
Fig. 2 is a schematic block diagram of the big-data-oriented metabolome feature data analysis system of the present invention.
Fig. 3 is a diagram of the working principle of the big-data-oriented metabolome feature data analysis system of the present invention.
Fig. 4 is a schematic diagram of the data segmentation process of step S100 in Fig. 1.
Fig. 5 is a schematic diagram of the data block weight optimization process of step S200 in Fig. 1.
Fig. 6 is a schematic diagram of the reduce processing of the optimized weights in step S300 in Fig. 1.
Detailed description
The present invention provides a big-data-oriented metabolome feature data analysis method and system. To make the objectives, technical solutions, and effects of the invention clearer, the invention is described in further detail below. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Fig. 1, a big-data-oriented metabolome feature data analysis method comprises the following steps:
S100. Receive the input metabolome feature data, split it into multiple data blocks, and map the data blocks to the computing nodes of the MapReduce framework. Here, let the input metabolome feature data be the metabolome feature data set $\mathbb{F} = \{F_1, F_2, \ldots, F_N\}$, where $F_n = [f_1, f_2, \ldots, f_D]$ is the $n$-th feature vector, $N$ is the data set size, and $D$ is the total dimension of the feature vectors; the number of data blocks is $M$, each data block contains $L = \lfloor D/M \rfloor$ elements, and the total number of system iterations is set to $K$.
S200. Optimize the weights on the multiple data blocks simultaneously using a computational intelligence method.
S300. Merge the optimized weights of the data blocks into the weights of the whole metabolome feature data and output them.
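As a rough illustration of the partitioning in step S100, the Python sketch below draws feature indices at random without replacement and builds one block per computing node, each block holding a sub-index vector, the matching sub-weights, and the matching feature columns. The function name `partition`, the tuple layout, the 0-based indices, and the optional `rng` argument are assumptions made for this example.

```python
import random


def partition(features, weights, block_count, rng=None):
    """Split D feature dimensions into block_count blocks of D // block_count indices."""
    rng = rng or random.Random()
    dimension = len(weights)
    block_size = dimension // block_count      # elements per block, floor(D / M)
    remaining = list(range(dimension))         # index vector holding every index
    blocks = []
    for _ in range(block_count):
        sub_index, sub_weights = [], []
        sub_features = [[] for _ in features]  # one sub-feature vector per sample
        for _ in range(block_size):
            # Draw an index at random and remove it so no dimension is reused.
            d = remaining.pop(rng.randrange(len(remaining)))
            sub_index.append(d)
            sub_weights.append(weights[d])
            for row, sub_row in zip(features, sub_features):
                sub_row.append(row[d])
        blocks.append((sub_index, sub_weights, sub_features))
    return blocks


# Two samples with seven features each, split into three blocks of two indices.
features = [[10 * n + d for d in range(7)] for n in range(2)]
weights = [d / 10 for d in range(7)]
blocks = partition(features, weights, 3, random.Random(0))
```

With seven dimensions and three blocks, each block receives two indices and one index is left unassigned, which follows directly from the floor in the block size.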
Based on the above method, the present invention further provides a big-data-oriented metabolome feature data analysis system which, as shown in Fig. 2, comprises:
a data segmentation module 100, configured to receive the input metabolome feature data, split it into multiple data blocks, and map the data blocks to the computing nodes of the MapReduce framework;
a heuristic weighting module 200, configured to optimize, simultaneously and using a computational intelligence method, the weights on the data blocks produced by the data segmentation module;
a weight fusion module 300, configured to merge the optimized weights of the data blocks into the weights of the whole metabolome feature data and output them.
The working principle of the big-data-oriented metabolome feature data analysis system of the present invention is shown in Fig. 3:
S1. Input the metabolome feature data.
S2. The data segmentation module splits the data. After the data is input to the data segmentation module, the module splits it into data blocks $B_1$, $B_2$, ..., $B_M$. The data blocks are mapped to the computing nodes of the MapReduce framework, that is, sent to the heuristic weighting modules.
S3. The heuristic weighting modules optimize the weights. The data block weights optimized by each heuristic weighting module are sent to the weight fusion module.
S4. The weight fusion module applies reduce processing to the optimized weights.
S5. Check whether the iterations are complete; if not, return to step S2; if so, proceed to step S6.
S6. Output the weight vector and the classification/regression prediction accuracy.
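The control flow of steps S1 to S6 can be sketched as a small Python driver. This is only an outline under stated assumptions: `run_iterations` and the three `toy_*` stand-ins are hypothetical, the map phase runs sequentially here whereas a MapReduce framework would distribute it across nodes, and the toy optimizer merely halves each weight so that the data flow is easy to follow.

```python
def run_iterations(weights, total_iterations, split, optimise_block, merge):
    """Repeat split -> map -> reduce for a fixed number of iterations."""
    for _ in range(total_iterations):                 # S5: loop until all iterations are done
        blocks = split(weights)                       # S2: data segmentation module
        pairs = [optimise_block(b) for b in blocks]   # S3: heuristic weighting (map phase)
        weights = merge(pairs, len(weights))          # S4: weight fusion (reduce phase)
    return weights                                    # S6: final weight vector


def toy_split(weights, block_count=2):
    """Stand-in splitter: contiguous blocks of (indices, sub-weights)."""
    size = len(weights) // block_count
    return [
        (list(range(m * size, (m + 1) * size)), weights[m * size:(m + 1) * size])
        for m in range(block_count)
    ]


def toy_optimise(block):
    """Stand-in optimizer: halves every weight and returns a key-value pair."""
    indices, sub_weights = block
    return indices, [w / 2 for w in sub_weights]


def toy_merge(pairs, dimension):
    """Stand-in reducer: scatter sub-weights back by index."""
    merged = [0.0] * dimension
    for indices, sub_weights in pairs:
        for d, w in zip(indices, sub_weights):
            merged[d] = w
    return merged


result = run_iterations([1.0] * 4, 2, toy_split, toy_optimise, toy_merge)
```

The output of one iteration becomes the starting weight vector of the next, which is the property the outer loop relies on.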
In a preferred embodiment, the data segmentation process of step S100 is shown in Fig. 4, and its specific steps are:
(1). Initialize the iteration counter $k = 0$.
(2). Read the iteration counter $k$ and test its value. When $k = 0$, construct the $D$-dimensional weight vector $W_0$, with its values initialized to random values in the range $[0, 1]$: $W_0 = [w_1, w_2, \ldots, w_D] = \mathrm{rand}(0, 1)$.
(3). When $k > 0$, use the output weights of the previous iteration as the initial value of the current weight vector, that is, $W_k = W_{k-1}$.
(4). Construct a data block set $\mathbb{B} = \{B_1 = \emptyset, B_2 = \emptyset, \ldots, B_M = \emptyset\}$ containing $M$ empty sets, together with an index vector $\mathbb{D} = [1, 2, 3, \ldots, D]$ containing all index values, and initialize the data block counter $m = 0$.
(5). Construct the sub-index vector $I_m = \emptyset$, the sub-weight vector $W_{k,m} = \emptyset$, and the sub-feature vector set $\mathbb{F}_m = \{F_{m,1}, F_{m,2}, \ldots, F_{m,N}\}$, in which every sub-feature vector $F_{m,n} = \emptyset$, and initialize the in-block counter $l = 0$.
(6). Randomly select an index value $d$ from the index vector $\mathbb{D}$, add it to the sub-index vector $I_m$, and remove $d$ from $\mathbb{D}$.
(7). Add the weight $w_d$ of the weight vector $W_k$ in the $d$-th dimension to the sub-weight vector $W_{k,m}$; then take each feature vector $F_n$ in the metabolome feature data set $\mathbb{F}$ in turn and add its feature signal value $f_d$ in the $d$-th dimension to the $n$-th sub-feature vector $F_{m,n}$ of $\mathbb{F}_m$.
(8). Update the in-block counter $l = l + 1$ and check whether $l$ is less than $L$; if so, go to step (4); otherwise, proceed to step (9).
(9). Add the current data block as $B_m = \{I_m, W_{k,m}, \mathbb{F}_m\}$ and update the data block counter $m = m + 1$; check whether $m$ is less than $M$; if so, go to step (3); otherwise, proceed to step (10).
(10). Map the resulting data block set $\mathbb{B}$ to the computing nodes of the MapReduce framework. Commonly used MapReduce frameworks include Hadoop and Nokia Disco.
Further, the data block weight optimization process of step S200 is shown in Fig. 5 and is specifically as follows:
(1). For the $m$-th heuristic weighting module running in parallel, the input data block is $B_m = \{I_m, W_{k,m}, \mathbb{F}_m\}$.
(2). Construct the evolutionary population $ps$ of the computational intelligence method, in which the candidate solution of each search individual is an $L$-dimensional vector $X_i$, where $i = 1, 2, \ldots, |ps|$, initialized as $X_i = W_{k,m}$.
(3). Set the maximum number of iterations of the computational intelligence method to $G$ and initialize the iteration counter $g = 0$.
(4). Compute the fitness function value of each search individual in the evolutionary population $ps$.
(5). Based on the fitness function values of the search individuals, optimize the evolutionary population $ps$ using the computational intelligence method. Commonly used algorithms include Differential Evolution (DE), Particle Swarm Optimization (PSO), and the Memetic Algorithm (MA).
(6). Update the iteration counter $g = g + 1$ and check whether $g$ is less than $G$; if so, go to step (4); otherwise, proceed to step (7).
(7). After the optimization completes, take the candidate solution $X_{best}$ of the best individual in the population as the optimized sub-weight vector $W_{k,m}$, that is,
$$W_{k,m} = X_{best} = \arg\min_{X_i \in ps} f(X_i)$$
(8). Combine the sub-weight vector $W_{k,m}$ and the sub-index vector $I_m$ into the key-value pair $P_m = \langle I_m, W_{k,m} \rangle$, which serves as the output of the map phase of the MapReduce framework.
In a preferred embodiment, step (4) further comprises:
a) For the $i$-th input search individual, take its candidate solution vector $X_i$ as the sub-weight vector $W_m$.
b) Multiply $W_m$ with each sub-feature vector $F_{m,n}$ in $\mathbb{F}_m$ to weight it; whenever a weight $w_l$ in $W_m$ is less than a preset threshold $\delta$, delete the corresponding metabolic feature signal $f_l$ in that dimension to reduce the dimensionality, finally forming the weighted sub-feature vector $F^*_{m,n}$:
$$F^*_{m,n} = F_{m,n} \otimes W_m = [\, f_l \cdot w_l \mid f_l \in F_{m,n},\ w_l \in W_m,\ w_l > \delta \,]$$
c) Use the weighted sub-feature vector set $\mathbb{F}^*_m = \{F^*_{m,1}, F^*_{m,2}, \ldots, F^*_{m,N}\}$ to train a machine learning classification/regression algorithm and obtain the prediction accuracy of that algorithm. In the weighting analysis of metabolome feature data, algorithms such as support vector machines based on kernel methods and the Extreme Learning Machine (ELM) are generally used.
d) Take the prediction accuracy of the classification/regression algorithm as the fitness function value $f(X_i)$ of the current individual $X_i$. For classification algorithms, the accuracy is expressed as the classification error rate; for regression algorithms, it is expressed as the root mean square error (RMSE).
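The optimization loop of step S200 can be sketched with differential evolution, one of the algorithms named above. The Python below is a minimal DE/rand/1/bin variant written for illustration and is not taken from the patent: the function names, population size, scale factor, crossover rate, and clipping to the range 0 to 1 are all assumptions. One deliberate deviation is labeled in the code: the text initializes every individual to the incoming sub-weight vector, but identical individuals give DE no difference vectors to work with, so this sketch keeps one exact copy and perturbs the rest.

```python
import random


def differential_evolution(fitness, start, pop_size=12, generations=40,
                           scale=0.5, crossover=0.9, rng=None):
    """Minimize fitness over weight vectors, starting from the vector `start`."""
    if pop_size < 4:
        raise ValueError("DE/rand/1 needs at least four individuals")
    rng = rng or random.Random()
    dim = len(start)

    def clip(value):
        return min(1.0, max(0.0, value))

    # Assumption: individual 0 is the incoming vector, the others are perturbed copies.
    population = [list(start)] + [
        [clip(w + rng.uniform(-0.5, 0.5)) for w in start]
        for _ in range(pop_size - 1)
    ]
    scores = [fitness(x) for x in population]
    for _ in range(generations):                      # g = 0 .. G - 1
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)               # at least one mutated dimension
            trial = [
                clip(population[a][l] + scale * (population[b][l] - population[c][l]))
                if rng.random() < crossover or l == forced
                else population[i][l]
                for l in range(dim)
            ]
            trial_score = fitness(trial)
            if trial_score <= scores[i]:              # greedy one-to-one selection
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


def map_block(sub_index, sub_weights, fitness, **kwargs):
    """Map-phase output: the key-value pair (sub-index vector, best sub-weights)."""
    best, _ = differential_evolution(fitness, sub_weights, **kwargs)
    return sub_index, best
```

Because selection only ever replaces an individual with a trial that scores no worse, the best score returned can never exceed the score of the starting vector. In practice the `fitness` argument would wrap the classifier or regressor error described in sub-steps a) to d).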
In a preferred embodiment, the reduce processing that step S300 applies to the optimized weights is shown in Fig. 6 and is specifically as follows:
(1). Collect all $M$ output key-value pairs to form the key-value pair set $\mathbb{P} = \{P_1, P_2, \ldots, P_M\}$ and apply reduce processing to it.
(2). Construct an all-zero $D$-dimensional weight vector $W_k = [0, 0, \ldots, 0]$ and initialize the data block counter $m = 0$.
(3). Take the $m$-th key-value pair $P_m = \langle I_m, W_{k,m} \rangle$ from the set $\mathbb{P}$ and initialize the in-block counter $l = 0$.
(4). Add the weight in the $l$-th dimension of the sub-weight vector $W_{k,m}$ to the $d$-th dimension of the weight vector $W_k$, that is, $W_k = \{ w_d = W_{k,m}[l] \mid d = I_m[l] \},\ l = 1, 2, \ldots, L$.
(5). Update the in-block counter $l = l + 1$ and check whether $l$ is less than $L$; if so, go to step (4); otherwise, proceed to step (6).
(6). Update the data block counter $m = m + 1$ and check whether $m$ is less than $M$; if so, go to step (3); otherwise, proceed to step (7).
(7). Update the iteration counter $k = k + 1$ and check whether $k$ is less than $K$; if so, go to sub-step (2) of step S100; otherwise, proceed to step (8).
(8). Weight the input metabolome feature data set $\mathbb{F}$ with the final weight vector $W_k$. The weighted data set is then used to train a machine learning algorithm to obtain the overall classification/regression prediction accuracy, following sub-steps b) to d) of sub-step (4) of step S200. Finally, the weight vector $W_k$ and the classification/regression prediction accuracy are output as the result.
Compared with the prior art, the system of the present invention has the following advantages:
First, the system is a parallel weighting analysis system built on the MapReduce framework and designed for the characteristics of big metabolome feature data. On the one hand, block-wise data processing reduces the difficulty of the weighting analysis and effectively improves prediction accuracy. On the other hand, the parallel structure means that the system can be deployed to multiple computing nodes (for example, multiple computers) for simultaneous processing, which significantly reduces the overall computation time. In addition, the MapReduce framework can schedule, regulate, and balance the computing nodes, ensuring the efficiency and stability of the system.
Second, computational intelligence algorithms can effectively solve complex large-scale optimization problems. Introducing them into the heuristic weighting modules to optimize the sub-weight vectors yields better analysis results. Experimental data show that the weight design method based on computational intelligence achieves better prediction accuracy than other existing feature weighting and feature selection algorithms. It allows a more effective estimate of the target physiological state and thus better guides subsequent biological and medical applications.
Third, each weight value in the optimized weight vector describes how strongly the corresponding metabolite signal, and the metabolite it represents, is related to the predicted target physiological state. This information is important for subsequent research and can help clarify the mechanisms behind the metabolic processes of organisms.
It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or variations based on the above description, and all such improvements and variations fall within the scope of protection of the appended claims.

Claims

1. A big-data-oriented metabolome feature data analysis method, characterized in that the method comprises the following steps:
A. receiving the input metabolome feature data, splitting it into multiple data blocks, and mapping the data blocks to the computing nodes of the MapReduce framework;
B. optimizing the weights on the multiple data blocks simultaneously using a computational intelligence method;
C. merging the optimized weights of the data blocks into the weights of the whole metabolome feature data and outputting them.
2. The big-data-oriented metabolome feature data analysis method according to claim 1, characterized in that the metabolome feature data is represented as the metabolome feature data set $\mathbb{F} = \{F_1, F_2, \ldots, F_N\}$, where $F_n = [f_1, f_2, \ldots, f_D]$ is the $n$-th feature vector, $N$ is the data set size, and $D$ is the total dimension of the feature vectors; the number of data blocks is $M$, each data block contains $L = \lfloor D/M \rfloor$ elements, and the total number of system iterations is set to $K$.
3. The big-data-oriented metabolome feature data analysis method according to claim 2, characterized in that step A specifically comprises:
A1. reading the iteration counter $k$ and testing its value; when $k = 0$, constructing the $D$-dimensional weight vector $W_0$, with its values initialized to random values in the range $[0, 1]$; when $k > 0$, using the output weights of the previous iteration as the initial value of the current weight vector, that is, $W_k = W_{k-1}$;
A2. constructing a data block set $\mathbb{B} = \{B_1 = \emptyset, B_2 = \emptyset, \ldots, B_M = \emptyset\}$ containing $M$ empty sets, together with an index vector $\mathbb{D} = [1, 2, 3, \ldots, D]$ containing all index values, and initializing the data block counter $m = 0$;
A3. constructing the sub-index vector $I_m = \emptyset$, the sub-weight vector $W_{k,m} = \emptyset$, and the sub-feature vector set $\mathbb{F}_m = \{F_{m,1}, F_{m,2}, \ldots, F_{m,N}\}$, in which every sub-feature vector $F_{m,n} = \emptyset$, and initializing the in-block counter $l = 0$;
A4. randomly selecting an index value $d$ from the index vector $\mathbb{D}$, adding it to the sub-index vector $I_m$, and removing $d$ from $\mathbb{D}$; adding the weight $w_d$ of the weight vector $W_k$ in the $d$-th dimension to the sub-weight vector $W_{k,m}$; taking each feature vector $F_n$ in the metabolome feature data set $\mathbb{F}$ in turn and adding its feature signal value $f_d$ in the $d$-th dimension to the $n$-th sub-feature vector $F_{m,n}$ of $\mathbb{F}_m$;
A5. updating the in-block counter $l = l + 1$ and checking whether $l$ is less than $L$; if so, going to step A2; otherwise, proceeding to step A6;
A6. adding the current data block as $B_m = \{I_m, W_{k,m}, \mathbb{F}_m\}$ and updating the data block counter $m = m + 1$; checking whether $m$ is less than $M$; if so, going to step A1; otherwise, proceeding to step A7;
A7. mapping the resulting data block set $\mathbb{B}$ to the computing nodes of the MapReduce framework.
4. The big-data-oriented metabolome feature data analysis method according to claim 3, characterized in that, before step A1, the method further comprises: initializing the iteration counter $k = 0$.
5. The big-data-oriented metabolome feature data analysis method according to claim 4, characterized in that step B specifically comprises:
B1. for the data block $B_m = \{I_m, W_{k,m}, \mathbb{F}_m\}$, constructing the evolutionary population $ps$ of the computational intelligence method, in which the candidate solution of each search individual is an $L$-dimensional vector $X_i$, where $i = 1, 2, \ldots, |ps|$, initialized as $X_i = W_{k,m}$;
B2. setting the maximum number of iterations of the computational intelligence method to $G$ and initializing the iteration counter $g = 0$;
B3. computing the fitness function value of each search individual in the evolutionary population $ps$ and, based on these fitness values, optimizing the evolutionary population $ps$ using the computational intelligence method;
B4. updating the iteration counter $g = g + 1$ and checking whether $g$ is less than $G$; if so, going to step B3; otherwise, proceeding to step B5;
B5. taking the candidate solution $X_{best}$ of the best individual in the population as the optimized sub-weight vector $W_{k,m}$, that is,
$$W_{k,m} = X_{best} = \arg\min_{X_i \in ps} f(X_i)$$
B6. combining the sub-weight vector $W_{k,m}$ and the sub-index vector $I_m$ into the key-value pair $P_m = \langle I_m, W_{k,m} \rangle$, which serves as the output of the map phase of the MapReduce framework.
6. The big-data-oriented metabolome feature data analysis method according to claim 5, characterized in that the computational intelligence method includes differential evolution, particle swarm optimization, or a memetic algorithm.
7. The big-data-oriented metabolome feature data analysis method according to claim 6, characterized in that computing the fitness function value of each search individual in the evolutionary population $ps$ in step B3 specifically comprises:
B31. for the $i$-th input search individual, taking its candidate solution vector $X_i$ as the sub-weight vector $W_m$;
B32. multiplying $W_m$ with each sub-feature vector $F_{m,n}$ in $\mathbb{F}_m$ to weight it; whenever a weight $w_l$ in $W_m$ is less than a preset threshold $\delta$, deleting the corresponding metabolic feature signal $f_l$ in that dimension to reduce the dimensionality, finally forming the weighted sub-feature vector $F^*_{m,n}$:
$$F^*_{m,n} = F_{m,n} \otimes W_m = [\, f_l \cdot w_l \mid f_l \in F_{m,n},\ w_l \in W_m,\ w_l > \delta \,]$$
B33. using the weighted sub-feature vector set $\mathbb{F}^*_m = \{F^*_{m,1}, F^*_{m,2}, \ldots, F^*_{m,N}\}$ to train a machine learning classification/regression algorithm and obtaining the prediction accuracy of that algorithm;
B34. taking the prediction accuracy of the classification/regression algorithm as the fitness function value $f(X_i)$ of the current individual $X_i$.
8. The big-data-oriented metabolome feature data analysis method according to claim 7, characterized in that step C specifically comprises:
C1. collecting all $M$ output key-value pairs to form the key-value pair set $\mathbb{P} = \{P_1, P_2, \ldots, P_M\}$ and applying reduce processing to it;
C2. constructing an all-zero $D$-dimensional weight vector $W_k = [0, 0, \ldots, 0]$ and initializing the data block counter $m = 0$;
C3. taking the $m$-th key-value pair $P_m = \langle I_m, W_{k,m} \rangle$ from the set $\mathbb{P}$ and initializing the in-block counter $l = 0$;
C4. adding the weight in the $l$-th dimension of the sub-weight vector $W_{k,m}$ to the $d$-th dimension of the weight vector $W_k$, that is, $W_k = \{ w_d = W_{k,m}[l] \mid d = I_m[l] \},\ l = 1, 2, \ldots, L$;
C5. updating the in-block counter $l = l + 1$ and checking whether $l$ is less than $L$; if so, going to step C4; otherwise, proceeding to step C6;
C6. updating the data block counter $m = m + 1$ and checking whether $m$ is less than $M$; if so, going to step C3; otherwise, proceeding to step C7;
C7. updating the iteration counter $k = k + 1$ and checking whether $k$ is less than $K$; if so, going to step A; otherwise, proceeding to step C8;
C8. weighting the input metabolome feature data set $\mathbb{F}$ with the final weight vector $W_k$.
9、 根据权利要求 8所述的面向大数据的代谢组特征数据分析方 法, 其特征在于, 利用最终得到的加权矢量 对输入代谢组特征数 据集 IF进行加权, 而后将其用于训练机器学习算法, 获得整体的分类 9. The big data-oriented metabolome feature data analysis method according to claim 8, wherein the input metabolome feature data set IF is weighted by using the finally obtained weight vector, and then used to train a machine learning algorithm. , get the overall classification
/回归预测准确率, 将加权矢量 ^与分类 /回归预测准确率作为结 果输出。 /Regression prediction accuracy, and the weight vector ^ and the classification/regressive prediction accuracy are output as results.
10. A big-data-oriented metabolome feature data analysis system, characterized in that the system comprises:
a data segmentation module, configured to receive input metabolome feature data, divide it into a plurality of data blocks, and map the plurality of data blocks to the respective computing nodes of a MapReduce framework;
a heuristic weighting module, configured to use a computational intelligence method to simultaneously optimize the weights on the plurality of data blocks produced by the data segmentation module;
a weight fusion module, configured to merge the optimized weights of the plurality of data blocks into the weights of the overall metabolome feature data and output them.
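The three modules of claim 10 can be lined up in a small single-machine sketch. Everything here is an assumption made for illustration: a thread pool stands in for the MapReduce computing nodes, a seeded random search stands in for the computational intelligence method, the strided split of feature dimensions is one arbitrary way to form blocks, and the caller supplies the `fitness` function that scores a candidate weight vector on a block.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_dimensions(n_dims, n_blocks):
    """Data segmentation: deal the feature indices into n_blocks index vectors."""
    return [list(range(m, n_dims, n_blocks)) for m in range(n_blocks)]


def optimise_block(F, I_m, fitness, seed, n_trials=50):
    """Heuristic weighting of one block: keep the best of n_trials random vectors."""
    rng = random.Random(seed)
    block = [[row[d] for d in I_m] for row in F]  # columns belonging to this block
    best_w, best_score = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.random() for _ in I_m]
        score = fitness(block, w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w


def fuse(sub_weights, index_vectors, n_dims):
    """Weight fusion: write each block's weights back to their original dimensions."""
    W = [0.0] * n_dims
    for W_m, I_m in zip(sub_weights, index_vectors):
        for l, d in enumerate(I_m):
            W[d] = W_m[l]
    return W


def analyse(F, fitness, n_blocks, seed=0):
    """Run segmentation, per-block optimisation in parallel, then fusion."""
    n_dims = len(F[0])
    index_vectors = split_dimensions(n_dims, n_blocks)
    with ThreadPoolExecutor() as pool:
        sub_weights = list(pool.map(
            lambda item: optimise_block(F, item[1], fitness, seed + 1 + item[0]),
            enumerate(index_vectors)))
    return fuse(sub_weights, index_vectors, n_dims)
```

Each block receives its own seed, so the result is reproducible regardless of the order in which the worker threads finish.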
PCT/CN2014/080283 2014-06-13 2014-06-19 Big data oriented metabolome feature data analysis method and system thereof WO2015188395A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410265541.7 2014-06-13
CN201410265541.7A CN104063631B (en) 2014-06-13 2014-06-13 A kind of metabolism group characteristic analysis method and its system towards big data

Publications (1)

Publication Number Publication Date
WO2015188395A1 true WO2015188395A1 (en) 2015-12-17

Family

ID=51551341

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080283 WO2015188395A1 (en) 2014-06-13 2014-06-19 Big data oriented metabolome feature data analysis method and system thereof

Country Status (2)

Country Link
CN (1) CN104063631B (en)
WO (1) WO2015188395A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523031A (en) * 2018-11-16 2019-03-26 河南智慧云大数据有限公司 A kind of big data intelligence machine learning system for depth analysis
CN110046770A (en) * 2019-04-23 2019-07-23 中国科学技术大学 Grain mildew prediction technique and device
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN114172770A (en) * 2021-11-26 2022-03-11 哈尔滨工程大学 Modulation signal identification method of quantum root tree mechanism evolution extreme learning machine

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114900404B (en) * 2015-12-24 2024-02-09 北京小米移动软件有限公司 Method, apparatus, and system for channel access in unlicensed bands
CN106407743B (en) * 2016-08-31 2019-03-05 上海美吉生物医药科技有限公司 A kind of high-throughput data analysing method based on cluster
CN107133448B (en) * 2017-04-10 2020-05-01 温州医科大学 Metabonomics data fusion optimization processing method
CN108181891B (en) * 2017-12-13 2020-05-05 东北大学 Industrial big data fault diagnosis method based on intelligent core principal component analysis
CN110739076A (en) * 2019-10-29 2020-01-31 上海华东电信研究院 medical artificial intelligence public training platform
CN112202910B (en) * 2020-10-10 2021-10-08 上海威固信息技术股份有限公司 Computer distributed storage system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN101814082A (en) * 2010-01-20 2010-08-25 中国人民解放军总参谋部第六十三研究所 Method for automatic feature weighting and selection in detection of similar and duplicate record based on ant colony optimization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, JINGXI ET AL.: "Assessment Method of Weapon System Effectiveness Based on Support Vector Machine with Weighted Feature", SHIP SCIENCE AND TECHNOLOGY, vol. 35, no. 5, 31 May 2013 (2013-05-31), XP055242113 *
LI, ZHILONG ET AL.: "Text Feature Selection and Feature Weighting Based on GPU", INDUSTRIAL CONTROL COMPUTER, vol. 27, no. 5, 25 May 2014 (2014-05-25) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523031A (en) * 2018-11-16 2019-03-26 河南智慧云大数据有限公司 A kind of big data intelligence machine learning system for depth analysis
CN109523031B (en) * 2018-11-16 2022-12-13 河南智慧云大数据有限公司 Big data intelligent machine learning system for deep analysis
CN110046770A (en) * 2019-04-23 2019-07-23 中国科学技术大学 Grain mildew prediction technique and device
CN110046770B (en) * 2019-04-23 2021-04-23 中国科学技术大学 Grain mildew prediction method and device
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111611293B (en) * 2020-04-24 2023-09-29 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN114172770A (en) * 2021-11-26 2022-03-11 哈尔滨工程大学 Modulation signal identification method of quantum root tree mechanism evolution extreme learning machine
CN114172770B (en) * 2021-11-26 2023-05-02 哈尔滨工程大学 Modulation signal identification method of quantum root tree mechanism evolution extreme learning machine

Also Published As

Publication number Publication date
CN104063631A (en) 2014-09-24
CN104063631B (en) 2017-07-18

Similar Documents

Publication Publication Date Title
WO2015188395A1 (en) Big data oriented metabolome feature data analysis method and system thereof
US20230281465A1 (en) Systems and Methods for Spatial Graph Convolutions with Applications to Drug Discovery and Molecular Simulation
Zhang et al. Reinforced dynamics for enhanced sampling in large atomic and molecular systems
You et al. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis
Tsymbalov et al. Dropout-based active learning for regression
Li et al. A computer aided diagnosis system for thyroid disease using extreme learning machine
WO2016062044A1 (en) Model parameter training method, device and system
Li et al. Recent developments in methods for identifying reaction coordinates
Whata et al. Deep learning for SARS COV-2 genome sequences
Shuvo et al. QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks
Yu et al. Protein function prediction using dependence maximization
Yang et al. A multi-point mechanism of expected hypervolume improvement for parallel multi-objective bayesian global optimization
WO2022088390A1 (en) Image incremental clustering method and apparatus, electronic device, storage medium and program product
Ying et al. Enhanced protein fold recognition through a novel data integration approach
Wang et al. Improving prediction of self-interacting proteins using stacked sparse auto-encoder with PSSM profiles
Caldonazzo Garbelini et al. Sequence motif finder using memetic algorithm
Hu et al. Accelerating multi-objective neural architecture search by random-weight evaluation
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
Hu et al. Learning from deep representations of multiple networks for predicting drug–target interactions
Wei et al. MOO-DNAS: Efficient neural network design via differentiable architecture search based on multi-objective optimization
Vivek et al. Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment
Kaim et al. Ensemble cnn attention-based bilstm deep learning architecture for multivariate cloud workload prediction
Angaitkar et al. gHPCSO: Gaussian distribution based hybrid particle cat swarm optimization for linear B-cell epitope prediction
Liu et al. Cascading model based back propagation neural network in enabling precise classification
Deng et al. Predict the protein-protein interaction between virus and host through hybrid deep neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14894717

Country of ref document: EP

Kind code of ref document: A1

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14894717

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/02/2017)

122 Ep: pct application non-entry in european phase

Ref document number: 14894717

Country of ref document: EP

Kind code of ref document: A1