CN117727372B - Data integration method and system based on regularization model - Google Patents

Info

Publication number
CN117727372B
CN117727372B CN202311789980.3A CN202311789980A CN117727372B CN 117727372 B CN117727372 B CN 117727372B CN 202311789980 A CN202311789980 A CN 202311789980A CN 117727372 B CN117727372 B CN 117727372B
Authority
CN
China
Prior art keywords
data set
feature
model
network knowledge
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311789980.3A
Other languages
Chinese (zh)
Other versions
CN117727372A (en
Inventor
黄海辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaoguan University
Original Assignee
Shaoguan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaoguan University
Priority to CN202311789980.3A
Publication of CN117727372A
Application granted
Publication of CN117727372B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data integration method and system based on a regularization model. The method comprises: setting a Laplace regularization term and an Lq norm penalty term in a preset prediction model to obtain a DSNet model; converting the DSNet model to obtain a converted DSNet model, and obtaining a first feature operator and a second feature operator; inputting a network knowledge dataset and a gene dataset into the converted DSNet model, and outputting the optimal solutions of all feature dimensions of the first feature operator and of the second feature operator; and integrating the network knowledge dataset and the gene dataset in each feature dimension according to these optimal solutions to obtain an integration result of the network knowledge dataset and the gene dataset. By using the regularization model for integrated analysis of biological data and network knowledge data, the invention improves the accuracy and efficiency of integrated analysis of high-dimensional data.

Description

Data integration method and system based on regularization model
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data integration method and system based on a regularization model.
Background
With the rapid development of modern high-throughput biomedical instruments, a large amount of data has accumulated in the life sciences. For example, the Gene Expression Omnibus (GEO) has collected more than 3.4 million samples. With this much data, identifying robust gene biomarkers that are relevant to the onset and progression of a given disease from such a vast data pool is a great challenge.
When analyzing gene expression data with machine learning techniques, researchers typically face three major problems: "large p, small n", data heterogeneity, and low reproducibility. In the prior art, the main approach to these problems is to process multiple gene datasets through meta-analysis or integrated analysis, which improves statistical performance in genomic studies. Typical gene dataset integration methods include meta threshold gradient descent regularization, meta-lasso, meta-nonconvex optimization, the data shared lasso (DSL), DSL2, and the like.
However, these typical gene dataset integration methods do not make full use of external biological knowledge such as gene-gene or protein-protein interaction networks, which limits their performance. In integrated analysis of a gene dataset together with an external network knowledge dataset, the L1 penalty term can be applied to handle prior network knowledge in various models. However, the L1 penalty term tends to shrink model coefficients, which introduces additional bias, especially for high-dimensional data such as genomic data. The Lq penalty term (0 < q < 1) can in theory provide better sparsity, computational efficiency, and accuracy than the L1 penalty term.
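The bias argument can be illustrated on a single coefficient. The sketch below, in plain Python, compares the closed-form minimiser of (b − ω)^2 + λ|b| (soft thresholding, the L1 case) with the minimiser of (b − ω)^2 + λ|b|^(1/2) (the half-thresholding rule known from the L1/2 regularization literature). The function names and the scalar objective are illustrative assumptions and are not taken from the described method.

```python
import math

def soft_threshold(omega, lam):
    """Minimiser of (b - omega)^2 + lam * |b| (L1 penalty)."""
    return math.copysign(max(abs(omega) - lam / 2.0, 0.0), omega)

def half_threshold(omega, lam):
    """Minimiser of (b - omega)^2 + lam * |b|^(1/2) (L1/2 penalty)."""
    # Below this cutoff the minimiser is exactly zero.
    cutoff = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(omega) <= cutoff:
        return 0.0
    phi = math.acos((lam / 8.0) * (abs(omega) / 3.0) ** -1.5)
    return (2.0 / 3.0) * omega * (1.0 + math.cos(2.0 * (math.pi - phi) / 3.0))

if __name__ == "__main__":
    for omega in (0.5, 1.0, 2.0, 5.0):
        print(omega, soft_threshold(omega, 1.0), half_threshold(omega, 1.0))
```

For ω = 2 and λ = 1, soft thresholding returns 1.5 while half thresholding returns about 1.81, so the large coefficient is shrunk less, and small inputs such as ω = 0.5 are set exactly to zero.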
Disclosure of Invention
The invention provides a regularization model-based data integration method and a regularization model-based data integration system, which are used for integrating and analyzing biological data and network knowledge data in a regularization model, so that accuracy and efficiency of high-dimensional data integration analysis are improved.
In order to solve the technical problems, the invention provides a regularization model-based data integration method, which comprises the following steps:
Acquiring a network knowledge data set and a gene data set;
Setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;
Converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting optimal solutions of all feature dimensions of the first feature operator and optimal solutions of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
Integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain an integration result of the network knowledge dataset and the gene dataset.
A Laplace regularization term and an Lq norm penalty term are added to the prediction model as constraints to obtain the DSNet model; the network knowledge dataset and the gene dataset to be integrated are then input into the DSNet model, the optimal solution of each feature operator is calculated in every feature dimension, and the data are integrated in each feature dimension according to the optimal solutions obtained. The method and the system use the regularization model to integrate and analyze the biological data and the network knowledge data, which improves the accuracy and efficiency of integrated analysis of high-dimensional data.
Further, setting a laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model includes:
Setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;
wherein, DSNet model specifically is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| represents smoothing β over the network knowledge dataset; and |Δ_d|^T L |Δ_d| represents smoothing Δ_d over the network knowledge dataset.
The invention integrates the gene dataset and the network knowledge dataset by using the regularization DSNet model, and sets regularization items in the DSNet model to improve the generalization capability of the model and prevent the model from being overfitted, so that the meta-analysis of the gene dataset is advanced by the integrated analysis of the gene dataset and the network knowledge dataset, and the performance of the model and the accuracy of the data integration result are improved.
Further, the converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator, including:
converting the DSNet model based on each characteristic dimension to obtain a conversion expression of the DSNet model; the conversion expression of DSNet model is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; I_i represents the degree of feature i of the gene dataset, i.e., the number of edges connected to i; I_k denotes the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i and β_k represent the i-th and k-th dimensions of β; and Δ_{d,i} and Δ_{d,k} represent the i-th and k-th dimensions of Δ_d;
Obtaining a first feature operator beta and a second feature operator delta d from a conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
Inputting the network knowledge dataset and the gene dataset into a DSNet model after conversion, and calculating optimal solutions of all feature dimensions of a first feature operator beta and optimal solutions of all feature dimensions of a second feature operator delta d by using a coordinate descent algorithm.
Further, the calculating the optimal solution of all feature dimensions of the first feature operator β and the optimal solution of all feature dimensions of the second feature operator Δ d by using a coordinate descent algorithm includes:
calculating the optimal solution of all feature dimensions of the first feature operator beta, wherein the optimal solution is specifically as follows:
wherein,
Where β_k represents the k-th dimension of β; β represents the shared effect that remains consistent across the D datasets; k indexes the dimensions of β; b_ik represents the link status between data i in the gene dataset and data k in the network knowledge dataset; x_ik represents the k-th gene in the i-th sample of the gene dataset; x_ij represents the j-th gene in the i-th sample of the gene dataset; y_i represents the true label value; and ω_k, λ_{1,k}, m_{1,k}, ν_i and m_{2,k} represent intermediate quantities in the computation;
the optimal solution of all feature dimensions of the second feature operator delta d is calculated as follows:
wherein,
m_{1,k} = 1 + λ_4 sgn(Δ_{d,k})
Where Δ_{d,k} represents the k-th dimension of Δ_d; Δ_d represents the unique effect specific to one dataset; k indexes the dimensions of Δ_d; x_ik represents the k-th gene in the i-th sample of the gene dataset; x_ij represents the j-th gene in the i-th sample of the gene dataset; y_i represents the true label value; and λ_{2,k}, ω_k, m_{1,k}, m_{2,k} and r_ik represent intermediate quantities in the computation.
According to the method, the optimal solutions of the characteristic operators for representing the data sharing effect and the characteristic operators for representing the data unique effect in all characteristic dimensions are calculated, the unified model of all data sets and the individual model of each data set are balanced, and the reliability and the accuracy are improved; the uniqueness of the data set is emphasized, and meanwhile, the public property of the data set is utilized to conduct comprehensive and accurate data integration analysis.
Further, the integrating the network knowledge data set and the gene data set in each feature dimension according to the optimal solution of all feature dimensions of the first feature operator and the optimal solution of all feature dimensions of the second feature operator to obtain an integration result of the network knowledge data set and the gene data set, including:
Judging whether links exist between the network knowledge data set and the gene data set in each characteristic dimension according to the optimal solutions of all the characteristic dimensions of the first characteristic operator and the optimal solutions of all the characteristic dimensions of the second characteristic operator;
If the network knowledge data set and the gene data set are linked in any one characteristic dimension, integrating the network knowledge data set and the gene data set in the characteristic dimension with the link to obtain a data integration result;
if the network knowledge data set and the gene data set are not linked in all feature dimensions, classifying the network knowledge data set and the gene data set to obtain a data classification result;
Integrating the data integration result and the data classification result in all feature dimensions, and outputting the integration result of the network knowledge data set and the gene data set.
According to the invention, the homogeneity and the heterogeneity of the gene data set and the network knowledge data set are analyzed according to the linking condition of the gene data set and the network knowledge data set in each dimension, so that the integration and the classification in the characteristic dimension are performed, and the stable and reliable gene data element analysis is promoted on the basis of considering the homogeneity, the heterogeneity and the priori external knowledge of the data.
Corresponding to the above method embodiment, the invention provides a system embodiment, namely a data integration system based on a regularization model, comprising: a data acquisition module, a penalty term setting module, a feature solving module and a data integration module;
the data acquisition module is used for acquiring a network knowledge data set and a gene data set;
the penalty term setting module is used for setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;
The feature solving module is configured to convert the DSNet model to obtain a converted DSNet model, obtain a first feature operator and a second feature operator, input the network knowledge dataset and the gene dataset into the converted DSNet model, and output an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
The data integration module is used for integrating a network knowledge data set and a gene data set on each characteristic dimension according to the optimal solutions of all characteristic dimensions of the first characteristic operator and the optimal solutions of all characteristic dimensions of the second characteristic operator to obtain an integration result of the network knowledge data set and the gene data set.
Further, the penalty term setting module is configured to set a laplace regularization term and an L1/2 norm penalty term in a preset prediction model, so as to obtain a DSNet model, which specifically is:
The penalty term setting module is used for setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;
wherein, DSNet model specifically is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| represents smoothing β over the network knowledge dataset; and |Δ_d|^T L |Δ_d| represents smoothing Δ_d over the network knowledge dataset.
Further, the feature solving module includes: a model conversion unit, a feature operator unit and a multidimensional solving unit;
The model conversion unit is used for converting the DSNet model based on each characteristic dimension to obtain a conversion expression of the DSNet model; the conversion expression of DSNet model is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; I_i represents the degree of feature i of the gene dataset, i.e., the number of edges connected to i; I_k denotes the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i and β_k represent the i-th and k-th dimensions of β; and Δ_{d,i} and Δ_{d,k} represent the i-th and k-th dimensions of Δ_d;
The feature operator unit is used for obtaining a first feature operator beta and a second feature operator delta d from a conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
The multidimensional solving unit is used for inputting the network knowledge dataset and the gene dataset into the DSNet model after conversion, and calculating the optimal solution of all feature dimensions of the first feature operator beta and the optimal solution of all feature dimensions of the second feature operator delta d by using a coordinate descent algorithm.
Further, the multidimensional solving unit includes: a sharing effector unit and a unique effector unit;
the sharing effector unit is configured to calculate an optimal solution of all feature dimensions of the first feature operator β, and specifically includes:
wherein,
Where β_k represents the k-th dimension of β; β represents the shared effect that remains consistent across the D datasets; k indexes the dimensions of β; b_ik represents the link status between data i in the gene dataset and data k in the network knowledge dataset; x_ik represents the k-th gene in the i-th sample of the gene dataset; x_ij represents the j-th gene in the i-th sample of the gene dataset; y_i represents the true label value; and ω_k, λ_{1,k}, m_{1,k}, ν_i and m_{2,k} represent intermediate quantities in the computation;
The unique effector unit is configured to calculate an optimal solution of all feature dimensions of the second feature operator Δ d, and specifically includes the following steps:
wherein,
m_{1,k} = 1 + λ_4 sgn(Δ_{d,k})
Where Δ_{d,k} represents the k-th dimension of Δ_d; Δ_d represents the unique effect specific to one dataset; k indexes the dimensions of Δ_d; x_ik represents the k-th gene in the i-th sample of the gene dataset; x_ij represents the j-th gene in the i-th sample of the gene dataset; y_i represents the true label value; and λ_{2,k}, ω_k, m_{1,k}, m_{2,k} and r_ik represent intermediate quantities in the computation.
Further, the data integration module includes: a dimension link unit, an integration unit, a classification unit and a feature merging unit;
The dimension link unit is used for judging whether the network knowledge dataset and the gene dataset are linked in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator;
the integration unit is used for integrating the network knowledge data set and the gene data set on the characteristic dimension with the link if the network knowledge data set and the gene data set are linked on any characteristic dimension, so as to obtain a data integration result;
the classification unit is used for classifying the network knowledge dataset and the gene dataset to obtain a data classification result if the network knowledge dataset and the gene dataset are not linked in any feature dimension;
the feature merging unit is used for integrating the data integration result and the data classification result in all feature dimensions and outputting the integration result of the network knowledge dataset and the gene dataset.
Drawings
Fig. 1: a flow chart of one embodiment of the regularization model-based data integration method provided by the invention;
Fig. 2: a module structure diagram of one embodiment of the regularization model-based data integration system provided by the invention;
Fig. 3: a module structure diagram of one embodiment of the feature solving module provided by the invention;
Fig. 4: a module structure diagram of one embodiment of the multidimensional solving unit provided by the invention;
Fig. 5: a module structure diagram of one embodiment of the data integration module provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
In the description of the present application, it should be understood that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first", "second", or "third" may explicitly or implicitly include one or more such features. In the description of the application, unless otherwise indicated, "a plurality of" means two or more.
The invention provides a regularization model-based data integration method and a regularization model-based data integration system, which are used for integrating and analyzing biological data and network knowledge data in a regularization model, so that accuracy and efficiency of high-dimensional data integration analysis are improved.
Example 1
Based on the above requirements, an embodiment of the present invention provides a regularization model-based data integration method, where the flow of the method is shown in fig. 1, and the method includes steps S1 to S4, and the steps are specifically as follows:
S1, acquiring a network knowledge data set and a gene data set.
Relevant network knowledge data and gene data are acquired from the big data platform, and the acquired data are normalized to obtain the network knowledge dataset and the gene dataset.
Based on the network knowledge dataset and the gene dataset described above, assume the dataset has N observations, each of the form (x_i, y_i, d_i), where x_i ∈ R^p and p represents the number of genes in the gene dataset; y_i ∈ {0, 1} denotes the true label; and d_i ∈ {1, 2, …, D}, where D denotes the number of datasets being processed. Stacking the x_i as rows forms the matrix X, with y = (y_1, y_2, ..., y_N) and d = (d_1, d_2, ..., d_N).
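The data layout described in step S1 can be sketched in plain Python as follows. The helper names `build_design` and `standardize_columns` are hypothetical, and the choice of z-score standardization is an assumption, since the text only states that the acquired data are normalized.

```python
from statistics import mean, pstdev

def build_design(observations):
    """Split a list of (x_i, y_i, d_i) tuples into X (rows x_i), y and d."""
    X = [list(x) for x, _, _ in observations]
    y = [label for _, label, _ in observations]
    d = [dataset_id for _, _, dataset_id in observations]
    return X, y, d

def standardize_columns(X):
    """Centre every gene (column) and scale it to unit variance."""
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    # A constant column has zero spread; it is centred but not rescaled.
    sd = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in X]

# Two observations with p = 2 genes, taken from datasets 1 and 2.
observations = [((1.0, 10.0), 1, 1), ((3.0, 10.0), 0, 2)]
X, y, d = build_design(observations)
Z = standardize_columns(X)
```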
S2, setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model.
A prediction model oriented to the network knowledge dataset and the gene dataset is constructed, and a Laplace regularization term and an L1/2 norm penalty term are set in this preset prediction model to obtain the DSNet model.
Wherein, DSNet model specifically is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| represents smoothing β over the network knowledge dataset; and |Δ_d|^T L |Δ_d| represents smoothing Δ_d over the network knowledge dataset.
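As a rough illustration of how the terms defined above could fit together, the following plain-Python sketch evaluates one possible objective. It rests on several assumptions that the text does not confirm: a squared-error loss, an L1/2 penalty computed as the sum of |β_k|^(1/2), the weight r_d multiplying the sparsity penalty of Δ_d (as in the data shared lasso), and dataset indices starting at 0. The function names are hypothetical.

```python
import math

def quad_form_abs(v, L):
    """|v|^T L |v| for a symmetric Laplacian L given as nested lists."""
    a = [abs(t) for t in v]
    p = len(a)
    return sum(a[i] * L[i][k] * a[k] for i in range(p) for k in range(p))

def half_penalty(v):
    """Sum of |v_k|^(1/2), used here as the L1/2 sparsity penalty."""
    return sum(math.sqrt(abs(t)) for t in v)

def dsnet_objective(X, y, d, beta, delta, L, lam, r):
    """Assumed objective: loss + sparsity terms + Laplacian smoothness terms.

    lam = (lam1, lam2, lam3, lam4); delta[g] and r[g] belong to dataset g.
    """
    lam1, lam2, lam3, lam4 = lam
    loss = 0.0
    for x_i, y_i, d_i in zip(X, y, d):
        # Each sample is predicted with the shared plus its dataset's unique effect.
        pred = sum(x * (b + u) for x, b, u in zip(x_i, beta, delta[d_i]))
        loss += 0.5 * (y_i - pred) ** 2
    sparsity = lam1 * half_penalty(beta) + lam2 * sum(
        r[g] * half_penalty(delta[g]) for g in range(len(delta)))
    smooth = lam3 * quad_form_abs(beta, L) + lam4 * sum(
        quad_form_abs(delta_g, L) for delta_g in delta)
    return loss + sparsity + smooth
```

Because the smoothness terms act on absolute values, two linked features with coefficients of equal magnitude but opposite sign incur no smoothness penalty in this sketch.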
S3, converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting optimal solutions of all feature dimensions of the first feature operator and optimal solutions of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used to represent a data unique effect. Step S3 comprises steps S3.1 to S3.3, as follows:
S3.1, converting the DSNet model based on each feature dimension to obtain a conversion expression of the DSNet model.
The conversion expression of DSNet model is as follows:
Where β represents the shared effect that remains consistent across the D datasets; Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 represent parameters that control model sparsity; λ_3 and λ_4 represent parameters that adjust model smoothness; x_i denotes an input matrix; y_i denotes an output true label value; the matrix superscript T denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed; r_d represents the weight of the unique effect of each dataset; I_i represents the degree of feature i of the gene dataset, i.e., the number of edges connected to i; I_k denotes the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i and β_k represent the i-th and k-th dimensions of β; and Δ_{d,i} and Δ_{d,k} represent the i-th and k-th dimensions of Δ_d.
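The role of the degrees and the link indicators b_ik in a per-dimension expression can be checked numerically. The sketch below assumes that L is the symmetric normalized graph Laplacian and that b_ik acts as the adjacency between features i and k; under these assumptions the quadratic form v^T L v equals a sum over linked pairs of (v_i/sqrt(I_i) − v_k/sqrt(I_k))^2, with v = |β|. This is one common reading of the symbols, not a formula confirmed by the text, and all function names are hypothetical.

```python
import math

def degrees(B):
    """Degree of each feature: number of edges connected to it."""
    return [sum(row) for row in B]

def normalized_laplacian(B):
    """Symmetric normalized Laplacian of a 0/1 adjacency matrix B."""
    deg = degrees(B)
    p = len(B)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for k in range(p):
            if i == k and deg[i] > 0:
                L[i][k] = 1.0
            elif B[i][k]:
                L[i][k] = -1.0 / math.sqrt(deg[i] * deg[k])
    return L

def quad_form(v, L):
    """Matrix form v^T L v."""
    p = len(v)
    return sum(v[i] * L[i][k] * v[k] for i in range(p) for k in range(p))

def edgewise_sum(v, B):
    """Per-dimension form: one squared difference per linked pair (i, k)."""
    deg = degrees(B)
    total = 0.0
    for i in range(len(B)):
        for k in range(i + 1, len(B)):
            if B[i][k]:
                total += (v[i] / math.sqrt(deg[i]) - v[k] / math.sqrt(deg[k])) ** 2
    return total

# Path graph on three features: 0 - 1 - 2.
B = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
beta = [1.0, -2.0, 0.5]
v = [abs(b) for b in beta]
matrix_value = quad_form(v, normalized_laplacian(B))
edge_value = edgewise_sum(v, B)
```

Both values agree, which is what allows the smoothness term to be handled one coordinate at a time.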
S3.2, a first feature operator beta and a second feature operator delta d are obtained from a conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used to represent a data unique effect.
Based on the conversion expression of the DSNet model from step S3.1, the first feature operator β, which represents the data sharing effect, and the second feature operator Δ_d, which represents the data unique effect, are obtained.
S3.3, inputting the network knowledge dataset and the gene dataset into a DSNet model after conversion, and calculating optimal solutions of all feature dimensions of the first feature operator beta and optimal solutions of all feature dimensions of the second feature operator delta d by using a coordinate descent algorithm.
The network knowledge dataset and the gene dataset from step S1 are input into the converted DSNet model, and the coordinate descent algorithm is used to calculate the optimal solutions of all feature dimensions of the first feature operator β and of the second feature operator Δ_d, which gives the link status of the network knowledge dataset and the gene dataset in all feature dimensions. During the solution of the feature operators, |β_i| is approximated as sgn(β_i)·β_i, where the sign is taken from an estimate of β_i obtained by standard linear regression or Lasso; similarly, |Δ_{d,i}| is approximated as sgn(Δ_{d,i})·Δ_{d,i}, where the sign is taken from an estimate of Δ_{d,i} obtained by standard linear regression or Lasso.
Step S3.3 includes steps S3.3.1 to S3.3.2, each of which is specifically as follows:
S3.3.1, calculating the optimal solutions of all feature dimensions of the first feature operator beta.
With the second feature operator Δ_d fixed, the first feature operator β is separated, and the DSNet model is expressed as:
Where β_i represents the i-th dimension of β; x_ij represents the j-th gene in the i-th sample of the gene dataset; β_k represents the k-th dimension of β; x_ik represents the k-th gene in the i-th sample of the gene dataset; β_i also denotes the vector set {β_1, β_2, …, β_i} and β_k the vector set {β_1, β_2, …, β_k}; b_ij equals 1 when there is a link between data i in the gene dataset and data j in the network knowledge dataset, and 0 otherwise; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise.
The first derivative of the above formula for β k is:
Assume that />Substituting into the first derivative of β k above, we can obtain:
Assume that />Substituting into the first derivative of β k above, we can obtain: /(I)
First derivative expression for beta k And (3) carrying out classification solving:
(1) When ω k < 0:
There are H (τ) =3μ2k >0 and H (0) >0, where the optimal solution for β k is β k =0.
(2) In the second case:
With H(0) > 0, the optimal solution for β_k in this case is β_k = 0.
(3) In the third case:
The expression has three roots, of which only one is a valid solution; the optimal solution for β_k is given by that root.
Combining the three cases gives the optimal solution:
(1) When β_k > 0:
(2) When β_k < 0:
In summary, the optimal solution for the k-th dimension β_k of the first feature operator can be expressed as follows:
wherein,
Where β_k represents the k-th dimension of β, β represents the shared effect that remains consistent across the D datasets, and k indexes the dimensions of β; b_ik indicates whether a link exists between data i in the gene dataset and data k in the network knowledge dataset; x_ik represents the k-th gene in the i-th sample of the gene dataset, x_ij represents the j-th gene in the i-th sample of the gene dataset, and y_i represents the true label value; ω_k, λ_{1,k}, m_{1,k}, v_i, m_{2,k} and the remaining symbols are intermediate quantities in the computation.
Based on this per-dimension solution for β_k, the coordinate descent algorithm is applied to each feature dimension of β in turn, which yields the optimal solutions of all feature dimensions of the first feature operator β.
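The coordinate-wise procedure can be sketched as follows. The patent's own closed-form update (with ω_k, m_{1,k} and m_{2,k}) is not reproduced in this text, so this sketch swaps in the standard half-thresholding operator from the L1/2 regularization literature and, as a simplifying assumption, leaves out the Laplacian term and the unique effects Δ_d. The names `half_threshold` and `coordinate_descent_l_half` are hypothetical.

```python
import numpy as np

def half_threshold(omega, lam):
    """Minimiser of (b - omega)^2 + lam * sqrt(|b|) over b (half-thresholding)."""
    threshold = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(omega) <= threshold:
        return 0.0  # small inputs are set exactly to zero
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** -1.5)
    return float((2.0 / 3.0) * omega
                 * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0)))

def coordinate_descent_l_half(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * sum_k sqrt(|b_k|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # squared norm of each column
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for k in range(p):
            if col_sq[k] == 0.0:
                continue               # an all-zero column carries no information
            # Partial residual: remove every coordinate's contribution except k.
            partial = residual + X[:, k] * beta[k]
            omega = X[:, k] @ partial / col_sq[k]
            # Dividing the one-dimensional problem by col_sq[k] rescales lam.
            beta[k] = half_threshold(omega, lam / col_sq[k])
            residual = partial - X[:, k] * beta[k]
    return beta

# With an orthonormal design each coordinate is thresholded independently.
beta_hat = coordinate_descent_l_half(np.eye(3), np.array([2.0, 0.1, -1.5]), lam=1.0)
```

With `lam = 0` the update reduces to ordinary least squares, which gives a simple sanity check; larger values of `lam` set small coordinates exactly to zero.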
S3.3.2, calculating the optimal solutions of all feature dimensions of the second feature operator delta d.
As in the calculation of step S3.3.1, the first feature operator β is fixed and the second feature operator Δ_d is separated; the optimal solution of the second feature operator Δ_d is then calculated as follows:
wherein,
m_{1,k} = 1 + λ_4 sgn(Δ_{d,k})
Where Δ_{d,k} represents the k-th dimension of Δ_d, Δ_d represents the unique effect specific to one dataset, and k indexes the dimensions of Δ_d; x_ik represents the k-th gene in the i-th sample of the gene dataset, x_ij represents the j-th gene in the i-th sample of the gene dataset, and y_i represents the true label value; λ_{2,k}, ω_k, m_{1,k}, Δ_{d,k}, m_{2,k}, r_ik and the remaining symbols are intermediate quantities in the computation.
Based on this per-dimension solution for Δ_d, the coordinate descent algorithm is applied to each feature dimension of Δ_d in turn, which yields the optimal solutions of all feature dimensions of the second feature operator Δ_d.
S4, integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain the integration result of the network knowledge dataset and the gene dataset. Step S4 includes steps S4.1 to S4.2, which are as follows:
S4.1, judging whether a link exists between the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator.
If the network knowledge data set and the gene data set are linked in any feature dimension, integrating the network knowledge data set and the gene data set in the feature dimension with the link to obtain a data integration result.
And if the network knowledge data set and the gene data set have no links in all feature dimensions, classifying the network knowledge data set and the gene data set to obtain a data classification result.
And S4.2, integrating the data integration result and the data classification result in all characteristic dimensions, and outputting the integration result of the network knowledge data set and the gene data set.
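Steps S4.1 and S4.2 can be sketched as a simple partition of the feature dimensions. The source does not state how a link is read off the optimal solutions, so this sketch assumes that dimension k counts as linked when β_k or any Δ_{d,k} is non-zero; the function name `integrate_by_dimension`, the tolerance `tol` and the dictionary keys are hypothetical.

```python
def integrate_by_dimension(beta, deltas, tol=1e-8):
    """Split feature dimensions into linked and unlinked groups.

    Assumption (not stated in the source): dimension k is treated as linked
    when beta[k] or any delta_d[k] exceeds `tol` in absolute value.
    """
    linked, unlinked = [], []
    for k in range(len(beta)):
        active = abs(beta[k]) > tol or any(abs(d[k]) > tol for d in deltas)
        (linked if active else unlinked).append(k)
    # S4.2: merge the per-dimension outcomes into one result.
    return {"integrated": linked, "classified": unlinked}

result = integrate_by_dimension(
    beta=[0.8, 0.0, 0.0],
    deltas=[[0.0, 0.0, 0.0], [0.0, -0.3, 0.0]],
)
```

Here dimensions 0 and 1 are integrated because a shared or unique effect is non-zero there, while dimension 2 falls into the classified group.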
The embodiment of the invention has the following beneficial effects:
According to the regularization model-based data integration method, a DSNet model is obtained by adding a Laplacian regularization term and an Lq norm penalty term into a prediction model to be used as constraints; and then inputting the network knowledge dataset and the gene dataset to be integrated into the DSNet model, calculating the optimal solution of the feature operator in each feature dimension, and integrating data in the feature dimension according to the obtained optimal solution. The method and the device use the regularization model to integrate and analyze the biological data and the network knowledge data, and improve the accuracy and the efficiency of high-dimensional data integration analysis.
Example 2
Based on the foregoing disclosure of the embodiments, an embodiment of the present invention provides a data integration system based on a regularization model, including: a data acquisition module 101, a penalty term setting module 102, a feature solving module 103 and a data integration module 104. The system architecture is shown in fig. 2.
The data acquisition module 101 is configured to acquire a network knowledge data set and a gene data set;
The penalty term setting module 102 is configured to set a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model, so as to obtain a DSNet model;
The feature solving module 103 is configured to convert the DSNet model to obtain a converted DSNet model, obtain a first feature operator and a second feature operator, input the network knowledge dataset and the genetic dataset into the converted DSNet model, and output an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
The data integration module 104 is configured to integrate a network knowledge dataset and a gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, so as to obtain an integration result of the network knowledge dataset and the gene dataset.
In a possible implementation, the penalty term setting module 102 is configured to set a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;
wherein the DSNet model is specifically as follows:
Where β represents the shared effect that remains consistent across the D datasets, and Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 are parameters that control model sparsity, and λ_3 and λ_4 are parameters that adjust model smoothness; x_i denotes an input matrix, y_i denotes the true output label value, and the superscript T on a matrix denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed, and r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| performs smoothing on β over the network knowledge dataset, and |Δ_d|^T L |Δ_d| performs smoothing on Δ_d over the network knowledge dataset.
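The symbol definitions above suggest how the pieces of the objective fit together, and the following sketch evaluates one plausible form. The model formula itself is not reproduced in this text, so the exact arrangement is an assumption: a squared-error loss on β + Δ_d per dataset, L1/2 penalties weighted by λ_1 and λ_2 (with r_d on the unique effects), and Laplacian smoothness terms weighted by λ_3 and λ_4. All function names are hypothetical.

```python
import numpy as np

def l_half(v):
    """Sum of square roots of absolute values (the L1/2 penalty, assumed form)."""
    return float(np.sum(np.sqrt(np.abs(v))))

def laplacian_smoothness(v, L):
    """|v|^T L |v| for a symmetric Laplacian matrix L."""
    a = np.abs(v)
    return float(a @ L @ a)

def dsnet_objective(Xs, ys, beta, deltas, L, r, lam1, lam2, lam3, lam4):
    """Assumed DSNet-style objective over D datasets (illustrative only)."""
    loss = sum(float(np.sum((y - X @ (beta + d)) ** 2))
               for X, y, d in zip(Xs, ys, deltas))
    sparsity = lam1 * l_half(beta) + lam2 * sum(
        r_d * l_half(d) for r_d, d in zip(r, deltas))
    smooth = lam3 * laplacian_smoothness(beta, L) + lam4 * sum(
        laplacian_smoothness(d, L) for d in deltas)
    return loss + sparsity + smooth

# One dataset, two features joined by a single edge.
L = np.array([[1.0, -1.0], [-1.0, 1.0]])
value = dsnet_objective(
    Xs=[np.eye(2)], ys=[np.array([1.0, 2.0])],
    beta=np.array([1.0, 2.0]), deltas=[np.zeros(2)],
    L=L, r=[1.0], lam1=1.0, lam2=0.0, lam3=2.0, lam4=0.0,
)
```

In this toy case the loss is zero, so the value consists only of the sparsity term (1 + √2) and the smoothness term (2 × 1).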
In one possible implementation manner, the feature solving module 103 has a structure as shown in fig. 3, and includes: a model conversion unit 201, a feature operator unit 202, and a multi-dimensional solving unit 203;
The model conversion unit 201 is configured to convert the DSNet model based on each feature dimension to obtain a conversion expression of the DSNet model; the conversion expression of DSNet model is as follows:
Where β represents the shared effect that remains consistent across the D datasets, and Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 are parameters that control model sparsity, and λ_3 and λ_4 are parameters that adjust model smoothness; x_i denotes an input matrix, y_i denotes the true output label value, and the superscript T on a matrix denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed, and r_d represents the weight of the unique effect of each dataset; I_i represents the degree of feature i of the gene dataset, i.e., the number of edges connected to i; I_k represents the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i represents the i-th dimension of β, β_k represents the k-th dimension of β, Δ_{d,i} represents the i-th dimension of Δ_d, and Δ_{d,k} represents the k-th dimension of Δ_d;
The feature operator unit 202 is configured to obtain a first feature operator β and a second feature operator Δ d from a conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
the multidimensional solving unit 203 is configured to input the network knowledge dataset and the genetic dataset into the converted DSNet model, and calculate an optimal solution of all feature dimensions of the first feature operator β and an optimal solution of all feature dimensions of the second feature operator Δ d by using a coordinate descent algorithm.
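The per-dimension conversion relies on rewriting the Laplacian quadratic form as a sum over linked feature pairs, with each coordinate scaled by its degree. The sketch below checks that identity numerically. It assumes the degree-normalised Laplacian, which the degree terms in the conversion expression suggest but which this text does not confirm, and it treats the link indicators b_ik as a symmetric 0/1 adjacency matrix `A`. The function names are hypothetical.

```python
import numpy as np

def normalized_laplacian(A):
    """Degree-normalised Laplacian; isolated nodes get an all-zero row."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return np.diag((deg > 0).astype(float)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def edgewise_penalty(A, v):
    """Sum over linked pairs (i, k) of (v_i / sqrt(d_i) - v_k / sqrt(d_k))^2."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    total = 0.0
    for i in range(len(v)):
        for k in range(i + 1, len(v)):
            if A[i, k]:
                total += (v[i] / np.sqrt(deg[i]) - v[k] / np.sqrt(deg[k])) ** 2
    return total

# Path graph 0-1-2 plus an isolated feature 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
abs_beta = np.abs(np.array([1.0, -2.0, 0.5, 3.0]))

matrix_form = abs_beta @ normalized_laplacian(A) @ abs_beta
edge_form = edgewise_penalty(A, abs_beta)
```

The two forms agree, so the smoothness term can be differentiated one coordinate at a time, which is what a coordinate descent solver needs.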
Further, the multi-dimensional solving unit 203 has a structure as shown in fig. 4, and includes: a sharing effector unit 301 and a unique effector unit 302;
the sharing effector unit 301 is configured to calculate an optimal solution of all feature dimensions of the first feature operator β, and specifically is as follows:
wherein,
Where β_k represents the k-th dimension of β, β represents the shared effect that remains consistent across the D datasets, and k indexes the dimensions of β; b_ik indicates whether a link exists between data i in the gene dataset and data k in the network knowledge dataset; x_ik represents the k-th gene in the i-th sample of the gene dataset, x_ij represents the j-th gene in the i-th sample of the gene dataset, and y_i represents the true label value; ω_k, λ_{1,k}, m_{1,k}, v_i, m_{2,k} and the remaining symbols are intermediate quantities in the computation;
The unique effector unit 302 is configured to calculate an optimal solution of all feature dimensions of the second feature operator Δ d, which is specifically as follows:
wherein,
m_{1,k} = 1 + λ_4 sgn(Δ_{d,k})
Where Δ_{d,k} represents the k-th dimension of Δ_d, Δ_d represents the unique effect specific to one dataset, and k indexes the dimensions of Δ_d; x_ik represents the k-th gene in the i-th sample of the gene dataset, x_ij represents the j-th gene in the i-th sample of the gene dataset, and y_i represents the true label value; λ_{2,k}, ω_k, m_{1,k}, Δ_{d,k}, m_{2,k}, r_ik and the remaining symbols are intermediate quantities in the computation.
In one possible implementation manner, the data integration module 104 has a structure as shown in fig. 5, and includes: a dimension linking unit 401, an integrating unit 402, a classifying unit 403, and a feature merging unit 404;
the dimension linking unit 401 is configured to determine whether a network knowledge dataset and a gene dataset have links in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator;
The integrating unit 402 is configured to integrate the network knowledge dataset and the gene dataset in the feature dimension where a link exists if the network knowledge dataset and the gene dataset are linked in any feature dimension, so as to obtain a data integration result;
The classifying unit 403 is configured to classify the network knowledge dataset and the gene dataset if no link exists between the network knowledge dataset and the gene dataset in all feature dimensions, so as to obtain a data classification result;
The feature merging unit 404 is configured to integrate the data integration result and the data classification result in all feature dimensions, and output an integration result of the network knowledge dataset and the gene dataset.
For the more detailed working principle and procedure of this embodiment, reference may be made to, without limitation, the description in embodiment 1.
The embodiment of the invention has the following beneficial effects:
The invention provides a regularization model-based data integration system comprising a data acquisition module, a penalty term setting module, a feature solving module and a data integration module. The penalty term setting module constructs the regularization model used for data integration; the feature solving module calculates the optimal solutions of the feature operators in all feature dimensions; and the data integration module integrates and classifies the data according to the link status of the network knowledge dataset and the gene data in each feature dimension. By using the regularization model to perform integrated analysis of biological data and network knowledge data, the invention improves the accuracy and efficiency of high-dimensional data integration analysis.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to fall within the scope of the invention.

Claims (4)

1. A regularized model-based data integration method, comprising:
Acquiring a network knowledge data set and a gene data set;
Setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model; wherein the DSNet model is specifically as follows:
Where β represents the shared effect that remains consistent across the D datasets, and Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 are parameters that control model sparsity, and λ_3 and λ_4 are parameters that adjust model smoothness; x_i denotes an input matrix, y_i denotes the true output label value, and the superscript T on a matrix denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed, and r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| performs smoothing on β over the network knowledge dataset, and |Δ_d|^T L |Δ_d| performs smoothing on Δ_d over the network knowledge dataset;
Converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting optimal solutions of all feature dimensions of the first feature operator and optimal solutions of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
And integrating a network knowledge data set and a gene data set on each characteristic dimension according to the optimal solutions of all characteristic dimensions of the first characteristic operator and the optimal solutions of all characteristic dimensions of the second characteristic operator to obtain an integration result of the network knowledge data set and the gene data set.
2. The regularization model-based data integration method of claim 1, wherein integrating the network knowledge dataset and the genetic dataset in each feature dimension based on the optimal solution for all feature dimensions of the first feature operator and the optimal solution for all feature dimensions of the second feature operator to obtain an integration result of the network knowledge dataset and the genetic dataset, comprises:
Judging whether links exist between the network knowledge data set and the gene data set in each characteristic dimension according to the optimal solutions of all the characteristic dimensions of the first characteristic operator and the optimal solutions of all the characteristic dimensions of the second characteristic operator;
If the network knowledge data set and the gene data set are linked in any one characteristic dimension, integrating the network knowledge data set and the gene data set in the characteristic dimension with the link to obtain a data integration result;
if the network knowledge data set and the gene data set are not linked in all feature dimensions, classifying the network knowledge data set and the gene data set to obtain a data classification result;
Integrating the data integration result and the data classification result in all feature dimensions, and outputting the integration result of the network knowledge data set and the gene data set.
3. A regularized model-based data integration system, comprising: a data acquisition module, a penalty term setting module, a feature solving module and a data integration module;
the data acquisition module is used for acquiring a network knowledge data set and a gene data set;
the penalty term setting module is used for setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model; wherein the DSNet model is specifically as follows:
Where β represents the shared effect that remains consistent across the D datasets, and Δ_d represents the unique effect specific to one dataset; λ_1 and λ_2 are parameters that control model sparsity, and λ_3 and λ_4 are parameters that adjust model smoothness; x_i denotes an input matrix, y_i denotes the true output label value, and the superscript T on a matrix denotes the transpose operation; ‖·‖_{1/2} denotes the L1/2 norm; D represents the number of datasets being processed, and r_d represents the weight of the unique effect of each dataset; L represents the network knowledge dataset expressed as a symmetric Laplacian matrix; |β|^T L |β| performs smoothing on β over the network knowledge dataset, and |Δ_d|^T L |Δ_d| performs smoothing on Δ_d over the network knowledge dataset;
The feature solving module is configured to convert the DSNet model to obtain a converted DSNet model, obtain a first feature operator and a second feature operator, input the network knowledge dataset and the gene dataset into the converted DSNet model, and output an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;
The data integration module is used for integrating a network knowledge data set and a gene data set on each characteristic dimension according to the optimal solutions of all characteristic dimensions of the first characteristic operator and the optimal solutions of all characteristic dimensions of the second characteristic operator to obtain an integration result of the network knowledge data set and the gene data set.
4. The regularization model-based data integration system of claim 3, wherein the data integration module comprises: the device comprises a dimension link unit, an integration unit, a classification unit and a feature merging unit;
The dimension link unit is used for judging whether the network knowledge dataset and the gene dataset are linked in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator;
the integration unit is used for integrating the network knowledge data set and the gene data set on the characteristic dimension with the link if the network knowledge data set and the gene data set are linked on any characteristic dimension, so as to obtain a data integration result;
the classifying unit is used for classifying the network knowledge data set and the gene data set to obtain a data classifying result if the network knowledge data set and the gene data set are not linked in all characteristic dimensions;
the feature merging unit is used for integrating the data integration result and the data classification result in all feature dimensions and outputting the integration result of the network knowledge dataset and the gene dataset.
CN202311789980.3A 2023-12-25 2023-12-25 Data integration method and system based on regularization model Active CN117727372B (en)

Publications (2)

Publication Number Publication Date
CN117727372A CN117727372A (en) 2024-03-19
CN117727372B true CN117727372B (en) 2024-05-17

Family

ID=90199728

Country Status (1)

Country Link
CN (1) CN117727372B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686399A (en) * 2018-12-13 2019-04-26 韶关学院 A kind of gene data collection confluence analysis method
CN113838519A (en) * 2021-08-20 2021-12-24 河南大学 Gene selection method and system based on adaptive gene interaction regularization elastic network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515724B2 (en) * 2016-05-10 2019-12-24 Macau University Of Science And Technolog Method and system for determining an association of biological feature with medical condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant