CN117727372B

CN117727372B - Data integration method and system based on regularization model

Info

Publication number: CN117727372B
Application number: CN202311789980.3A
Authority: CN
Inventors: 黄海辉
Original assignee: Shaoguan University
Current assignee: Shaoguan University
Priority date: 2023-12-25
Filing date: 2023-12-25
Publication date: 2024-05-17
Anticipated expiration: 2043-12-25
Also published as: CN117727372A

Abstract

The invention provides a data integration method and system based on a regularization model. The method comprises: setting a Laplace regularization term and an Lq norm penalty term in a preset prediction model to obtain a DSNet model; converting the DSNet model to obtain a converted DSNet model, and obtaining a first feature operator and a second feature operator; inputting a network knowledge dataset and a gene dataset into the converted DSNet model, and outputting the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator; and integrating the network knowledge dataset and the gene dataset in each feature dimension according to these optimal solutions to obtain an integration result of the network knowledge dataset and the gene dataset. By using the regularization model to jointly analyze biological data and network knowledge data, the invention improves the accuracy and efficiency of integrated analysis of high-dimensional data.

Description

Data integration method and system based on regularization model

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data integration method and system based on a regularization model.

Background

With the rapid development of modern high-throughput biomedical instruments, a large amount of data has accumulated in the life sciences. For example, the Gene Expression Omnibus (GEO) has collected more than 3.4 million samples. Given this volume, identifying robust gene biomarkers that are relevant to the onset and progression of particular diseases from such a vast data pool is a great challenge.

When analyzing gene expression data with machine learning techniques, researchers often face three major problems: "large p, small n", data heterogeneity, and low reproducibility. In the prior art, the main approach to these three problems is to process multiple gene datasets jointly through meta-analysis or integrated analysis, which improves statistical performance in genomic studies. Typical gene dataset integration methods include meta threshold gradient descent regularization, meta-lasso, meta-nonconvex optimization, Data Shared Lasso (DSL), DSL2, and the like.

However, the typical gene dataset integration methods described above do not make full use of external biological knowledge such as gene-gene or protein-protein interaction networks, which limits their performance. In integrated analysis of both a gene dataset and an external network knowledge dataset, an L1 penalty term can be applied to handle prior network knowledge in various models. However, the L1 penalty term tends to over-shrink model coefficients, creating additional bias, especially when handling high-dimensional data such as genomic data. The Lq penalty term (0 < q < 1) can theoretically provide better sparsity, computational efficiency, and computational accuracy than the L1 penalty term.

Disclosure of Invention

The invention provides a regularization model-based data integration method and a regularization model-based data integration system, which are used for integrating and analyzing biological data and network knowledge data in a regularization model, so that accuracy and efficiency of high-dimensional data integration analysis are improved.

In order to solve the technical problems, the invention provides a regularization model-based data integration method, which comprises the following steps:

Acquiring a network knowledge data set and a gene data set;

Setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;

Converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting optimal solutions of all feature dimensions of the first feature operator and optimal solutions of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;

Integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain an integration result of the network knowledge dataset and the gene dataset.

In this method, a Laplace regularization term and an Lq norm penalty term (with q = 1/2) are added to a prediction model as constraints to obtain the DSNet model; the network knowledge dataset and the gene dataset to be integrated are then input into the DSNet model, the optimal solution of each feature operator is calculated in each feature dimension, and the data are integrated in each feature dimension according to the optimal solutions obtained. The invention thus uses a regularization model to jointly analyze biological data and network knowledge data, improving the accuracy and efficiency of integrated analysis of high-dimensional data.

Further, setting a laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model includes:

wherein the DSNet model is specifically as follows:

where β denotes the shared effect that remains consistent across the D datasets, Δ_d denotes the unique effect specific to dataset d, λ₁ and λ₂ denote parameters that control model sparsity, and λ₃ and λ₄ denote parameters that adjust model smoothness; x_i denotes the input vector of the i-th observation, y_i denotes the true label value, and the superscript T denotes transposition; ‖·‖_1/2 denotes the L1/2 norm; D denotes the number of datasets being processed, and r_d denotes the weight of the unique effect of each dataset; L denotes the network knowledge dataset represented as a symmetric Laplacian matrix; |β|^T L |β| denotes smoothing β over the network knowledge dataset, and |Δ_d|^T L |Δ_d| denotes smoothing Δ_d over the network knowledge dataset.
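The model formula itself is not reproduced in this text. Purely as an orientation, one objective that is consistent with the symbol definitions above could be written as follows. The squared-error loss, the factor 1/2, the placement of r_d, and the use of N (number of observations) and d_i (dataset index of observation i, as introduced in Example 1) are assumptions, not the formula of the original filing.

```latex
\min_{\beta,\,\Delta_1,\dots,\Delta_D}\;
\frac{1}{2}\sum_{i=1}^{N}\bigl(y_i - x_i^{T}(\beta + \Delta_{d_i})\bigr)^{2}
+ \lambda_1 \lVert \beta \rVert_{1/2}
+ \lambda_2 \sum_{d=1}^{D} r_d \,\lVert \Delta_d \rVert_{1/2}
+ \lambda_3\, \lvert\beta\rvert^{T} L\, \lvert\beta\rvert
+ \lambda_4 \sum_{d=1}^{D} \lvert\Delta_d\rvert^{T} L\, \lvert\Delta_d\rvert
```

In this reading, the first term fits the data with a shared part β plus a dataset-specific part Δ_d, the two L1/2 terms enforce sparsity, and the two Laplacian terms enforce smoothness over the network.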

The invention integrates the gene dataset and the network knowledge dataset using the regularized DSNet model, and sets regularization terms in the DSNet model to improve the generalization ability of the model and prevent overfitting. The integrated analysis of the gene dataset and the network knowledge dataset thereby advances the meta-analysis of the gene dataset and improves both model performance and the accuracy of the data integration result.

Further, the converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator, including:

converting the DSNet model based on each feature dimension to obtain a conversion expression of the DSNet model; the conversion expression of the DSNet model is as follows:

where β denotes the shared effect that remains consistent across the D datasets, Δ_d denotes the unique effect specific to dataset d, λ₁ and λ₂ denote parameters that control model sparsity, and λ₃ and λ₄ denote parameters that adjust model smoothness; x_i denotes the input vector of the i-th observation, y_i denotes the true label value, and the superscript T denotes transposition; ‖·‖_1/2 denotes the L1/2 norm; D denotes the number of datasets being processed, and r_d denotes the weight of the unique effect of each dataset; i_i denotes the degree of feature i of the gene dataset, i.e., the number of edges connected to i; i_k denotes the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i denotes the i-th dimension of β, β_k denotes the k-th dimension of β, Δ_d,i denotes the i-th dimension of Δ_d, and Δ_d,k denotes the k-th dimension of Δ_d;

obtaining a first feature operator β and a second feature operator Δ_d from the conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;

inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and calculating the optimal solutions of all feature dimensions of the first feature operator β and the optimal solutions of all feature dimensions of the second feature operator Δ_d by using a coordinate descent algorithm.

Further, calculating the optimal solutions of all feature dimensions of the first feature operator β and the optimal solutions of all feature dimensions of the second feature operator Δ_d by using a coordinate descent algorithm includes:

calculating the optimal solution of all feature dimensions of the first feature operator beta, wherein the optimal solution is specifically as follows:

wherein,

where β_k denotes the k-th dimension of β, β denotes the shared effect that remains consistent across the D datasets, and k indexes the dimensions of β; b_ik denotes the link status between data i in the gene dataset and data k in the network knowledge dataset, x_ik denotes the k-th gene in the i-th sample of the gene dataset, x_ij denotes the j-th gene in the i-th sample of the gene dataset, and y_i denotes the true label value; ω_k, λ_1,k, m_1,k, ν_i, m_2,k and the remaining auxiliary symbols denote intermediate quantities in the computation;

calculating the optimal solutions of all feature dimensions of the second feature operator Δ_d, specifically as follows:

wherein,

m_1,k = 1 + λ₄ sgn(Δ_d,k)

where Δ_d,k denotes the k-th dimension of Δ_d, Δ_d denotes the unique effect specific to dataset d, and k indexes the dimensions of Δ_d; x_ik denotes the k-th gene in the i-th sample of the gene dataset, x_ij denotes the j-th gene in the i-th sample of the gene dataset, and y_i denotes the true label value; λ_2,k, ω_k, m_1,k, m_2,k, r_ik and the remaining auxiliary symbols denote intermediate quantities in the computation.

By calculating, in all feature dimensions, the optimal solutions of the feature operator representing the data sharing effect and of the feature operator representing the data unique effect, the invention balances the unified model of all datasets against the individual model of each dataset, improving reliability and accuracy; it emphasizes the uniqueness of each dataset while using the commonality among datasets to perform comprehensive and accurate data integration analysis.

Further, integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain an integration result of the network knowledge dataset and the gene dataset, includes:

judging whether a link exists between the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator;

if the network knowledge dataset and the gene dataset are linked in any feature dimension, integrating the network knowledge dataset and the gene dataset in the feature dimension where the link exists to obtain a data integration result;

if the network knowledge dataset and the gene dataset have no link in any feature dimension, classifying the network knowledge dataset and the gene dataset to obtain a data classification result;

merging the data integration result and the data classification result over all feature dimensions, and outputting the integration result of the network knowledge dataset and the gene dataset.

According to the link status of the gene dataset and the network knowledge dataset in each dimension, the invention analyzes their homogeneity and heterogeneity and then performs integration and classification in the feature dimensions, which promotes stable and reliable meta-analysis of gene data while taking data homogeneity, heterogeneity, and prior external knowledge into account.

Corresponding to the method embodiment above, the invention further provides a system embodiment: a data integration system based on a regularization model, comprising a data acquisition module, a penalty term setting module, a feature solving module, and a data integration module;

the data acquisition module is used for acquiring a network knowledge data set and a gene data set;

the penalty term setting module is used for setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;

The feature solving module is configured to convert the DSNet model to obtain a converted DSNet model, obtain a first feature operator and a second feature operator, input the network knowledge dataset and the gene dataset into the converted DSNet model, and output an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;

The data integration module is used for integrating a network knowledge data set and a gene data set on each characteristic dimension according to the optimal solutions of all characteristic dimensions of the first characteristic operator and the optimal solutions of all characteristic dimensions of the second characteristic operator to obtain an integration result of the network knowledge data set and the gene data set.

Further, the penalty term setting module is configured to set a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model;

wherein the DSNet model is specifically as follows:

Further, the feature solving module includes: the system comprises a model conversion unit, a characteristic operator unit and a multidimensional solving unit;

The model conversion unit is used for converting the DSNet model based on each feature dimension to obtain a conversion expression of the DSNet model; the conversion expression of the DSNet model is as follows:

The feature operator unit is used for obtaining a first feature operator β and a second feature operator Δ_d from the conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;

The multidimensional solving unit is used for inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and calculating the optimal solutions of all feature dimensions of the first feature operator β and the optimal solutions of all feature dimensions of the second feature operator Δ_d by using a coordinate descent algorithm.

Further, the multidimensional solving unit includes a shared-effect subunit and a unique-effect subunit;

the shared-effect subunit is configured to calculate the optimal solutions of all feature dimensions of the first feature operator β, specifically as follows:

wherein,

the unique-effect subunit is configured to calculate the optimal solutions of all feature dimensions of the second feature operator Δ_d, specifically as follows:

wherein,

m_1,k＝1+λ₄sgn(Δ_d,k)

Further, the data integration module includes a dimension link unit, an integration unit, a classification unit, and a feature merging unit;

the dimension link unit is used for judging whether the network knowledge dataset and the gene dataset are linked in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator;

the integration unit is used for, if the network knowledge dataset and the gene dataset are linked in any feature dimension, integrating the network knowledge dataset and the gene dataset in the feature dimension where the link exists to obtain a data integration result;

the classification unit is used for, if the network knowledge dataset and the gene dataset have no link in any feature dimension, classifying the network knowledge dataset and the gene dataset to obtain a data classification result;

the feature merging unit is used for merging the data integration result and the data classification result over all feature dimensions and outputting the integration result of the network knowledge dataset and the gene dataset.

Drawings

Fig. 1 is a flow chart of one embodiment of the regularization model-based data integration method provided by the invention;

Fig. 2 is a module structure diagram of one embodiment of the regularization model-based data integration system provided by the invention;

Fig. 3 is a module structure diagram of one embodiment of the feature solving module provided by the invention;

Fig. 4 is a module structure diagram of one embodiment of the multidimensional solving unit provided by the invention;

Fig. 5 is a module structure diagram of one embodiment of the data integration module provided by the invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

In the description of the present application, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first", "second", or "third" may explicitly or implicitly include one or more such features. In the description of the application, unless otherwise indicated, "a plurality" means two or more.

Example 1

Based on the above requirements, an embodiment of the present invention provides a regularization model-based data integration method, where the flow of the method is shown in fig. 1, and the method includes steps S1 to S4, and the steps are specifically as follows:

S1, acquiring a network knowledge data set and a gene data set.

Relevant network knowledge data and gene data are acquired from a big data platform, and the acquired data are normalized to obtain the network knowledge dataset and the gene dataset.

Based on the network knowledge dataset and the gene dataset described above, assume the data contain N observations, each of the form (x_i, y_i, d_i), where x_i ∈ R^p and p denotes the number of genes in the gene dataset; y_i denotes the true label, y_i ∈ {0, 1}; d_i ∈ {1, 2, …, D}, and D denotes the number of datasets being processed. Let the x_i form the rows of a matrix X, and let y = (y₁, y₂, ..., y_N) and d = (d₁, d₂, ..., d_N).
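The observation layout described above can be illustrated with a small sketch. The class name `Observation`, the helper `stack`, and the sample values are hypothetical and serve only to show how the triples (x_i, y_i, d_i) map to X, y, and d.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    x: Tuple[float, ...]  # gene expression vector of length p
    y: int                # true label, 0 or 1
    d: int                # index of the source dataset, 1..D


def stack(observations: Sequence[Observation]):
    """Return (X, y, d), where X holds one observation per row."""
    p = len(observations[0].x)
    for obs in observations:
        if len(obs.x) != p:
            raise ValueError("all observations must have p features")
        if obs.y not in (0, 1):
            raise ValueError("labels must be 0 or 1")
    X: List[List[float]] = [list(obs.x) for obs in observations]
    y: List[int] = [obs.y for obs in observations]
    d: List[int] = [obs.d for obs in observations]
    return X, y, d


# Made-up example: N = 3 observations, p = 2 genes, D = 2 datasets.
observations = [
    Observation(x=(0.1, 1.2), y=1, d=1),
    Observation(x=(0.4, 0.3), y=0, d=1),
    Observation(x=(0.9, 0.5), y=1, d=2),
]
X, y, d = stack(observations)
D = len(set(d))
```

Keeping d alongside X and y is what later allows the shared effect to be fitted on all rows and each unique effect only on the rows of its own dataset.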

S2, setting a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model.

A prediction model oriented to the network knowledge dataset and the gene dataset is constructed, and a Laplace regularization term and an L1/2 norm penalty term are set in this preset prediction model to obtain the DSNet model.

The DSNet model is specifically as follows:

S3, converting the DSNet model to obtain a converted DSNet model, obtaining a first feature operator and a second feature operator, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and outputting optimal solutions of all feature dimensions of the first feature operator and optimal solutions of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used to represent a data unique effect. Step S3 includes steps S3.1 to S3.3, as follows:

S3.1, converting the DSNet model based on each feature dimension to obtain a conversion expression of the DSNet model.

The conversion expression of the DSNet model is as follows:

where β denotes the shared effect that remains consistent across the D datasets, Δ_d denotes the unique effect specific to dataset d, λ₁ and λ₂ denote parameters that control model sparsity, and λ₃ and λ₄ denote parameters that adjust model smoothness; x_i denotes the input vector of the i-th observation, y_i denotes the true label value, and the superscript T denotes transposition; ‖·‖_1/2 denotes the L1/2 norm; D denotes the number of datasets being processed, and r_d denotes the weight of the unique effect of each dataset; i_i denotes the degree of feature i of the gene dataset, i.e., the number of edges connected to i; i_k denotes the degree of feature k of the network knowledge dataset, i.e., the number of edges connected to k; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise; β_i denotes the i-th dimension of β, β_k denotes the k-th dimension of β, Δ_d,i denotes the i-th dimension of Δ_d, and Δ_d,k denotes the k-th dimension of Δ_d.
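The conversion expression is described through node degrees and link indicators b_ik. For a symmetric normalized Laplacian, the quadratic form v^T L v expands into a sum over linked pairs of (v_i/√deg_i − v_k/√deg_k)², which is one way such a per-dimension expression can arise. The sketch below checks this identity numerically on a hypothetical four-node network. That the method uses the normalized Laplacian is an assumption, and the function names and sample values are illustrative only.

```python
import math
from typing import List


def normalized_laplacian(b: List[List[int]]) -> List[List[float]]:
    """L = I - D^(-1/2) B D^(-1/2) for a symmetric 0/1 adjacency matrix b
    in which every node has at least one edge."""
    n = len(b)
    deg = [sum(row) for row in b]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if i == k:
                lap[i][k] = 1.0
            elif b[i][k]:
                lap[i][k] = -1.0 / math.sqrt(deg[i] * deg[k])
    return lap


def quadratic_form(lap: List[List[float]], v: List[float]) -> float:
    """Compute v^T L v."""
    n = len(v)
    return sum(v[i] * lap[i][k] * v[k] for i in range(n) for k in range(n))


def pairwise_form(b: List[List[int]], v: List[float]) -> float:
    """Sum over linked pairs i < k of (v_i/sqrt(deg_i) - v_k/sqrt(deg_k))^2."""
    n = len(v)
    deg = [sum(row) for row in b]
    total = 0.0
    for i in range(n):
        for k in range(i + 1, n):
            if b[i][k]:
                total += (v[i] / math.sqrt(deg[i]) - v[k] / math.sqrt(deg[k])) ** 2
    return total


# Hypothetical network with edges 0-1, 0-2, 1-2, 2-3.
b = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
beta = [0.5, -1.0, 0.0, 2.0]
v = [abs(x) for x in beta]  # the smoothness term acts on |beta|
L = normalized_laplacian(b)
print(quadratic_form(L, v), pairwise_form(b, v))
```

Both printed values agree up to rounding, so the matrix form and the pairwise form can be used interchangeably when the update for a single coordinate is derived.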

S3.2, obtaining a first feature operator β and a second feature operator Δ_d from the conversion expression of the DSNet model; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used to represent a data unique effect.

Based on the conversion expression of the DSNet model in step S3.1, a first feature operator β representing the data sharing effect and a second feature operator Δ_d representing the data unique effect are obtained.

S3.3, inputting the network knowledge dataset and the gene dataset into the converted DSNet model, and calculating the optimal solutions of all feature dimensions of the first feature operator β and the optimal solutions of all feature dimensions of the second feature operator Δ_d by using a coordinate descent algorithm.

The network knowledge dataset and the gene dataset from step S1 are input into the converted DSNet model, and the optimal solutions of all feature dimensions of the first feature operator β and of the second feature operator Δ_d are calculated using a coordinate descent algorithm, yielding the link status of the network knowledge dataset and the gene data in all feature dimensions. In the solution process for the feature operators, |β_i| is approximated as sgn(β̃_i)β_i, where the initial estimate β̃_i is obtained using standard linear regression or Lasso; similarly, |Δ_d,i| is approximated as sgn(Δ̃_d,i)Δ_d,i, where the initial estimate Δ̃_d,i is obtained using standard linear regression or Lasso.

Step S3.3 includes steps S3.3.1 to S3.3.2, each of which is specifically as follows:

S3.3.1, calculating the optimal solutions of all feature dimensions of the first feature operator β.

Fixing the second feature operator Δ_d and separating the first feature operator β, the DSNet model is expressed as:

where β_i denotes the i-th dimension of β, x_ij denotes the j-th gene in the i-th sample of the gene dataset, β_k denotes the k-th dimension of β, and x_ik denotes the k-th gene in the i-th sample of the gene dataset; the vector set β_i denotes {β₁, β₂, …, β_i}, and the vector set β_k denotes {β₁, β₂, …, β_k}; b_ij equals 1 when there is a link between data i in the gene dataset and data j in the network knowledge dataset, and 0 otherwise; b_ik equals 1 when there is a link between data i in the gene dataset and data k in the network knowledge dataset, and 0 otherwise.

The first derivative of the above expression with respect to β_k is:

Introducing an intermediate quantity and substituting it into the first derivative with respect to β_k gives:

Introducing a further intermediate quantity and substituting it into the first derivative with respect to β_k gives:

The resulting first-derivative expression H(μ) for β_k is solved case by case:

(1) When ω_k < 0:

H′(μ) = 3μ² − ω_k > 0 and H(0) > 0, so the optimal solution for β_k is β_k = 0.

(2) When ω_k ≥ 0 and ω_k does not exceed the threshold:

H(0) > 0 and H remains non-negative at its minimum point, so the optimal solution for β_k is β_k = 0.

(3) When ω_k exceeds the threshold:

The equation H(μ) = 0 has three roots, of which only one is a valid solution; the optimal solution for β_k is obtained from that root.
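The case analysis above follows the pattern of the well-known half-thresholding rule for a scalar L1/2-penalized problem. As an illustration only, and not the exact update of this method (the intermediate quantities λ_1,k, m_1,k and m_2,k are not reproduced here), the sketch below minimizes the generic objective (b − ω)² + λ·|b|^(1/2). For b > 0, substituting μ = √b turns the stationarity condition into the cubic H(μ) = μ³ − ωμ + λ/4 = 0 with H′(μ) = 3μ² − ω, which has the same shape as the expression in case (1). The function name `half_threshold` and the scaling of λ are assumptions.

```python
import math


def half_threshold(omega: float, lam: float) -> float:
    """Minimise (b - omega)**2 + lam * sqrt(|b|) over real b, for lam >= 0."""
    a = abs(omega)
    # Cases (1) and (2): the cubic has no usable positive root, so b = 0.
    if a <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    # Case (3): three real roots; the largest root mu gives the local
    # minimiser b = mu**2, written here in closed trigonometric form.
    ratio = min(1.0, (lam / 8.0) * (a / 3.0) ** -1.5)
    phi = math.acos(ratio)
    cand = (2.0 / 3.0) * a * (1.0 + math.cos(2.0 * math.pi / 3.0 - 2.0 * phi / 3.0))
    # The stationary point is only a local minimiser, so compare it with b = 0.
    f_cand = (cand - a) ** 2 + lam * math.sqrt(cand)
    f_zero = a ** 2
    best = cand if f_cand < f_zero else 0.0
    return math.copysign(best, omega)


print(half_threshold(2.0, 1.0))   # large input: shrunk, but kept non-zero
print(half_threshold(0.5, 1.0))   # small input: set exactly to zero
```

The explicit comparison with b = 0 is a design choice of this sketch: it keeps the returned value a global minimiser of the scalar problem even for inputs just above the point where the three real roots first appear.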

Combining the three cases, the optimal solution is obtained as follows:

(1) When β_k > 0:

(2) When β_k < 0:

In summary, the optimal solution for the first feature operator β_k can be expressed as follows:

wherein,

where β_k denotes the k-th dimension of β, β denotes the shared effect that remains consistent across the D datasets, and k indexes the dimensions of β; b_ik denotes the link status between data i in the gene dataset and data k in the network knowledge dataset, x_ik denotes the k-th gene in the i-th sample of the gene dataset, x_ij denotes the j-th gene in the i-th sample of the gene dataset, and y_i denotes the true label value; ω_k, λ_1,k, m_1,k, ν_i, m_2,k and the remaining auxiliary symbols denote intermediate quantities in the computation.

Given the optimal solution for a single coordinate β_k, the coordinate descent algorithm can be applied to solve each feature dimension of β in turn, thereby obtaining the optimal solutions of all feature dimensions of the first feature operator β.

S3.3.2, calculating the optimal solutions of all feature dimensions of the second feature operator Δ_d.

Similarly to the calculation of the optimal solutions of all feature dimensions of the first feature operator β in step S3.3.1, the first feature operator β is fixed, the second feature operator Δ_d is separated, and the optimal solution of the second feature operator Δ_d is calculated as follows:

wherein,

m_1,k = 1 + λ₄ sgn(Δ_d,k)

Given the optimal solution for a single coordinate of Δ_d, the coordinate descent algorithm can be applied to solve each feature dimension of Δ_d in turn, thereby obtaining the optimal solutions of all feature dimensions of the second feature operator Δ_d.
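The alternation between the shared effect and the unique effects can be sketched as a cyclic coordinate descent. The sketch below is a simplified illustration under stated assumptions: a squared-error loss, L1/2 penalties only (the Laplacian smoothness terms and the weights r_d are omitted), and 0-based dataset indices. All function names and the sample data are hypothetical, and the scalar update is the generic half-thresholding rule rather than the exact update of this method.

```python
import math
from typing import List


def half_threshold(omega: float, lam: float) -> float:
    """Minimise (b - omega)**2 + lam * sqrt(|b|) over real b, for lam >= 0."""
    a = abs(omega)
    if a <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    phi = math.acos(min(1.0, (lam / 8.0) * (a / 3.0) ** -1.5))
    cand = (2.0 / 3.0) * a * (1.0 + math.cos(2.0 * math.pi / 3.0 - 2.0 * phi / 3.0))
    keep = (cand - a) ** 2 + lam * math.sqrt(cand) < a ** 2
    return math.copysign(cand if keep else 0.0, omega)


def objective(X, y, d, beta, delta, lam1, lam2) -> float:
    """Squared-error loss plus L1/2 penalties on beta and on every delta[g]."""
    loss = 0.0
    for xi, yi, di in zip(X, y, d):
        pred = sum(x * (b + t) for x, b, t in zip(xi, beta, delta[di]))
        loss += 0.5 * (yi - pred) ** 2
    penalty = lam1 * sum(math.sqrt(abs(b)) for b in beta)
    penalty += lam2 * sum(math.sqrt(abs(t)) for row in delta for t in row)
    return loss + penalty


def sweep(X, y, d, beta, delta, lam1, lam2) -> None:
    """One in-place cycle: every coordinate of beta, then of each delta[g]."""
    n, p = len(X), len(beta)
    resid = [y[i] - sum(X[i][k] * (beta[k] + delta[d[i]][k]) for k in range(p))
             for i in range(n)]
    # beta is fitted on all rows; delta[g] only on the rows of dataset g.
    groups = [list(range(n))] + [[i for i in range(n) if d[i] == g]
                                 for g in range(len(delta))]
    targets = [(beta, lam1)] + [(row, lam2) for row in delta]
    for rows, (coef, lam) in zip(groups, targets):
        for k in range(p):
            a = sum(X[i][k] ** 2 for i in rows)
            if a == 0.0:
                continue
            # z uses the partial residual with coordinate k removed.
            z = sum(X[i][k] * (resid[i] + X[i][k] * coef[k]) for i in rows)
            new = half_threshold(z / a, 2.0 * lam / a)
            for i in rows:
                resid[i] -= X[i][k] * (new - coef[k])
            coef[k] = new


# Made-up data: 6 observations, 3 genes, 2 datasets (indices 0 and 1).
X = [[1.0, 0.5, 0.0], [0.8, 0.1, 0.3], [0.2, 1.0, 0.4],
     [0.9, 0.2, 0.1], [0.1, 0.9, 0.7], [0.3, 0.8, 0.2]]
y = [1, 1, 0, 1, 0, 0]
d = [0, 0, 0, 1, 1, 1]
beta = [0.0, 0.0, 0.0]
delta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
history = [objective(X, y, d, beta, delta, 0.05, 0.05)]
for _ in range(20):
    sweep(X, y, d, beta, delta, 0.05, 0.05)
    history.append(objective(X, y, d, beta, delta, 0.05, 0.05))
print(beta, delta)
```

Because each coordinate update exactly minimizes the objective in that coordinate, the recorded objective values do not increase from one sweep to the next, which is a convenient sanity check for any implementation of this scheme.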

S4, integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain an integration result of the network knowledge dataset and the gene dataset. Step S4 includes steps S4.1 to S4.2, as follows:

S4.1, judging whether a link exists between the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator.

If the network knowledge dataset and the gene dataset are linked in any feature dimension, the network knowledge dataset and the gene dataset are integrated in the feature dimension where the link exists to obtain a data integration result.

If the network knowledge dataset and the gene dataset have no link in any feature dimension, the network knowledge dataset and the gene dataset are classified to obtain a data classification result.

S4.2, merging the data integration result and the data classification result over all feature dimensions, and outputting the integration result of the network knowledge dataset and the gene dataset.
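The text does not spell out how a link in a feature dimension is read off from the optimal solutions. One possible reading, given here strictly as an assumption, is that a dimension counts as linked when its shared effect β_k is non-zero, and that the remaining dimensions are classified by which datasets have a non-zero unique effect Δ_d,k. The function name `split_dimensions`, the tolerance, and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def split_dimensions(
    beta: Sequence[float],
    delta: Sequence[Sequence[float]],
    tol: float = 1e-8,
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return (integrated, classified).

    integrated: dimensions whose shared effect is non-zero (assumed 'linked').
    classified: for every other dimension, the datasets (0-based) whose
                unique effect is non-zero in that dimension.
    """
    integrated = [k for k, b in enumerate(beta) if abs(b) > tol]
    classified: Dict[int, List[int]] = {}
    for k, b in enumerate(beta):
        if abs(b) > tol:
            continue
        classified[k] = [g for g, row in enumerate(delta) if abs(row[k]) > tol]
    return integrated, classified


# Made-up solutions for p = 3 dimensions and D = 2 datasets.
beta = [0.7, 0.0, 0.0]
delta = [[0.0, 0.2, 0.0], [0.1, 0.0, 0.0]]
integrated, classified = split_dimensions(beta, delta)
print(integrated, classified)
```

Merging the two outputs over all dimensions then corresponds to step S4.2 under this reading.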

The embodiment of the invention has the following beneficial effects:

In the regularization model-based data integration method provided by this embodiment, a Laplace regularization term and an Lq norm penalty term (with q = 1/2) are added to a prediction model as constraints to obtain the DSNet model; the network knowledge dataset and the gene dataset to be integrated are then input into the DSNet model, the optimal solution of each feature operator is calculated in each feature dimension, and the data are integrated in each feature dimension according to the optimal solutions obtained. The invention thus uses a regularization model to jointly analyze biological data and network knowledge data, improving the accuracy and efficiency of integrated analysis of high-dimensional data.

Example 2

Based on the foregoing disclosure of the embodiments, an embodiment of the present invention provides a data integration system based on a regularization model, including: a data acquisition module 101, a penalty term setting module 102, a feature solving module 103 and a data integration module 104. The system architecture is shown in fig. 2.

The data acquisition module 101 is configured to acquire a network knowledge data set and a gene data set;

The penalty term setting module 102 is configured to set a Laplace regularization term and an L1/2 norm penalty term in a preset prediction model, so as to obtain a DSNet model;

The feature solving module 103 is configured to convert the DSNet model to obtain a converted DSNet model, obtain a first feature operator and a second feature operator, input the network knowledge dataset and the genetic dataset into the converted DSNet model, and output an optimal solution of all feature dimensions of the first feature operator and an optimal solution of all feature dimensions of the second feature operator; wherein the first feature operator is used for representing a data sharing effect; the second feature operator is used for representing a data unique effect;

The data integration module 104 is configured to integrate a network knowledge dataset and a gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, so as to obtain an integration result of the network knowledge dataset and the gene dataset.
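The Laplacian regularization term set by the penalty term setting module 102 can be illustrated with a minimal sketch. It assumes the network knowledge dataset is available as a weighted edge list over gene indices and that the unnormalized Laplacian (degree matrix minus adjacency matrix) is used; the function names and the edge format are hypothetical and are not taken from the patent. The sketch evaluates the quadratic form on |β|, as in the smoothing terms described in the claims.

```python
def laplacian_from_edges(num_genes, edges):
    """Build the unnormalized graph Laplacian (degree minus adjacency).

    edges: iterable of (u, v, weight) with 0-based gene indices.
    This edge-list format is an assumption made for illustration.
    """
    L = [[0.0] * num_genes for _ in range(num_genes)]
    for u, v, w in edges:
        if u == v:
            continue  # self-loops cancel out in degree minus adjacency
        L[u][v] -= w
        L[v][u] -= w
        L[u][u] += w
        L[v][v] += w
    return L


def smoothness_penalty(beta, L):
    """Evaluate the quadratic form b^T L b on b = |beta|."""
    b = [abs(x) for x in beta]
    p = len(b)
    return sum(b[i] * L[i][j] * b[j] for i in range(p) for j in range(p))


# Chain network 0 - 1 - 2 plus one isolated gene (index 3).
edges = [(0, 1, 1.0), (1, 2, 1.0)]
L = laplacian_from_edges(4, edges)

# Coefficients that agree in magnitude along the edges incur no penalty.
smooth = smoothness_penalty([0.5, 0.5, 0.5, 0.0], L)
# Coefficients whose magnitudes differ along the edges are penalized.
rough = smoothness_penalty([0.5, -2.0, 0.1, 0.0], L)
print(smooth, rough)
```

For a symmetric Laplacian built this way, the quadratic form equals the weighted sum of squared differences between the magnitudes of coefficients joined by an edge, which is why the term favors coefficients that vary smoothly over the network.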

In one possible implementation manner, the penalty term setting module 102 is configured to set a Laplacian regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model, wherein the DSNet model is specifically as follows:
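The expression itself is not reproduced in this text. As a hedged sketch only, an objective consistent with the symbol definitions given in claim 1 could be written as below. The squared-error loss, the per-dataset sample count n_d, the double index on x and y, and the placement of the weight r_d are assumptions, not the patent's formula.

```latex
\min_{\beta,\,\Delta_1,\dots,\Delta_D}\;
\sum_{d=1}^{D}\sum_{i=1}^{n_d}
  \bigl(y_{d,i}-x_{d,i}^{T}(\beta+\Delta_d)\bigr)^{2}
+\lambda_1\lVert\beta\rVert_{1/2}^{1/2}
+\lambda_2\sum_{d=1}^{D} r_d\,\lVert\Delta_d\rVert_{1/2}^{1/2}
+\lambda_3\,|\beta|^{T}L\,|\beta|
+\lambda_4\sum_{d=1}^{D}|\Delta_d|^{T}L\,|\Delta_d|
```

In this reading, λ₁ and λ₂ weight the sparsity-inducing L1/2 terms and λ₃ and λ₄ weight the Laplacian smoothing terms, matching the roles assigned to them in claim 1.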

In one possible implementation manner, the feature solving module 103 has a structure as shown in fig. 3, and includes: a model conversion unit 201, a feature operator unit 202, and a multi-dimensional solving unit 203;

The model conversion unit 201 is configured to convert the DSNet model on the basis of each feature dimension to obtain a conversion expression of the DSNet model; the conversion expression of the DSNet model is as follows:

The feature operator unit 202 is configured to obtain a first feature operator β and a second feature operator Δ_d from the conversion expression of the DSNet model; wherein the first feature operator represents the effect shared across the data, and the second feature operator represents the effect unique to the data;

The multi-dimensional solving unit 203 is configured to input the network knowledge dataset and the gene dataset into the converted DSNet model, and to calculate the optimal solutions of all feature dimensions of the first feature operator β and the optimal solutions of all feature dimensions of the second feature operator Δ_d by using a coordinate descent algorithm.

Further, the multi-dimensional solving unit 203 has a structure as shown in fig. 4, and includes: a shared effect unit 301 and a unique effect unit 302;

The shared effect unit 301 is configured to calculate the optimal solutions of all feature dimensions of the first feature operator β, specifically as follows:

wherein,

where β_k denotes the k-th dimension of β; β denotes the shared effect that remains consistent across the D datasets; k indexes the dimensions of β; b_ik denotes the link between data item i in the gene dataset and data item k in the network knowledge dataset; x_ik and x_ij denote the k-th gene and the j-th gene, respectively, of the i-th sample in the gene dataset; y_i denotes the true label value; and ω_k, λ_1,k, m_1,k, v_i and m_2,k denote intermediate quantities in the calculation;

The unique effect unit 302 is configured to calculate the optimal solutions of all feature dimensions of the second feature operator Δ_d, specifically as follows:

wherein,

m_1,k = 1 + λ₄ sgn(Δ_d,k)
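The coordinate descent solution performed by the multi-dimensional solving unit 203 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's update formulas, which are not reproduced in this text: only a single dataset and the shared operator β are handled, the loss is squared error scaled by 1/(2n), the smoothing term is β^T L β without absolute values, and each coordinate is updated with the half-thresholding operator commonly used for L1/2 penalties. All function and parameter names are hypothetical.

```python
import math


def half_threshold(z, lam):
    """Minimizer of (b - z)**2 + lam * sqrt(|b|) over b (half-thresholding)."""
    if lam <= 0.0:
        return z
    cutoff = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(z) <= cutoff:
        return 0.0
    phi = math.acos((lam / 8.0) * (abs(z) / 3.0) ** -1.5)
    return (2.0 / 3.0) * z * (1.0 + math.cos(2.0 * math.pi / 3.0 - 2.0 * phi / 3.0))


def objective(X, y, L, beta, lam1, lam3):
    """Assumed objective: RSS/(2n) + lam1 * sum(sqrt|b_k|) + lam3 * b^T L b."""
    n, p = len(X), len(beta)
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2 for i in range(n))
    sparsity = sum(math.sqrt(abs(b)) for b in beta)
    smooth = sum(beta[j] * L[j][k] * beta[k] for j in range(p) for k in range(p))
    return rss / (2.0 * n) + lam1 * sparsity + lam3 * smooth


def coordinate_descent(X, y, L, lam1, lam3, sweeps=50):
    """Cycle over the dimensions, solving each one-dimensional problem exactly."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for k in range(p):
            others = [j for j in range(p) if j != k]
            # Partial residuals with dimension k left out.
            r = [y[i] - sum(X[i][j] * beta[j] for j in others) for i in range(n)]
            # Coefficients of the one-dimensional problem a*b^2 - 2*c*b + lam1*sqrt|b|.
            a = sum(X[i][k] ** 2 for i in range(n)) / (2.0 * n) + lam3 * L[k][k]
            c = (sum(X[i][k] * r[i] for i in range(n)) / (2.0 * n)
                 - lam3 * sum(L[k][j] * beta[j] for j in others))
            beta[k] = half_threshold(c / a, lam1 / a) if a > 0.0 else 0.0
    return beta


# Toy data: the label depends only on the first gene.
X = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0], [0.0, 2.0, 0.3], [1.0, -1.0, 0.1]]
y = [2.0 * row[0] for row in X]
L = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]  # chain 0 - 1 - 2

beta_hat = coordinate_descent(X, y, L, lam1=0.05, lam3=0.01)
print(beta_hat, objective(X, y, L, beta_hat, 0.05, 0.01))
```

Because each coordinate update minimizes the assumed objective exactly in that coordinate, the objective value does not increase from one sweep to the next. Extending the sketch to the unique operators Δ_d would add one such update per dataset and dimension.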

In one possible implementation manner, the data integration module 104 has a structure as shown in fig. 5, and includes: a dimension linking unit 401, an integrating unit 402, a classifying unit 403, and a feature merging unit 404;

The dimension linking unit 401 is configured to determine, according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, whether the network knowledge dataset and the gene dataset are linked in each feature dimension;

The integrating unit 402 is configured to, if a link exists between the network knowledge dataset and the gene dataset in any feature dimension, integrate the network knowledge dataset and the gene dataset in the feature dimension in which the link exists, so as to obtain a data integration result;

The classifying unit 403 is configured to classify the network knowledge dataset and the gene dataset if no link exists between the network knowledge dataset and the gene dataset in all feature dimensions, so as to obtain a data classification result;

The feature merging unit 404 is configured to integrate the data integration result and the data classification result in all feature dimensions, and output an integration result of the network knowledge dataset and the gene dataset.
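The routing performed by units 401 to 404 can be sketched as follows. The rule used here to decide whether a feature dimension is linked (a nonzero shared coefficient or a nonzero unique coefficient in that dimension) is an assumption made for illustration; this section does not state the criterion, and the per-dimension routing is only one reading of the units described above. All names are hypothetical.

```python
def link_flags(beta, deltas, tol=1e-8):
    """Flag each feature dimension as linked or not.

    Assumed rule: dimension k is linked when the shared coefficient beta[k]
    or any dataset-specific coefficient deltas[d][k] is nonzero.
    """
    return [
        abs(b) > tol or any(abs(delta[k]) > tol for delta in deltas)
        for k, b in enumerate(beta)
    ]


def route_dimensions(beta, deltas, tol=1e-8):
    """Split the dimensions into those to integrate and those to classify,
    then merge both lists into a single per-dimension plan."""
    flags = link_flags(beta, deltas, tol)
    to_integrate = [k for k, linked in enumerate(flags) if linked]
    to_classify = [k for k, linked in enumerate(flags) if not linked]
    plan = {k: "integrate" for k in to_integrate}
    plan.update({k: "classify" for k in to_classify})
    return to_integrate, to_classify, plan


# Shared effect in dimension 0, unique effect of the second dataset in dimension 1.
beta = [0.8, 0.0, 0.0]
deltas = [[0.0, 0.0, 0.0], [0.0, -0.3, 0.0]]
to_integrate, to_classify, plan = route_dimensions(beta, deltas)
print(to_integrate, to_classify, plan)
```

The sketch only decides where each dimension is sent; the integration and classification operations themselves are left to the respective units.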

For the more detailed working principle and procedure of this embodiment, reference may be made to, but is not limited to, the description in embodiment 1.

The embodiment of the invention has the following beneficial effects:

The invention provides a regularization model-based data integration system comprising a data acquisition module, a penalty term setting module, a feature solving module and a data integration module. The penalty term setting module constructs the regularization model used for data integration; the feature solving module calculates the optimal solutions of the feature operators in all feature dimensions; and the data integration module integrates and classifies the data according to whether the network knowledge dataset and the gene dataset are linked in each feature dimension. By using the regularization model to integrate and analyze the biological data and the network knowledge data, the invention improves the accuracy and efficiency of high-dimensional data integration analysis.

While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims

1. A regularization model-based data integration method, comprising:

Acquiring a network knowledge data set and a gene data set;

Setting a Laplacian regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model; wherein the DSNet model is specifically as follows:

where β denotes the shared effect that remains consistent across the D datasets; Δ_d denotes the unique effect specific to one dataset; λ₁ and λ₂ denote parameters that control model sparsity; λ₃ and λ₄ denote parameters that adjust model smoothness; x_i denotes the input matrix; y_i denotes the true output label value; the superscript T denotes the transpose operation; ‖·‖_1/2 denotes the L1/2 norm; D denotes the number of datasets being processed; r_d denotes the weight of the unique effect of each dataset; L denotes the network knowledge dataset, represented as a symmetric Laplacian matrix; |β|ᵀL|β| denotes smoothing of β over the network knowledge dataset; and |Δ_d|ᵀL|Δ_d| denotes smoothing of Δ_d over the network knowledge dataset;

2. The regularization model-based data integration method of claim 1, wherein integrating the network knowledge dataset and the gene dataset in each feature dimension according to the optimal solutions of all feature dimensions of the first feature operator and the optimal solutions of all feature dimensions of the second feature operator, to obtain an integration result of the network knowledge dataset and the gene dataset, comprises:

3. A regularization model-based data integration system, comprising: a data acquisition module, a penalty term setting module, a feature solving module and a data integration module;

the penalty term setting module is configured to set a Laplacian regularization term and an L1/2 norm penalty term in a preset prediction model to obtain a DSNet model; wherein the DSNet model is specifically as follows:

4. The regularization model-based data integration system of claim 3, wherein the data integration module comprises: a dimension linking unit, an integrating unit, a classifying unit and a feature merging unit;