CN115691680A - Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application


Info

Publication number
CN115691680A
Authority
CN
China
Prior art keywords
ligand
receptor
cell
sequencing data
boosting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211213760.1A
Other languages
Chinese (zh)
Inventor
彭利红
刘龙龙
王钊
周立前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202211213760.1A priority Critical patent/CN115691680A/en
Publication of CN115691680A publication Critical patent/CN115691680A/en
Pending legal-status Critical Current


Abstract

The invention discloses a cell communication prediction method based on Boosting, deep forest and single cell sequencing data, and an application thereof. The method first extracts biological features of ligands and receptors and selects features with an extreme gradient boosting algorithm. An ensemble framework based on a categorical-feature gradient boosting algorithm (CatBoost), a natural gradient boosting algorithm (NGBoost) and a deep forest model is then designed to predict ligand-receptor interactions. The known and predicted ligand-receptor interactions are filtered using single cell sequencing data of tumor tissue. Finally, cell communication in the tumor microenvironment is predicted from the filtered ligand-receptor interactions and the single cell sequencing data by combining an expression product method and an expression thresholding method. The method improves the performance of cell communication prediction, can be applied to cell communication prediction in human tumor tissue, and addresses the low accuracy of existing methods in predicting cell communication strength from ligand-receptor interactions.

Description

Cell communication prediction method based on Boosting and deep forest and single cell sequencing data and application
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a cell communication prediction method based on Boosting and deep forest and single cell sequencing data and application thereof.
Background
In multicellular organisms, cellular communication coordinates the activities of various cell types, thereby forming tissues, organs, and systems, and further performing various biological functions. Cellular communication is also essential for complex bodily processes, such as immune response, growth, and homeostasis in healthy or diseased conditions. To understand the biological function of each cell type in its tissues, we need to understand the protein information transmitted by each cell type.
Single cell sequencing technology can accurately quantify gene copy number in a single cell nucleus. Because the deletion or amplification of genome segments in cancer cells causes the loss or overexpression of key genes and interferes with the growth of normal cells, the technology can be used to analyze gene copy number and is therefore widely applied in cancer diagnosis. Single cell sequencing usually yields a large amount of gene data. Screening out the key relationships among cells from these data helps reveal the regulatory mechanisms between communicating cells and improves the accuracy with which researchers can predict tissue function in homeostasis and its changes in disease. CN202011620086.X discloses a cell communication analysis method and system that uses cell communication prediction and ligand-target gene regulation prediction. The cell communication prediction comprises analysis of the expression abundance of ligand-receptor pairs, analysis of the number of significantly enriched ligand-receptor pairs, and construction of a cell interaction network diagram; the ligand-target gene regulation prediction comprises ligand activity analysis and ligand-target gene regulatory potential analysis to describe the relationships between cells. Although the cell communication analysis process of that patent is relatively efficient and comprehensive, the method has low performance, does not visualize the prediction results, and lacks analysis of the tumor microenvironment. In addition, intercellular communication is regulated by interactions between secreted ligands and plasma membrane receptors, i.e. ligand-receptor interactions, and the accuracy with which that method predicts these interactions is limited.
Disclosure of Invention
The technical problem to be solved by the invention is that the accuracy of predicting cell communication mediated by ligand-receptor interactions is insufficient and needs to be improved. To this end, the invention provides a cell communication prediction method based on Boosting, deep forest and single cell sequencing data.
A further object of the invention is to provide an application of the cell communication prediction method based on Boosting, deep forest and single cell sequencing data.
The purpose of the invention is realized by the following technical scheme:
A cell communication prediction method based on Boosting, deep forest and single cell sequencing data comprises the following steps:
S1. extracting biological features from the sequences of ligands and receptors, and selecting the biological features of each ligand-receptor pair using an extreme gradient boosting (XGBoost) algorithm;
S2. classifying the ligand-receptor pairs according to their biological features using the gradient boosting algorithm LRI-CatBoost;
S3. classifying the ligand-receptor pairs according to their biological features using the natural gradient boosting model LRI-NGBoost;
S4. using a deep forest algorithm to assign the biological features of each ligand-receptor pair to a positive class and a negative class, calculating the probability of each class, and selecting the class with the higher probability as the final class;
S5. filtering the known and predicted ligand-receptor interaction data;
S6. calculating the final cell communication strength from the filtered ligand-receptor interactions, the single cell sequencing data and a scoring method.
Further, the biological features include 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD, and 80-dimensional PseudoAAC.
Further, the extreme gradient boosting algorithm is:
Figure BDA0003875985510000021
where $i$ indexes the i-th sample, $I_L$ denotes the samples in the left node space, $g_i$ is the first-order partial derivative, $h_i$ is the second-order partial derivative, and $\lambda$ and $\gamma$ are regularization parameters.
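The equation image referenced above is not reproduced in this text. Assuming it is the standard split-gain formula of extreme gradient boosting, which uses the same symbols, it would take roughly the following form. The right-node sample set $I_R$ and the parent set $I = I_L \cup I_R$ are notation added here for illustration and do not appear in the surviving text.

```latex
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda}
  + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda}
  - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```

Under this reading, $\lambda$ damps the contribution of nodes with small second-order mass and $\gamma$ is a fixed penalty for adding a split.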
Further, classification with the LRI-CatBoost algorithm comprises the following steps:
S21. A symmetric decision tree is built using a top-down greedy algorithm. Each decision rule $r$ consists of a feature $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$. At each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets, so that a complete binary tree with $k'$ levels yields $k = 2^{k'}$ disjoint subsets. A set of feature vectors $X \subset \mathbb{R}^l$ is divided into two disjoint subsets ($X_L$ and $X_R$), and for each $x \in X$, LRI-CatBoost determines its class from these two subsets:
Figure BDA0003875985510000031
S22. Given a set
Figure BDA0003875985510000032
and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, the segmentation rule is defined as:
Figure BDA0003875985510000033
where $M$ evaluates the optimality of the segmentation rule $r$ on $X_1, \ldots, X_k$;
S23. A prediction model $M_{i,j}$ is obtained, where $M_{i,j}(i)$ denotes the result for the i-th sample based on the first $j$ samples of permutation $\sigma_r$. In each iteration $t$, a tree $T_t$ is constructed from $\{\sigma_1, \ldots, \sigma_S\}$ and its gradient is calculated:
Figure BDA0003875985510000034
S24. For each sample $i$, its gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$ is calculated. When all possible pairs of contributions have been predicted, the leaf value of sample $i$ is obtained by averaging the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$. After the tree structure $T_t$ is established, the unknown ligand-receptor pairs are classified.
Further, M may be defined as:
Figure BDA0003875985510000035
wherein
Figure BDA0003875985510000036
denotes the set of target scores of the samples in $X_i$.
Further, the LRI-NGBoost model consists of three parts: a base learner, a parametric probability distribution, and a scoring rule. For a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, where the parameters $\theta$ are obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners. For a normal distribution with parameters $\mu$ and $\log \sigma$, each stage has two base learners
Figure BDA0003875985510000037
and
Figure BDA0003875985510000038
Thus,
Figure BDA0003875985510000039
Further, for a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, where the parameters $\theta$ are obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners. For a normal distribution with parameters $\mu$ and $\log \sigma$, each stage has two base learners
Figure BDA00038759855100000310
and
Figure BDA00038759855100000311
Figure BDA00038759855100000312
The predicted output is scaled by a stage-specific scaling factor $\rho^{(m)}$ and a learning rate $\eta$, where $\rho^{(m)}$ is a single scalar:
Figure BDA00038759855100000313
Further, random forests and extra trees are selected as base classifiers. For a ligand-receptor interaction feature, each predictor in each layer calculates the proportion of feature samples belonging to the positive class and to the negative class. The class probabilities obtained from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and used as the input of the next layer of the deep forest;
a new layer is added to the model when the prediction performance is better than that of all previous layers, and training terminates when the two most recent layers bring no improvement. Finally, for each ligand-receptor pair, the interaction probabilities for the positive class and for the negative class are averaged separately, and the class with the larger average interaction probability is taken as the final class.
Further, the scoring method combines an expression product method and an expression thresholding method, and the cell communication score is calculated as:
Figure BDA0003875985510000041
where $f_1(k_1, k_2)$ is the cell communication score calculated by the expression product method and $g_1(k_1, k_2)$ is the cell communication score calculated by the expression thresholding method.
Further, the cellular communication score calculated based on the expression product method is:
Figure BDA0003875985510000042
The cell communication score calculated based on the expression thresholding method is:
Figure BDA0003875985510000043
where
Figure BDA0003875985510000044
is the communication strength score, calculated by the expression product method, for the ligand i-receptor j interaction between cell types
Figure BDA0003875985510000045
and
Figure BDA0003875985510000046
; likewise,
Figure BDA0003875985510000047
is the communication strength score, calculated by the expression thresholding method, for the ligand i-receptor j interaction between cell types
Figure BDA0003875985510000048
and
Figure BDA0003875985510000049
Further, the present invention can also visualize the outcome of cellular communication prediction.
The cell communication prediction method based on Boosting, deep forest and single cell sequencing data is applied to prediction of cell communication in human tumor tissues.
Compared with the prior art, the beneficial effects are:
On the basis of extracting the biological features of ligands and receptors, the invention uses an extreme gradient boosting algorithm to select the features of each ligand-receptor pair. An ensemble framework based on a categorical-feature gradient boosting algorithm (CatBoost), a natural gradient boosting algorithm (NGBoost) and a deep forest model is then designed to predict ligand-receptor interactions. The known and predicted ligand-receptor interactions are then filtered according to single cell sequencing data, and cell communication in the tumor microenvironment is predicted by combining an expression product method and an expression thresholding method. The method of the invention improves the performance of cell communication prediction.
Drawings
FIG. 1 is a flow chart of cellular communication prediction;
FIG. 2 is a block diagram of a framework for predicting ligand-receptor interactions;
FIG. 3 shows the AUC plots of the method of the invention on data sets 1-4;
FIG. 4 shows the AUPR plots of the method of the invention on data sets 1-4;
in FIGS. 3 and 4, a is data set 1, b is data set 2, c is data set 3, and d is data set 4;
FIG. 5 is a heat map of the ligand-receptor interactions mediating cell communication in human head and neck squamous cell carcinoma tissue;
FIG. 6 is a heat map of cell communication strength in human head and neck squamous cell carcinoma tissue;
FIG. 7 is a network diagram of cell communication strength in human head and neck squamous cell carcinoma tissue;
FIG. 8 is a heat map of the ligand-receptor interactions mediating cell communication in human breast cancer tissue;
FIG. 9 is a heat map of cell communication strength in human breast cancer tissue;
FIG. 10 is a network diagram of cell communication strength in human breast cancer tissue.
Detailed Description
The invention is further explained and illustrated by the following examples, which do not limit the invention in any way. Unless otherwise specified, the methods and equipment used in the examples are conventional, and the starting materials are conventional commercially available materials.
Example 1
As shown in FIGS. 1-2, this example provides a cell communication prediction method based on Boosting, deep forest and single cell sequencing data, which comprises the following steps:
S1. Biological features are extracted from the sequences of ligands and receptors to obtain 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD and 80-dimensional PseudoAAC features. Each ligand or receptor can thus be described as a 16,627-dimensional vector, and a ligand-receptor pair as a 33,254-dimensional vector. The biological features of each ligand-receptor pair are then selected using an extreme gradient boosting (XGBoost) algorithm, as follows:
Figure BDA0003875985510000051
where $I_L$ denotes the samples in the left node space, and $\lambda$ and $\gamma$ are regularization parameters.
Higher feature gains mean more efficient and important features. After feature selection, each ligand-receptor pair is described as a d-dimensional vector.
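The dimension arithmetic and the gain-based selection in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the per-feature gain scores are assumed to be supplied by an already trained boosting model, and the function names (`protein_dim`, `pair_dim`, `select_top_features`, `project`) are hypothetical.

```python
# Feature block sizes as listed in step S1.
FEATURE_DIMS = {
    "monoMono": 400,
    "monoDi": 8000,
    "diMono": 8000,
    "CTD": 147,
    "PseudoAAC": 80,
}


def protein_dim():
    """Dimension of the descriptor of a single ligand or receptor."""
    return sum(FEATURE_DIMS.values())


def pair_dim():
    """A ligand-receptor pair concatenates two protein descriptors."""
    return 2 * protein_dim()


def select_top_features(gains, d):
    """Return the indices of the d features with the largest gain.

    `gains` is assumed to hold one importance score per feature
    (hypothetical input); indices are returned in ascending order.
    """
    ranked = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:d])


def project(vector, kept):
    """Reduce a feature vector to the selected feature indices."""
    return [vector[j] for j in kept]
```

For instance, `select_top_features([0.1, 0.9, 0.0, 0.5], 2)` keeps features 1 and 3, and `project` then maps each pair to its reduced d-dimensional vector.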
S2. The ligand-receptor pairs are classified from their biological features using the gradient boosting algorithm LRI-CatBoost.
Let $D = (X, Y)$ denote a data set with $n$ ligand-receptor pairs, where $x \in X$ denotes a training sample with a $d$-dimensional feature vector and $y \in Y$ denotes its label. For the i-th ligand-receptor pair $x_i$, $y_i = 1$ if the pair interacts and $y_i = 0$ otherwise.
A symmetric decision tree is built using a top-down greedy algorithm. Each decision rule $r$ consists of a feature $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$. At each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets. In particular, a complete binary tree with $k'$ levels yields $k = 2^{k'}$ disjoint subsets. A set of feature vectors $X \subset \mathbb{R}^l$ is divided into two disjoint subsets ($X_L$ and $X_R$). For each $x \in X$, LRI-CatBoost can determine its class from these two subsets:
Figure BDA0003875985510000061
Thus, based on the segmentation rule, any $k$ mutually disjoint sets
Figure BDA0003875985510000062
can be split into $2k$ mutually disjoint sets
Figure BDA0003875985510000063
Given a set
Figure BDA0003875985510000064
and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, the segmentation rule is defined as:
Figure BDA0003875985510000065
where $M$ evaluates the optimality of the segmentation rule $r$ on $X_1, \ldots, X_k$. $M$ may be defined as:
Figure BDA0003875985510000066
where
Figure BDA0003875985510000067
denotes the set of target scores of the samples in $X_i$.
A prediction model $M_{i,j}$ is obtained, where $M_{i,j}(i)$ denotes the result for the i-th sample based on the first $j$ samples of permutation $\sigma_r$. In each iteration $t$, a tree $T_t$ is constructed from $\{\sigma_1, \ldots, \sigma_S\}$ and its gradient is calculated:
Figure BDA0003875985510000068
For each sample $i$, its gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$ can be calculated. When all possible pairs of contributions have been predicted, the leaf value of sample $i$ is obtained by averaging the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$. Once the tree structure $T_t$ is established, the unknown ligand-receptor interaction data can be classified.
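Two ideas from step S2 can be illustrated with a short sketch: a symmetric tree applies one rule per level, so $k'$ levels address $2^{k'}$ leaves, and the leaf value of a sample is averaged only over samples that precede it in a permutation. The sketch below is a simplification under stated assumptions: gradients are passed in as fixed numbers rather than recomputed from the model state, a sample with no predecessors in its leaf gets 0.0 by assumption, and the function names are hypothetical.

```python
def leaf_index(x, rules):
    """Route a feature vector through a symmetric tree.

    `rules` holds one (feature_index, threshold) pair per level; the
    same rule is applied to every node of that level, so k' rules
    address 2**k' leaves.
    """
    index = 0
    for feature, threshold in rules:
        index = 2 * index + (1 if x[feature] > threshold else 0)
    return index


def ordered_leaf_values(leaves, gradients, permutation):
    """Average, for each sample, the gradients of earlier samples
    (in permutation order) that share its leaf.

    Assumption: a sample with no earlier leaf-mates gets 0.0.
    """
    sums, counts, values = {}, {}, {}
    for i in permutation:
        leaf = leaves[i]
        n = counts.get(leaf, 0)
        values[i] = sums.get(leaf, 0.0) / n if n else 0.0
        sums[leaf] = sums.get(leaf, 0.0) + gradients[i]
        counts[leaf] = n + 1
    return values
```

Because each sample only sees gradients of samples that come before it, its own target never leaks into its leaf value, which is the motivation usually given for ordered boosting.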
S3. The interaction probability of each ligand-receptor pair is predicted using the natural gradient boosting model LRI-NGBoost.
The LRI-NGBoost model consists of three parts: a base learner ($f$), a parametric probability distribution ($P_\theta$) and a scoring rule ($S$). For a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, where the parameters $\theta$ are obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners. For a normal distribution with parameters $\mu$ and $\log \sigma$, each stage has two base learners
Figure BDA0003875985510000071
and
Figure BDA0003875985510000072
Thus,
Figure BDA0003875985510000073
Figure BDA0003875985510000074
The predicted output is scaled by a stage-specific scaling factor $\rho^{(m)}$ and a learning rate $\eta$, where $\rho^{(m)}$ is a single scalar:
Figure BDA0003875985510000075
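The way the stage outputs, the scaling factors and the learning rate combine can be sketched as below. Since the equation images are not reproduced here, the sketch assumes the standard natural gradient boosting update, in which each stage's output is multiplied by $\eta \rho^{(m)}$ and subtracted from the running parameters; the function names and the tuple layout `(mu, log_sigma)` are hypothetical.

```python
import math


def ngboost_params(theta0, stage_outputs, scalings, learning_rate):
    """Aggregate base-learner outputs into distribution parameters.

    theta0        -- initial (mu, log_sigma)
    stage_outputs -- one (f_mu, f_log_sigma) pair per stage for sample x
    scalings      -- one scalar scaling factor rho per stage
    Assumed update: theta = theta0 - eta * sum_m rho_m * f_m(x).
    """
    mu, log_sigma = theta0
    for (f_mu, f_log_sigma), rho in zip(stage_outputs, scalings):
        mu -= learning_rate * rho * f_mu
        log_sigma -= learning_rate * rho * f_log_sigma
    return mu, log_sigma


def normal_pdf(y, mu, log_sigma):
    """Density of the predicted normal distribution at y."""
    sigma = math.exp(log_sigma)
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

Parameterizing the spread as $\log \sigma$ keeps $\sigma$ positive no matter what the base learners output, which is why two learners per stage suffice for a normal distribution.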
S4. A deep forest algorithm assigns the biological features of each ligand-receptor pair to a positive class and a negative class, and the class with the larger average interaction probability is selected as the final class.
Random forests and extra trees are selected as base classifiers, and each cascade layer consists of 2 random forests and 2 extra trees. Each predictor consists of 100 decision trees. For a ligand-receptor interaction feature, each predictor in each layer calculates the proportion of feature samples belonging to the positive and negative classes. The class probabilities from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and serves as the input to the next layer of the deep forest.
A new layer is added to the model when the prediction performance is better than that of all previous layers, and training terminates when the two most recent layers bring no improvement. Finally, for each ligand-receptor pair, the interaction probabilities for the positive and negative classes are averaged separately, and the class with the larger average interaction probability is taken as the final class.
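The cascade mechanics described in step S4 can be sketched without any forest library. The sketch covers only the bookkeeping: building the next layer's input, deciding how many layers to keep, and picking the final class. The predictor outputs are assumed to be given as `(p_negative, p_positive)` pairs, ties are resolved to the negative class by assumption, and the function names are hypothetical.

```python
def next_layer_input(features, predictor_probs):
    """Concatenate the original features with the class vector.

    predictor_probs holds one (p_negative, p_positive) pair per
    predictor; with 2 random forests and 2 extra trees the class
    vector has 8 entries.
    """
    class_vector = [p for probs in predictor_probs for p in probs]
    return list(features) + class_vector


def layers_to_keep(layer_scores, patience=2):
    """Return the index (1-based) of the last layer that improved.

    Growth stops once `patience` consecutive layers fail to beat
    the best score seen so far.
    """
    best, best_layer = float("-inf"), 0
    for layer, score in enumerate(layer_scores, start=1):
        if score > best:
            best, best_layer = score, layer
        elif layer - best_layer >= patience:
            break
    return best_layer


def final_class(predictor_probs):
    """Average the class probabilities and pick the larger one."""
    n = len(predictor_probs)
    avg_neg = sum(p[0] for p in predictor_probs) / n
    avg_pos = sum(p[1] for p in predictor_probs) / n
    return 1 if avg_pos > avg_neg else 0
```

Feeding the class vector forward alongside the raw features lets each new layer refine the previous layer's estimate without losing access to the original descriptors.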
Finally, the classification of each ligand-receptor pair is obtained by integrating the results of LRI-CatBoost, LRI-NGBoost and LRI-DF.
S5. The known and predicted ligand-receptor interactions are filtered. If the ligand or the receptor of a ligand-receptor interaction is not expressed in the cells of the single cell sequencing data, that interaction is excluded from the corresponding cell communication.
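The filtering rule of step S5 can be sketched as follows. The data layout is an assumption made for illustration: expression is a dictionary from gene name to a list of per-cell values, and a gene counts as expressed when at least one cell has a value above `min_value`. The function names are hypothetical.

```python
def expressed_genes(expression, min_value=0.0):
    """Genes with at least one cell above min_value.

    expression maps a gene name to its per-cell expression values
    (assumed layout).
    """
    return {
        gene
        for gene, values in expression.items()
        if any(v > min_value for v in values)
    }


def filter_interactions(pairs, expression):
    """Keep only pairs whose ligand and receptor are both expressed."""
    present = expressed_genes(expression)
    return [(lig, rec) for lig, rec in pairs if lig in present and rec in present]
```

Genes that are absent from the expression data altogether are treated as not expressed, so any pair involving them is dropped.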
And S6, calculating according to the filtered ligand-receptor interaction, the single cell sequencing data and a scoring method to obtain a final communication score.
The scoring method adopts a combination of an expression product method and an expression threshold value method.
(1) Expression product method: the interaction score of ligand $i$ and receptor $j$ between two cell types
Figure BDA0003875985510000076
and
Figure BDA0003875985510000077
is predicted as follows, where
Figure BDA0003875985510000078
denotes the arithmetic mean of the expression of ligand $i$ and receptor $j$ in cell type
Figure BDA0003875985510000079
The score is:
Figure BDA00038759855100000710
The cell communication score $f_1(k_1, k_2)$ between
Figure BDA0003875985510000081
and
Figure BDA0003875985510000082
can then be calculated as:
Figure BDA0003875985510000083
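The expression product idea can be sketched as below. Because the equation images are not reproduced in this text, two assumptions are made and should be read as such: the per-pair score is the mean ligand expression in the sending cell type multiplied by the mean receptor expression in the receiving cell type, and $f_1$ sums these scores over the filtered pairs. The data layout and function names are hypothetical.

```python
from statistics import mean


def mean_expression(expression, gene, cell_type, cell_types):
    """Arithmetic mean of a gene's expression over cells of one type.

    expression maps gene -> per-cell values; cell_types gives the
    type label of each cell in the same order (assumed layout).
    """
    values = [v for v, c in zip(expression[gene], cell_types) if c == cell_type]
    return mean(values) if values else 0.0


def product_score(expression, cell_types, ligand, receptor, k1, k2):
    """Mean ligand expression in k1 times mean receptor expression in k2."""
    return (
        mean_expression(expression, ligand, k1, cell_types)
        * mean_expression(expression, receptor, k2, cell_types)
    )


def communication_score_f1(expression, cell_types, pairs, k1, k2):
    """Assumed aggregation: sum of product scores over all pairs."""
    return sum(
        product_score(expression, cell_types, lig, rec, k1, k2)
        for lig, rec in pairs
    )
```

The score is directional: swapping `k1` and `k2` asks whether the second cell type sends the ligand and the first receives it, which generally gives a different value.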
(2) Expression thresholding method: the interaction score of ligand $i$ and receptor $j$ between two cell types
Figure BDA0003875985510000084
and
Figure BDA0003875985510000085
is predicted, where $\sigma_i$ and $\sigma_j$ denote standard deviations:
Figure BDA0003875985510000086
The cell communication score $g_1(k_1, k_2)$ between
Figure BDA0003875985510000087
and
Figure BDA0003875985510000088
can then be calculated as:
Figure BDA0003875985510000089
The cell communication scores $f_1(k_1, k_2)$ and $g_1(k_1, k_2)$ between
Figure BDA00038759855100000810
and
Figure BDA00038759855100000811
as calculated by the expression product method and the expression thresholding method, respectively, are combined to obtain the final cell communication score. That is, the cell communication score between
Figure BDA00038759855100000812
and
Figure BDA00038759855100000813
can be calculated by the following formula:
Figure BDA00038759855100000814
Example 2
This example compares the cell communication prediction algorithm of the invention with four representative protein interaction prediction methods: an extreme gradient boosting algorithm, a support vector machine, a distributed gradient boosting framework based on decision trees, and an ordinal-regression-based cyclic convolutional neural network algorithm. Performance was evaluated by 20 repetitions of 5-fold cross-validation, with AUC and AUPR as evaluation metrics; higher AUC and AUPR values indicate better performance.
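The evaluation protocol can be sketched in plain Python. This is an illustration of repeated k-fold splitting and of AUC as a pairwise ranking statistic, not the evaluation code used for the reported numbers; the function names and the fixed random seed are assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, n_splits=5, n_repeats=20, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_splits):
            test = order[fold::n_splits]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test
```

With the defaults, the generator yields 100 train/test splits, and the reported metric would be the average over those splits.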
The parameters of the extreme gradient boosting algorithm, the support vector machine and the distributed gradient boosting framework based on decision trees were set to their default values. For the ordinal-regression-based cyclic convolutional neural network algorithm, the parameters were set as follows: learning_rate = 0.01, n_estimators = 20, max_depth = 3, criterion = friedman_mse, loss = default, min_samples_split = 2. For the cell communication prediction algorithm of the invention, boosting_type, max_depth and n_estimators in LRI-CatBoost were set to Ordered, 10 and 2000, respectively; the learning rate, natural gradient, frac and eval in LRI-NGBoost were set to 0.01, true, 1.0 and 100, respectively; and n_trees and predictor in LRI-DF were set to 100 and forest, respectively. The dimension of the ligand-receptor interaction feature vector after dimensionality reduction was set to 300.
In this experiment, four different ligand-receptor interaction data sets were collected. Data sets 1 and 2 are both from the CellTalk database. Data set 3 was constructed by Skelly et al. Data set 4 was constructed by Ximerirkis et al. Details of the data sets are shown in Table 1 below:
TABLE 1
Data set     Ligands   Receptors   Ligand-receptor interactions
Data set 1   812       780         3390
Data set 2   650       588         2031
Data set 3   574       559         2006
Data set 4   1129      1335        6585
The performance obtained by the different methods described above is shown in Table 2 below:
TABLE 2
Figure BDA0003875985510000091
As can be seen from Table 2 above and FIGS. 3-4, the ligand-receptor interaction prediction algorithm of the invention achieves the best AUC on all four data sets, namely 0.8533, 0.8316, 0.8150 and 0.8434, which are 1.39%, 3.29%, 3.59% and 1.89% higher, respectively, than those of the second-best method, the distributed gradient boosting framework based on decision trees. It also achieves the best AUPR on all four data sets, namely 0.8681, 0.8442, 0.8259 and 0.8632, which are 1.11%, 2.19%, 2.11% and 1.54% higher, respectively, than those of the same second-best method. The ligand-receptor interaction prediction algorithm of the invention, LRI-CNbDP, thus obtains the best AUC and AUPR on all four data sets used in the experiment, demonstrating its strong ligand-receptor interaction prediction performance.
Example 3
This example applies the scheme of the invention to a practical prediction task. The relevant sequencing data of human head and neck squamous cell carcinoma tissue were downloaded from the GEO database; the cell types include head and neck squamous cell carcinoma cells, fibroblasts, B cells, muscle cells, macrophages, endothelial cells, T cells, dendritic cells and mast cells. Combining the filtered ligand-receptor interactions of the invention with the single cell sequencing data, a cell communication network related to head and neck squamous cell carcinoma was established and cell communication in the human tissue was predicted. As shown in FIGS. 5-7, the method of the invention found that, in human head and neck squamous cell carcinoma tissue, the communication strength between fibroblasts and head and neck squamous cell carcinoma cells was relatively high.
Example 4
This example applies the scheme of the invention to a practical prediction task. The relevant sequencing data of human breast cancer tissue were downloaded from the GEO database, and a cell communication network related to breast cancer was established by combining the filtered ligand-receptor interactions of the invention with the single cell sequencing data, so as to predict cell communication in the human tissue. As shown in FIGS. 8-10, the probability of communication between immune cells and breast cancer cells was relatively high in human breast cancer tissue.
It should be understood that the above embodiments are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims (10)

1. A cell communication prediction method based on Boosting, deep forest and single cell sequencing data is characterized by comprising the following steps:
S1. extracting biological features from the sequences of ligands and receptors, and selecting the biological features of each ligand-receptor pair using an extreme gradient boosting (XGBoost) algorithm;
S2. classifying the ligand-receptor pairs according to their biological features using the gradient boosting algorithm LRI-CatBoost;
S3. classifying the ligand-receptor pairs according to their biological features using the natural gradient boosting model LRI-NGBoost;
S4. using a deep forest algorithm to assign the biological features of each ligand-receptor pair to a positive class and a negative class, calculating the probability of each class, and selecting the class with the higher probability as the final class;
S5. filtering the known and predicted ligand-receptor interaction data;
S6. calculating the final cell communication strength from the filtered ligand-receptor interactions, the single cell sequencing data and a scoring method.
2. The method of cellular communication prediction based on Boosting and deep forest and single cell sequencing data of claim 1, wherein the biological features include 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD and 80-dimensional PseudoAAC.
3. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data according to claim 1, wherein the extreme gradient boosting algorithm is:
Figure FDA0003875985500000011
where $i$ indexes the i-th sample, $I_L$ denotes the samples in the left node space, $g_i$ is the first-order partial derivative, $h_i$ is the second-order partial derivative, and $\lambda$ and $\gamma$ are regularization parameters.
4. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data according to claim 1, wherein classification with the LRI-CatBoost algorithm comprises the following steps:
S21. building a decision tree using a top-down greedy algorithm, wherein each decision rule $r$ consists of a feature $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$; at each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets, so that a complete binary tree with $k'$ levels yields $k = 2^{k'}$ disjoint subsets; a set of feature vectors $X \subset \mathbb{R}^l$ is divided into two disjoint subsets ($X_L$ and $X_R$), and for each $x \in X$, LRI-CatBoost determines its class from these two subsets:
Figure FDA0003875985500000021
S22. given a set
Figure FDA0003875985500000022
and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, the segmentation rule is defined as:
Figure FDA0003875985500000023
where $M$ evaluates the optimality of the segmentation rule $r$ on $X_1, \ldots, X_k$;
S23. obtaining a prediction model $M_{i,j}$, where $M_{i,j}(i)$ denotes the result for the i-th sample based on the first $j$ samples of permutation $\sigma_r$; in each iteration $t$, a tree $T_t$ is constructed from $\{\sigma_1, \ldots, \sigma_S\}$ and its gradient is calculated:
Figure FDA0003875985500000024
S24. calculating, for each sample $i$, its gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$; when all possible pairs of contributions have been predicted, the leaf value of sample $i$ is obtained by averaging the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$; after the tree structure $T_t$ is established, the unknown ligand-receptor interactions are classified.
5. The method for predicting cell communication based on Boosting and deep forest and single cell sequencing data according to claim 1, wherein M can be defined as:
Figure FDA0003875985500000025
wherein
Figure FDA0003875985500000026
denotes the set of target scores of the samples in $X_i$.
6. The method of claim 1, wherein the Boosting, deep forest and single cell sequencing data-based cellular communication prediction method is characterized in that for a sample x, LRI-NGboost, P is distributed by a condition θ Predicting the label y thereof, wherein the parameter theta is formed by the initial theta (0) And M basic classifier outputs, two basic classifiers for each stage for normal distribution with parameters of μ and log σ
Figure FDA0003875985500000027
and
Figure FDA0003875985500000028
Figure FDA0003875985500000029
Figure FDA00038759855000000210
the predicted outputs are scaled by a stage-specific scaling factor ρ^(m) and a learning rate η, wherein the scaling factor ρ^(m) is a single scalar:
Figure FDA00038759855000000211
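The way the stage outputs build up the distribution parameters can be sketched as follows. The formula image is not reproduced in this text, so the update θ = θ^(0) − η Σ_m ρ^(m) f^(m)(x) used here is an assumption taken from the standard NGBoost formulation; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def ngboost_parameters(theta0: Tuple[float, float],
                       stage_outputs: Sequence[Tuple[float, float]],
                       scalings: Sequence[float],
                       learning_rate: float) -> Tuple[float, float]:
    """Combine the initial theta^(0) = (mu, log sigma) with M stage outputs.

    Each stage output is the pair (f_mu(x), f_log_sigma(x)) of its two base
    learners. Assumed update: theta = theta0 - eta * sum_m rho_m * f_m(x).
    """
    mu, log_sigma = theta0
    for (f_mu, f_log_sigma), rho in zip(stage_outputs, scalings):
        mu -= learning_rate * rho * f_mu
        log_sigma -= learning_rate * rho * f_log_sigma
    return mu, log_sigma


def normal_density(y: float, mu: float, log_sigma: float) -> float:
    """Density of the predicted normal distribution at label y."""
    sigma = math.exp(log_sigma)
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

Parameterising the spread as log σ keeps σ positive no matter what values the base learners output.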
7. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data according to claim 1, wherein random forests and extremely randomized trees (extra trees) are selected as base classifiers; for a ligand-receptor interaction feature, each predictor in each layer calculates the proportions of samples belonging to the positive and negative classes, and the class probabilities obtained from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and used as the input of the next layer of the deep forest;
a new layer is added to the model when the prediction performance is better than that of all previous layers, and training is terminated when the last two layers bring no improvement; finally, for each ligand-receptor pair, the average interaction probabilities of belonging to the positive and negative classes are calculated respectively, and the class with the larger average interaction probability is taken as the final class.
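The cascade logic of this claim (class-vector concatenation, layer growth with early stopping, and the final averaged decision) can be sketched in plain Python. The per-predictor class probabilities are taken as given inputs here; in practice they would come from trained random forest and extra-trees models. All function names and the `patience` parameter are hypothetical.

```python
from typing import List, Sequence


def augment(features: Sequence[float],
            class_probs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the original feature vector with the class vector
    (one [p_negative, p_positive] pair per predictor) to form the
    input of the next cascade layer."""
    class_vector = [p for probs in class_probs for p in probs]
    return list(features) + class_vector


def should_stop(layer_scores: Sequence[float], patience: int = 2) -> bool:
    """Stop growing the cascade once the last `patience` layers have failed
    to beat the best score among the earlier layers."""
    if len(layer_scores) <= patience:
        return False
    best_before = max(layer_scores[:-patience])
    return all(s <= best_before for s in layer_scores[-patience:])


def final_class(class_probs: Sequence[Sequence[float]]) -> int:
    """Average the per-predictor probabilities and return the class with
    the larger mean (1 = positive, 0 = negative)."""
    n = len(class_probs)
    mean_neg = sum(p[0] for p in class_probs) / n
    mean_pos = sum(p[1] for p in class_probs) / n
    return 1 if mean_pos > mean_neg else 0
```

With two predictors per layer, each layer adds four values to the feature vector, so the input width stays constant from the second layer onward because the class vector is always appended to the original features.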
8. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data as claimed in claim 1, wherein the scoring method is a combination of an expression product method and an expression threshold method, and the cell communication score calculation method is as follows:
Figure FDA0003875985500000031
wherein f_1(k_1, k_2) is the cell communication score obtained by the expression product method, and g_1(k_1, k_2) is the cell communication score obtained by the expression threshold method.
9. The method for predicting cellular communication based on Boosting and deep forest and single cell sequencing data according to claim 8, wherein the cellular communication score calculated based on the expression product method is as follows:
Figure FDA0003875985500000032
the cell communication score calculated based on the expression threshold method is as follows:
Figure FDA0003875985500000033
wherein
Figure FDA0003875985500000034
denotes, for cell types
Figure FDA0003875985500000035
and
Figure FDA0003875985500000036
the communication strength score mediated by the ligand i-receptor j interaction and calculated by the expression product method, and
Figure FDA0003875985500000037
denotes, for cell types
Figure FDA0003875985500000038
and
Figure FDA0003875985500000039
the communication strength score mediated by the ligand i-receptor j interaction and calculated by the expression threshold method.
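The two scoring ideas named in claims 8 and 9 can be illustrated with a short Python sketch. The patent's exact formulas are contained in the formula images and are not reproduced in this text, so the definitions below are assumptions: the product score multiplies the mean ligand expression in one cell type by the mean receptor expression in the other, and the threshold score counts a pair only when both means exceed a cutoff. The data layout and all names are hypothetical, and the combination of the two scores defined in claim 8 is deliberately not sketched.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: expression values per (gene, cell type).
Expression = Dict[Tuple[str, str], List[float]]


def expression_product_score(ligand_expr: Sequence[float],
                             receptor_expr: Sequence[float]) -> float:
    """Assumed product score: mean ligand expression in cell type k1 times
    mean receptor expression in cell type k2."""
    return mean(ligand_expr) * mean(receptor_expr)


def expression_threshold_score(ligand_expr: Sequence[float],
                               receptor_expr: Sequence[float],
                               threshold: float) -> int:
    """Assumed threshold score: 1 if both mean expressions exceed the
    threshold, otherwise 0."""
    return int(mean(ligand_expr) > threshold and mean(receptor_expr) > threshold)


def cell_type_scores(pairs: Sequence[Tuple[str, str]], expr: Expression,
                     k1: str, k2: str, threshold: float) -> Tuple[float, int]:
    """Sum both scores over all filtered ligand-receptor pairs (i, j) for the
    ordered cell-type pair (k1, k2). Returns (product_total, threshold_total)."""
    product_total = 0.0
    threshold_total = 0
    for ligand, receptor in pairs:
        lig, rec = expr[(ligand, k1)], expr[(receptor, k2)]
        product_total += expression_product_score(lig, rec)
        threshold_total += expression_threshold_score(lig, rec, threshold)
    return product_total, threshold_total
```

The product score reflects how strongly both partners are expressed, while the threshold score only records whether both are present, which is one reason for using the two together.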
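10. Use of the cell communication prediction method based on Boosting, deep forest and single cell sequencing data according to claim 1 for predicting cell communication in the human tumor microenvironment.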
CN202211213760.1A 2022-09-30 2022-09-30 Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application Pending CN115691680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211213760.1A CN115691680A (en) 2022-09-30 2022-09-30 Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211213760.1A CN115691680A (en) 2022-09-30 2022-09-30 Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application

Publications (1)

Publication Number Publication Date
CN115691680A true CN115691680A (en) 2023-02-03

Family

ID=85063940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211213760.1A Pending CN115691680A (en) 2022-09-30 2022-09-30 Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application

Country Status (1)

Country Link
CN (1) CN115691680A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116469461A (en) * 2023-06-01 2023-07-21 中国农业科学院作物科学研究所 Data analysis method in gene prediction process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination