CN115691680A

CN115691680A - Cell communication prediction method based on Boosting, deep forest and single cell sequencing data and application

Info

Publication number: CN115691680A
Application number: CN202211213760.1A
Authority: CN
Inventors: 彭利红; 刘龙龙; 王钊; 周立前
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2023-02-03

Abstract

The invention discloses a cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data, and an application thereof. An ensemble framework is designed to predict ligand-receptor interactions based on a categorical-feature gradient boosting algorithm (CatBoost), a natural gradient boosting algorithm (NGBoost), and a deep forest model. The known and predicted ligand-receptor interactions are then filtered using single-cell sequencing data of tumor tissue. Finally, cell communication in the tumor microenvironment is predicted from the filtered ligand-receptor interactions and the single-cell sequencing data by combining an expression product method with an expression threshold method. The method improves the accuracy of cell communication prediction, can be applied to cell communication prediction in human tumor tissues, and addresses the low accuracy of existing methods in predicting cell communication strength from ligand-receptor interactions.

Description

Cell communication prediction method based on Boosting and deep forest and single cell sequencing data and application

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a cell communication prediction method based on Boosting and deep forest and single cell sequencing data and application thereof.

Background

In multicellular organisms, cellular communication coordinates the activities of various cell types, thereby forming tissues, organs, and systems, and further performing various biological functions. Cellular communication is also essential for complex bodily processes, such as immune response, growth, and homeostasis in healthy or diseased conditions. To understand the biological function of each cell type in its tissues, we need to understand the protein information transmitted by each cell type.

Single-cell sequencing can accurately quantify the copy number of genes in a single cell nucleus. Because deletion or amplification of parts of the genome in cancer cells causes the loss or overexpression of key genes and interferes with the growth of normal cells, the technique can be used to analyze gene copy number and is widely applied in cancer diagnosis. Single-cell sequencing typically yields a large amount of gene data; screening out the key interactions among cells helps to reveal the regulatory mechanisms between communicating cells and improves the accuracy with which researchers can predict tissue function in homeostasis and its changes in disease. CN202011620086.X discloses a cell communication analysis method and system that uses cell communication prediction and ligand-target gene regulation prediction. The cell communication prediction comprises analysis of the expression abundance of ligand-receptor pairs, analysis of the number of significantly enriched ligand-receptor pairs, and construction of a cell interaction network diagram; the ligand-target gene regulation prediction comprises ligand activity analysis and ligand-target gene regulatory potential analysis, which together describe the relationships between cells. Although the cell communication analysis process of that patent is relatively efficient and comprehensive, the method has low prediction performance, does not visualize the prediction results, and lacks analysis of the tumor microenvironment. Moreover, intercellular communication is regulated by the interaction between secreted ligands and plasma membrane receptors, that is, by ligand-receptor interactions, and the accuracy with which that method predicts these interactions is limited.

Disclosure of Invention

The technical problem to be solved by the invention is that the accuracy of predicting cell communication mediated by ligand-receptor interactions is insufficient and needs to be improved. To this end, the invention provides a cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data.

Another object of the invention is to provide an application of the cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data.

The purpose of the invention is realized by the following technical scheme:

A cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data comprises the following steps:

S1. extracting biological features from the sequences of ligands and receptors, and selecting the biological features of each ligand-receptor pair using an extreme gradient boosting algorithm;

S2. classifying the ligand-receptor pairs according to their biological features using the gradient boosting algorithm LRI-CatBoost;

S3. classifying the ligand-receptor pairs according to their biological features using the natural gradient boosting model LRI-NGBoost;

S4. using a deep forest algorithm to assign the biological features of each ligand-receptor pair to a positive class and a negative class, calculating the probability of each, and selecting the class with the higher probability as the final class;

S5. filtering the known and predicted ligand-receptor interaction data;

S6. calculating the final cell communication strength from the filtered ligand-receptor interactions, the single-cell sequencing data, and a scoring method.

Further, the biological features include 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD, and 80-dimensional PseudoAAC.

Further, the extreme gradient boosting algorithm is:

where $i$ indexes the samples, $I_L$ denotes the samples in the left node space, $g_i$ is the first partial derivative, $h_i$ is the second partial derivative, and $\lambda$ and $\gamma$ are regularization parameters.

Further, the step of classifying the LRI-Catboost algorithm comprises the following steps of:

S21. A symmetric decision tree is built with a top-down greedy algorithm. Each decision rule $r$ consists of a feature index $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$. At each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets, with $k = 2^{k'}$ for a complete binary tree with $k'$ levels. A set of feature vectors $X \subseteq \mathbb{R}^l$ is divided into two disjoint subsets, $X^L$ and $X^R$. For each $x \in X$, LRI-CatBoost determines its class from these two subsets:

S22. Given sets $X_1, \ldots, X_k$ and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, the splitting rule is defined as:

where $M$ evaluates the optimality of the splitting rule $r$ on $X_1, \ldots, X_k$;

S23. Prediction models $M_{i,j}$ are obtained, where $M_{i,j}(i)$ denotes the prediction for the $i$-th sample based on the first $j$ samples in the permutation $\sigma_r$. In each iteration $t$, a tree $T_t$ is constructed from $\{\sigma_1, \ldots, \sigma_S\}$ and its gradient is calculated:

S24. The gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$ of each sample $i$ is calculated. When all possible pairs of contributions have been predicted, the leaf value of sample $i$ is obtained as the average of the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$. Once the tree structure $T_t$ is established, the unknown ligand-receptor pairs are classified.

Further, M may be defined as:

where

denotes the set of target scores of the samples in $X_i$.

Further, the LRI-NGBoost model consists of three parts: base learners, a parametric probability distribution, and a scoring rule. For a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, where the parameter $\theta$ is obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners. For a normal distribution with parameters $\mu$ and $\log \sigma$, each stage $m$ has two base learners, $f_\mu^{(m)}$ and $f_{\log \sigma}^{(m)}$, so that $f^{(m)} = (f_\mu^{(m)}, f_{\log \sigma}^{(m)})$.

Further, the predicted outputs are scaled by a stage-specific scaling factor $\rho^{(m)}$ and a learning rate $\eta$, where the scaling factor $\rho^{(m)}$ is a single scalar:

Further, random forests and extra trees (extremely randomized trees) are selected as base classifiers. For a ligand-receptor interaction feature, each predictor in each layer calculates the proportion of feature samples belonging to the positive class and the negative class; the class probabilities obtained from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and used as the input of the next layer of the deep forest.

A new layer is added to the model when the prediction performance is better than that of all previous layers, and training terminates when the following two layers bring no improvement. Finally, for each ligand-receptor pair, the interaction probabilities for the positive class and for the negative class are averaged separately, and the class with the larger average interaction probability is taken as the final class.

Furthermore, the scoring method combines an expression product method and an expression threshold method, and the cell communication score is calculated as:

where $f_1(k_1, k_2)$ is the cell communication score calculated by the expression product method and $g_1(k_1, k_2)$ is the cell communication score calculated by the expression threshold method.

Further, the cell communication score calculated by the expression product method is:

and the cell communication score calculated by the expression threshold method is:

where

is the communication strength score between cell types $k_1$ and $k_2$ mediated by the interaction between ligand $i$ and receptor $j$, calculated by the expression product method, and

is the corresponding communication strength score between cell types $k_1$ and $k_2$ calculated by the expression threshold method.

Further, the present invention can also visualize the outcome of cellular communication prediction.

The cell communication prediction method based on Boosting, deep forest and single cell sequencing data is applied to prediction of cell communication in human tumor tissues.

Compared with the prior art, the beneficial effects are:

On the basis of extracting the biological features of ligands and receptors, the invention uses an extreme gradient boosting algorithm to select the features of each ligand-receptor pair. An ensemble framework based on a categorical-feature gradient boosting algorithm, a natural gradient boosting algorithm, and a deep forest model is then designed to predict ligand-receptor interactions. The known and predicted ligand-receptor interactions are filtered according to single-cell sequencing data, and cell communication in the tumor microenvironment is predicted by combining an expression product method with an expression threshold method. The method of the invention improves the accuracy of cell communication prediction.

Drawings

FIG. 1 is a flow chart of cellular communication prediction;

FIG. 2 is a block diagram of a framework for predicting ligand-receptor interactions;

FIG. 3 shows the AUC curves of the method of the invention on data sets 1-4;

FIG. 4 shows the AUPR curves of the method of the invention on data sets 1-4;

where panels a, b, c, and d correspond to data sets 1, 2, 3, and 4, respectively;

FIG. 5 is a heat map of the ligand-receptor interactions mediating cell communication in human head and neck squamous cell carcinoma tissue;

FIG. 6 is a heat map of cell communication intensity in human head and neck squamous cell carcinoma tissue;

FIG. 7 is a network of cell communication intensity in human head and neck squamous cell carcinoma tissue;

FIG. 8 is a heat map of the ligand-receptor interactions mediating cell communication in human breast cancer tissue;

FIG. 9 is a heat map of cell communication intensity in human breast cancer tissue;

FIG. 10 is a network of cell communication intensity in human breast cancer tissue.

Detailed Description

The invention is further explained and illustrated below with reference to specific examples, which do not limit the invention in any way. Unless otherwise specified, the methods and equipment used in the examples are conventional, and the starting materials are conventional commercially available materials.

Example 1

As shown in fig. 1-2, the present embodiment provides a cell communication prediction method based on Boosting and deep forest and single cell sequencing data, which specifically includes the steps of:

S1. Biological features are extracted from the sequences of ligands and receptors, giving 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD, and 80-dimensional PseudoAAC features. Each ligand or receptor can thus be described as a 16,627-dimensional vector, and a ligand-receptor pair as a 33,254-dimensional vector. The biological features of each ligand-receptor pair are then selected using an extreme gradient boosting algorithm, which is as follows:

where $I_L$ denotes the samples in the left node space, and $\lambda$ and $\gamma$ are regularization parameters.

Higher feature gains mean more efficient and important features. After feature selection, each ligand-receptor pair is described as a d-dimensional vector.
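As a minimal sketch of how a feature gain built from these symbols can be computed, the Python below uses the standard XGBoost split gain, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are sums of $g_i$ and $h_i$ over the left and right nodes. That this is exactly the formula used in the invention is an assumption, and the names `split_gain` and `rank_features` are hypothetical.

```python
def split_gain(grads, hessians, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting one node into a left and a right child.

    grads / hessians: per-sample first and second derivatives (g_i, h_i).
    left_mask: True where the sample falls into the left child (I_L).
    lam, gamma: regularization parameters (lambda and gamma in the text).
    """
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)

    g_left = sum(g for g, m in zip(grads, left_mask) if m)
    h_left = sum(h for h, m in zip(hessians, left_mask) if m)
    g_all, h_all = sum(grads), sum(hessians)
    g_right, h_right = g_all - g_left, h_all - h_left
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_all, h_all)) - gamma


def rank_features(gains, d):
    """Indices of the d features with the largest accumulated gain."""
    order = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return order[:d]
```

Under this reading, the gains of all splits that use a feature would be accumulated per feature, and `rank_features` would keep the top $d$ of the 33,254 dimensions.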

S2. Ligand-receptor pairs are classified from the biological features of ligand-receptor interactions using the gradient boosting algorithm LRI-CatBoost.

Let $D = (X, Y)$ denote a data set of $n$ ligand-receptor pairs, where $x \in X$ is a training sample with a $d$-dimensional feature vector and $y \in Y$ is its label. For the $i$-th ligand-receptor pair $x_i$, $y_i = 1$ if the pair interacts and $y_i = 0$ otherwise.

A symmetric decision tree is built with a top-down greedy algorithm. Each decision rule $r$ consists of a feature index $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$. At each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets; in particular, $k = 2^{k'}$ for a complete binary tree with $k'$ levels. A set of feature vectors $X \subseteq \mathbb{R}^l$ is divided into two disjoint subsets, $X^L$ and $X^R$. For each $x \in X$, LRI-CatBoost can determine its class from these two subsets:

Thus, based on the splitting rule, any $k$ disjoint sets $X_1, \ldots, X_k$ can be split into $2k$ disjoint sets $X_1^L, \ldots, X_k^L, X_1^R, \ldots, X_k^R$.

Given sets $X_1, \ldots, X_k$ and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, the splitting rule is defined as:

where $M$ evaluates the optimality of the splitting rule $r$ on $X_1, \ldots, X_k$. $M$ may be defined as:

where

denotes the set of target scores of the samples in $X_i$.

Prediction models $M_{i,j}$ are obtained, where $M_{i,j}(i)$ denotes the prediction for the $i$-th sample based on the first $j$ samples in the permutation $\sigma_r$. In each iteration $t$, a tree $T_t$ is constructed from $\{\sigma_1, \ldots, \sigma_S\}$ and its gradient is calculated:

For each sample $i$, its gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$ can be calculated. When all possible pairs of contributions have been predicted, the leaf value of sample $i$ is obtained as the average of the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$. Once the tree structure $T_t$ is established, unknown ligand-receptor interaction data can be classified.
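The symmetric-tree partition and the leaf-value rule of step S2 can be sketched roughly as follows. This is a simplified illustration, not the LRI-CatBoost implementation: it assumes that a sample goes to the right subset when `x[feature] > threshold`, and it uses one fixed gradient per sample instead of permutation-specific gradients. The function names are hypothetical.

```python
def leaf_index(x, rules):
    """Leaf of a symmetric tree with one (feature, threshold) rule per level.

    Every node on a level shares the same rule, so a tree with k' levels
    has 2**k' leaves and each level contributes one bit of the leaf index.
    """
    idx = 0
    for feature, threshold in rules:
        idx = 2 * idx + (1 if x[feature] > threshold else 0)
    return idx


def ordered_leaf_values(leaves, grads, permutation):
    """Leaf value of each sample computed from preceding samples only.

    For sample i, average the gradients of the samples that come before i
    in the permutation and fall in the same leaf (0.0 if there are none).
    """
    sums, counts = {}, {}
    values = [0.0] * len(leaves)
    for i in permutation:
        leaf = leaves[i]
        if counts.get(leaf, 0) > 0:
            values[i] = sums[leaf] / counts[leaf]
        sums[leaf] = sums.get(leaf, 0.0) + grads[i]
        counts[leaf] = counts.get(leaf, 0) + 1
    return values
```

Because each sample only sees gradients of samples that precede it in the permutation, its own label does not leak into its leaf value, which is the point of the ordered scheme.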

S3. The interaction probability of each ligand-receptor pair is predicted using the natural gradient boosting model LRI-NGBoost.

The LRI-NGBoost model consists of three parts: a base learner ($f$), a parametric probability distribution ($P_\theta$), and a scoring rule ($S$). For a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, where the parameter $\theta$ is obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners. For a normal distribution with parameters $\mu$ and $\log \sigma$, each stage $m$ has two base learners, $f_\mu^{(m)}$ and $f_{\log \sigma}^{(m)}$, so that $f^{(m)} = (f_\mu^{(m)}, f_{\log \sigma}^{(m)})$.
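To make the role of the base learners concrete, the following sketch applies the standard NGBoost update $\theta = \theta^{(0)} - \eta \sum_{m=1}^{M} \rho^{(m)} f^{(m)}(x)$ to the two parameters of a normal distribution. That LRI-NGBoost uses precisely this update is an assumption here; the function name `ngboost_params` and the toy base learners in the usage line are hypothetical.

```python
def ngboost_params(x, theta0, stages, eta=0.01):
    """Return (mu, log_sigma) after applying all boosting stages.

    theta0: initial parameters (mu_0, log_sigma_0).
    stages: list of (rho_m, f_mu_m, f_log_sigma_m), where rho_m is a single
            scalar scaling factor and the two callables are the base
            learners of stage m.
    eta:    learning rate shared by all stages.
    """
    mu, log_sigma = theta0
    for rho, f_mu, f_log_sigma in stages:
        mu -= eta * rho * f_mu(x)
        log_sigma -= eta * rho * f_log_sigma(x)
    return mu, log_sigma
```

For example, `ngboost_params(3.0, (0.5, 0.0), [(2.0, lambda x: 1.0, lambda x: -1.0)], eta=0.1)` moves $\mu$ down and $\log \sigma$ up by $\eta \rho^{(1)} = 0.2$ each.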

S4. A deep forest algorithm (LRI-DF) is used to assign the biological features of each ligand-receptor pair to a positive class and a negative class; the average interaction probability of each class is calculated, and the class with the larger average is selected as the final class.

Random forests and extra trees are selected as base classifiers, and each cascade layer consists of 2 random forests and 2 extra trees. Each predictor consists of 100 decision trees. For a ligand-receptor interaction feature, each predictor in each layer calculates the proportion of feature samples belonging to the positive and negative classes. The class probabilities from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and serves as the input to the next layer of the deep forest.

A new layer is added to the model when the prediction performance is better than that of all previous layers, and training terminates when the following two layers bring no improvement. Finally, for each ligand-receptor pair, the interaction probabilities for the positive class and for the negative class are averaged separately, and the class with the larger average interaction probability is taken as the final class.

Finally, we obtained the final classification of each ligand-receptor pair by integrating the results of LRI-CatBoost, LRI-NGBoost, and LRI-DF.
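The cascade bookkeeping of step S4 can be illustrated with a small sketch. It does not train any forests: the per-predictor probabilities are taken as given, each written as a `(negative, positive)` pair, and the stopping rule is one possible reading of "no improvement over the following two layers". All function names are hypothetical.

```python
def next_layer_input(features, predictor_probs):
    """Concatenate the original features with the class vector of a layer."""
    class_vector = [p for probs in predictor_probs for p in probs]
    return list(features) + class_vector


def final_class(predictor_probs):
    """Average each class over all predictors and return the larger one."""
    n = len(predictor_probs)
    mean_neg = sum(p[0] for p in predictor_probs) / n
    mean_pos = sum(p[1] for p in predictor_probs) / n
    return 1 if mean_pos > mean_neg else 0


def layers_to_keep(layer_scores, patience=2):
    """Depth of the best layer; stop after `patience` layers without gain."""
    best, best_depth, stale = float("-inf"), 0, 0
    for depth, score in enumerate(layer_scores, start=1):
        if score > best:
            best, best_depth, stale = score, depth, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_depth
```

With 2 random forests and 2 extra trees per layer and two classes, the class vector appended by `next_layer_input` has 8 entries.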

S5. The known and predicted ligand-receptor interactions are filtered: if the ligand or the receptor of a ligand-receptor interaction is not expressed in the cells of the single-cell sequencing data, that interaction is excluded from the corresponding cell communication.
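The filtering rule of step S5 amounts to a simple membership test, sketched below. The data layout (a dictionary from gene name to per-cell expression values) and the cutoff `min_expr`, above which a gene counts as expressed, are assumptions made for illustration.

```python
def filter_interactions(pairs, expression, min_expr=0.0):
    """Keep ligand-receptor pairs whose ligand and receptor are both expressed.

    pairs:      iterable of (ligand, receptor) gene names.
    expression: dict mapping gene name -> list of expression values per cell.
    min_expr:   assumed cutoff; a gene is "expressed" if any cell exceeds it.
    """
    def expressed(gene):
        return any(v > min_expr for v in expression.get(gene, []))

    return [(lig, rec) for lig, rec in pairs if expressed(lig) and expressed(rec)]
```

A gene that is absent from the expression data is treated as not expressed, so any pair that contains it is dropped.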

S6. The final communication score is calculated from the filtered ligand-receptor interactions, the single-cell sequencing data, and a scoring method.

The scoring method adopts a combination of an expression product method and an expression threshold value method.

(1) Expression product method: the interaction score of ligand $i$ and receptor $j$ between two cell types $k_1$ and $k_2$ is predicted, where

denotes the arithmetic mean of the expression of ligand $i$ and receptor $j$ in the given cell type:

The cell communication score $f_1(k_1, k_2)$ between $k_1$ and $k_2$ can then be calculated as:

(2) Expression threshold method: the interaction score of ligand $i$ and receptor $j$ between two cell types $k_1$ and $k_2$ is predicted, where $\sigma_i$ and $\sigma_j$ denote standard deviations:

The cell communication score $g_1(k_1, k_2)$ between $k_1$ and $k_2$ can then be calculated as:

The cell communication scores $f_1(k_1, k_2)$ and $g_1(k_1, k_2)$ between $k_1$ and $k_2$, calculated by the expression product method and the expression threshold method, are combined to obtain the final cell communication score. That is, the cell communication score between $k_1$ and $k_2$ can be calculated by the following formula:
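As a rough illustration of the two scoring ideas in step S6, and not a reproduction of the formulas of the invention, the sketch below makes two explicit assumptions: the expression product score sums, over all ligand-receptor pairs, the mean ligand expression in $k_1$ times the mean receptor expression in $k_2$; and the expression threshold score counts the pairs in which both means exceed a per-gene cutoff of mean plus one standard deviation over all cells. How the two scores are combined is not shown. The data layout and function names are hypothetical.

```python
from statistics import mean, pstdev


def product_score(pairs, expr, k1, k2):
    """Sum of mean ligand expression in k1 times mean receptor expression in k2.

    expr: dict gene -> dict cell_type -> list of expression values.
    """
    return sum(mean(expr[lig][k1]) * mean(expr[rec][k2]) for lig, rec in pairs)


def threshold_score(pairs, expr, k1, k2):
    """Count pairs whose ligand (in k1) and receptor (in k2) pass their cutoff.

    Assumed cutoff per gene: mean + standard deviation over all of its cells.
    """
    def cutoff(gene):
        values = [v for cells in expr[gene].values() for v in cells]
        return mean(values) + pstdev(values)

    return sum(1 for lig, rec in pairs
               if mean(expr[lig][k1]) > cutoff(lig)
               and mean(expr[rec][k2]) > cutoff(rec))
```

Both scores are directional: swapping `k1` and `k2` scores the receptor-bearing cell type as the sender, so the result generally changes.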

Example 2

This example compares the cell communication prediction algorithm of the invention with four representative protein interaction prediction methods: an extreme gradient boosting algorithm, a support vector machine, a distributed gradient boosting framework based on decision tree algorithms, and a recurrent convolutional neural network algorithm based on ordinal regression. Performance is evaluated by 20 repetitions of 5-fold cross-validation, with AUC and AUPR as evaluation metrics; higher AUC and AUPR values indicate better performance.

The parameters of the extreme gradient boosting algorithm, the support vector machine, and the distributed gradient boosting framework based on decision tree algorithms are set to their default values. For the recurrent convolutional neural network algorithm based on ordinal regression, the parameters are set as follows: learning_rate = 0.01, n_estimators = 20, max_depth = 3, criterion = friedman_mse, loss = default, min_samples_split = 2. For the cell communication prediction algorithm of the invention, boosting_type, max_depth, and n_estimators in LRI-CatBoost are set to Ordered, 10, and 2000, respectively; the learning rate, natural gradient, frac, and eval in LRI-NGBoost are set to 0.01, true, 1.0, and 100, respectively; and in LRI-DF, n_trees is set to 100 and predictor is set to forest. The dimension of the ligand-receptor interaction feature vector after dimensionality reduction is set to 300.
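The evaluation protocol (20 repetitions of 5-fold cross-validation scored by AUC) can be sketched in plain Python as follows. The shuffling scheme, the seed, and the pairwise AUC computation are illustrative choices, not details taken from the experiment; AUPR is omitted for brevity.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=20, seed=0):
    """Yield (train_idx, test_idx) for n_repeats shuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::n_splits] for f in range(n_splits)]
        for f in range(n_splits):
            train = [i for g in range(n_splits) if g != f for i in folds[g]]
            yield train, folds[f]


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

With the default arguments the generator yields 100 train/test splits; the reported metric would then be the mean AUC over those splits.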

In this experiment, we collected four different ligand-receptor interaction data sets. Data sets 1 and 2 are both from the CellTalk database. Data set 3 was constructed by Skelly et al. Data set 4 was constructed by Ximerakis et al. Details of the data sets are given in Table 1 below:

TABLE 1

Data set	Ligands	Receptors	Ligand-receptor interactions
Data set 1	812	780	3390
Data set 2	650	588	2031
Data set 3	574	559	2006
Data set 4	1129	1335	6585

The performance obtained by the different methods described above is shown in Table 2 below:

TABLE 2

As can be seen from Table 2 above and FIGS. 3-4, the ligand-receptor interaction prediction algorithm of the invention achieves the best AUC on the four data sets, namely 0.8533, 0.8316, 0.8150, and 0.8434, which are 1.39%, 3.29%, 3.59%, and 1.89% higher, respectively, than those of the second-best method, the distributed gradient boosting framework based on decision tree algorithms. It also achieves the best AUPR on the four data sets, namely 0.8681, 0.8442, 0.8259, and 0.8632, which are 1.11%, 2.19%, 2.11%, and 1.54% higher, respectively, than those of the second-best method. The ligand-receptor interaction prediction algorithm of the invention, LRI-CNbDP, thus obtains the best AUC and AUPR on all four data sets used in the experiment, demonstrating strong ligand-receptor interaction prediction performance.

Example 3

This example provides a practical application of the scheme of the invention. The relevant sequencing data of human head and neck squamous cell carcinoma tissue were downloaded from the GEO database; the cell types include head and neck squamous cell carcinoma cells, fibroblasts, B cells, muscle cells, macrophages, endothelial cells, T cells, dendritic cells, and mast cells. A cell communication network related to head and neck squamous cell carcinoma was established by combining the filtered ligand-receptor interactions of the invention with the single-cell sequencing data, and cell communication in the human tissue was predicted. As shown in FIGS. 5-7, the method of the invention found that, in human head and neck squamous cell carcinoma tissue, the communication intensity between fibroblasts and head and neck squamous cell carcinoma cells was higher.

Example 4

This example provides a practical application of the scheme of the invention. The relevant sequencing data of human breast cancer tissue were downloaded from the GEO database, and a cell communication network related to breast cancer was established by combining the filtered ligand-receptor interactions of the invention with the single-cell sequencing data, so as to predict cell communication in the human tissue. As shown in FIGS. 8-10, the probability of communication between immune cells and breast cancer cells was higher in human breast cancer tissue.

It should be understood that the above embodiments are merely examples given to illustrate the invention clearly and are not intended to limit its implementation. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A cell communication prediction method based on Boosting, deep forest and single cell sequencing data is characterized by comprising the following steps:

S5. filtering the known and predicted ligand-receptor interaction data;

2. The method of cellular communication prediction based on Boosting and deep forest and single cell sequencing data of claim 1, wherein the biological features include 400-dimensional monoMono, 8000-dimensional monoDi, 8000-dimensional diMono, 147-dimensional CTD and 80-dimensional PseudoAAC.

3. The cell communication prediction method based on Boosting and deep forest and single cell sequencing data according to claim 1, wherein the extreme gradient algorithm is as follows:

where $i$ indexes the samples, $I_L$ denotes the samples in the left node space, $g_i$ is the first partial derivative, $h_i$ is the second partial derivative, and $\lambda$ and $\gamma$ are regularization parameters.

4. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data according to claim 1, wherein the step of classifying the LRI-Catboost algorithm comprises the steps of:

S21. building a decision tree with a top-down greedy algorithm, wherein each decision rule $r$ consists of a feature index $i \in \{1, \ldots, l\}$ and a threshold $v \in \mathbb{R}$; at each level of the tree, the decision rule $r$ partitions $k$ disjoint sets into $2k$ disjoint subsets, with $k = 2^{k'}$ for a complete binary tree with $k'$ levels; a set of feature vectors $X \subseteq \mathbb{R}^l$ is divided into two disjoint subsets, $X^L$ and $X^R$; and for each $x \in X$, LRI-CatBoost determines its class from these two subsets:

S22. given sets $X_1, \ldots, X_k$ and an objective function $t: \mathbb{R}^l \to \mathbb{R}$, defining the splitting rule as:

S23. obtaining prediction models $M_{i,j}$, where $M_{i,j}(i)$ denotes the prediction for the $i$-th sample based on the first $j$ samples in the permutation $\sigma_r$, and, in each iteration $t$, constructing a tree $T_t$ from $\{\sigma_1, \ldots, \sigma_S\}$ and calculating its gradient:

S24. calculating the gradient $\mathrm{grad}_{r,\sigma(i)-1}(i)$ of each sample $i$; when all possible pairs of contributions have been predicted, obtaining the leaf value of sample $i$ as the average of the gradients $\mathrm{grad}_{r,\sigma(i)-1}$ of the samples that previously fell in the same leaf as sample $i$; and, once the tree structure $T_t$ is established, classifying the unknown ligand-receptor interactions.

5. The method for predicting cell communication based on Boosting and deep forest and single cell sequencing data according to claim 1, wherein M can be defined as:

where

denotes the set of target scores of the samples in $X_i$.

6. The cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data according to claim 1, wherein, for a sample $x$, LRI-NGBoost predicts its label $y$ through the conditional distribution $P_\theta$, the parameter $\theta$ being obtained from the initial $\theta^{(0)}$ and the outputs of $M$ base learners, and wherein, for a normal distribution with parameters $\mu$ and $\log \sigma$, each stage $m$ has two base learners, $f_\mu^{(m)}$ and $f_{\log \sigma}^{(m)}$.

7. The cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data according to claim 1, wherein random forests and extra trees are selected as base classifiers; for a ligand-receptor interaction feature, each predictor in each layer calculates the proportion of feature samples belonging to the positive and negative classes; the class probabilities obtained from all predictors form a class vector, which is concatenated with the original ligand-receptor interaction feature vector and used as the input of the next layer of the deep forest;

a new layer is added to the model when the prediction performance is better than that of all previous layers; training terminates when the following two layers bring no improvement; and finally, for each ligand-receptor pair, the interaction probabilities for the positive class and for the negative class are averaged separately, and the class with the larger average interaction probability is taken as the final class.

8. The cell communication prediction method based on Boosting, deep forest and single cell sequencing data as claimed in claim 1, wherein the scoring method is a combination of an expression product method and an expression threshold method, and the cell communication score calculation method is as follows:

where $f_1(k_1, k_2)$ is the cell communication score calculated by the expression product method and $g_1(k_1, k_2)$ is the cell communication score calculated by the expression threshold method.

9. The cell communication prediction method based on Boosting, deep forest, and single-cell sequencing data according to claim 8, wherein the cell communication score calculated by the expression product method is:

where

is the communication strength score between cell types $k_1$ and $k_2$ calculated by the expression product method, and

is the communication strength score between cell types $k_1$ and $k_2$ calculated by the expression threshold method.

10. The method of claim 1, applied to cellular communication prediction in human tumor microenvironment.