CN115359840B - Method for identifying key regulatory factors for branch point cell fate decisions in lineage trees - Google Patents


Publication number
CN115359840B
Authority
CN
China
Prior art keywords
cell
tree
lineage
data
alternative splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211042461.6A
Other languages
Chinese (zh)
Other versions
CN115359840A (en
Inventor
徐云刚
郭茂祖
杨娟
邹权
郭琛
李如风
姚宇飞
李亚晨
李月森
邵锦瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202211042461.6A
Publication of CN115359840A
Application granted
Publication of CN115359840B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the technical field of biology, and in particular relates to a method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, comprising the following steps: obtaining high-dimensional data for all cell types in the lineage tree; establishing a manifold learning-based computational model, REFIT, which maps the high-dimensional data of the multiple cell types in a tree structure to a low-dimensional space; and inputting high-throughput sequencing data into the REFIT model to identify key regulatory factors for branch point cell fate decisions in the lineage tree. The invention represents each cell type as a set of points, one per alternative splicing event, in a high-dimensional space composed of multiple epigenetic modifications and RNA sequence features, and then identifies key regulatory factors for branch point cell fate decisions in the lineage tree by manifold learning dimensionality reduction.

Description

Method for identifying key regulatory factors for branch point cell fate decisions in lineage trees
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a method for identifying key regulatory factors for determining cell fate of branch points in lineage trees.
Background
A cell lineage tree is a tree structure that records the process of cell differentiation and the relationships among the daughter cells it produces. The lineage tree is one of the most important phenotypes of multicellular organisms, and provides an efficient data representation and analysis framework for tracking cell division, differentiation and temporal changes of cell state; it is not only the key to solving a number of important development-related problems in the life sciences, but also an important means for developing computational techniques and informatics methods to study biological development. Among lineage trees, stem cell lineage trees are the most important tools for studying development and cell fate decisions.
Stem cells are cells with multilineage differentiation potential that can differentiate into cell populations with different morphological structures and functional characteristics, and thereby form human tissues, organs, and systems. Sustained self-renewal and multilineage differentiation of stem cells are the basis of tissue and organ formation and of ontogenesis. Stem cells are not only important research objects of developmental biology, but also have broad prospects in clinical application. How stem cells determine their fate during differentiation, and the regulatory mechanisms behind that decision, is one of the most critical questions in stem cell development and organ regeneration. Therefore, comprehensive and systematic research on the fate decision mechanism of stem cells helps deepen the understanding of organogenesis and ontogenesis, and provides a theoretical basis for cell engineering, regenerative medicine and their clinical application.
With the continuing development of cell lineage tree determination technology, especially the combined application of gene editing and single-cell sequencing in recent years, cell lineage tree data are accumulating rapidly. Cell state modeling and dynamic transition analysis based on the lineage tree are therefore particularly important, because the cell lineage tree is key to studying the cell fate decision mechanism and is directly related to research in developmental biology and to clinical application. Multi-omics high-throughput sequencing technologies, particularly the single-cell sequencing technologies developed in recent years, provide valuable resources for using information technology to study cell differentiation and fate decisions.
In view of the important role of alternative splicing in stem cell self-renewal, directed differentiation, elucidation of its precise regulatory mechanisms will help to further reveal stem cell fate decisions and provide a theoretical basis for cell and tissue engineering and regenerative medicine. Epigenetic modifications provide epigenetic memory for the splicing pattern, enabling the splicing pattern to be transferred during stem cell self-renewal; meanwhile, when stem cells are directionally differentiated and a new splicing mode is needed, the memory can be modified without establishing a new splicing rule, and specific splicing results can be obtained.
The key to studying the cell differentiation fate decision mechanism in lineage trees is to reveal the key regulatory factors that determine the direction of differentiation (branch selection). Traditional studies of cell fate decisions have focused on predicting cell fate between self-renewal and directed differentiation, which applies only to simple lineage trees. For massive, complex multi-omics high-throughput sequencing data and for more complex lineage trees, how to learn a low-dimensional representation from the high-dimensional multi-omics data, and how to derive from it the key regulatory factors for branch point cell fate decisions, is the technical problem to be solved.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method for identifying key regulatory factors for determining the fate of branch point cells in lineage trees.
The object of the present invention is to provide a method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, comprising:
obtaining high-dimensional data of all cell types in the lineage tree, the high-dimensional data being high-throughput sequencing data of cells;
establishing a manifold learning-based computing model REFIT, wherein the REFIT is used for mapping high-dimensional data of various cell types in a tree structure to a low-dimensional space;
inputting the high-throughput sequencing data into the REFIT model to identify key regulatory factors for branch point cell fate decisions in the lineage tree.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, REFIT is used to obtain a tree-shaped manifold graph whose points are alternative splicing events, wherein each point corresponds to one alternative splicing event and its spatial position reflects its key epigenetic regulatory factors.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the differentiation path between any parent cell and its child cell is found through the alternative splicing events on the tree-shaped manifold graph, and the points (key regulatory factors) that play a key role in cell differentiation fate decisions are determined.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the lineage tree is a human embryonic stem cell differentiation lineage tree, a hematopoietic stem cell differentiation lineage tree, or an induced pluripotent stem cell differentiation lineage tree.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the high-throughput sequencing data include high-throughput sequencing data of the genome, the transcriptome and the epigenome.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the high-dimensional data are high-throughput sequencing data of histone modifications and RNA sequences.
Preferably, the method for identifying key regulatory factors for branch point cell fate decisions in lineage trees described above pre-processes the high throughput sequencing data prior to constructing a computational model REFIT.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the RNA-seq data are preprocessed as follows: RNA-seq data for each cell type in the lineage tree are obtained, the alternative splicing events of each cell type are detected using the rMATS software, the percent spliced in (PSI) of each alternative splicing region is recorded, and the alternative splicing events identified in all cell types are pooled so that every cell type in the lineage tree shares one identical list of alternative splicing events.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the ChIP-seq data are preprocessed as follows: histone modification ChIP-seq data aligned to the genome are obtained for each cell type in the lineage tree, histone modification signal peaks are detected for each cell type using the MACS2 software, and, for each histone modification, the signal peaks detected in all cell types are merged to obtain a unified list of histone modification signal peaks;
based on the distance d between a signal peak and the 5' splice site and the height h of the peak, the intensity of a histone modification in an alternative splicing region is defined as hm = h/d; after the ChIP-seq data of each cell type are processed in the same way, one data table is obtained per cell type, in which each row represents an alternative splicing event, each column represents a histone modification type, and each value represents the signal peak intensity of a given histone modification in a given alternative splicing region.
Preferably, in the above method for identifying key regulatory factors for branch point cell fate decisions in lineage trees, the sequence data are preprocessed as follows: the alternative splice site and the 150 bp intervals upstream and downstream of it are selected, the base sequence of this interval is extracted from the genome FASTA file, and the base sequence is converted into a binary 4×n matrix by one-hot encoding, where n is the length of the sequence.
Compared with the prior art, the invention has the following beneficial effects:
To fully describe the multi-generation differentiation process of stem cells, the invention takes into account the structural information of the human embryonic stem cell lineage tree, the hematopoietic stem cell lineage tree and the lineage tree of iPSCs differentiating into neural cells. It develops a systematic bioinformatics computational method that studies the low-dimensional representation of the high-dimensional multi-omics data in multi-generation stem cell differentiation lineage trees, identifies on that basis the key regulatory factors for branch point cell fate decisions, and further reveals the alternative splicing regulatory mechanisms involved in cell fate decisions at key nodes.
We propose the following hypothesis: given the dynamic changes of the splicing patterns and their regulatory code during differentiation, the lineage tree structure, and, more importantly, the fact that this information is embodied in high-dimensional observations, manifold learning can be used to map such high-dimensional data into a low-dimensional space while preserving their inherent geometric constraints, thereby facilitating identification of the most critical regulatory factors involved in fate decisions.
The invention takes into account the topology of the lineage tree and the parent-child relationships among cells, integrates high-throughput multi-omics data, maps the high-dimensional data into a low-dimensional space while preserving their inherent geometric constraints, and thereby identifies key regulatory factors for branch point cell fate decisions in the lineage tree.
The invention addresses the specific nature of stem cell fate decisions, in particular branch point cell fate decisions, i.e., the question of which daughter cell type a cell adopts after directed differentiation. It represents each cell type as a set of points, one per alternative splicing event, in a high-dimensional space formed by multiple epigenetic modifications and RNA sequence features, then uses manifold learning dimensionality reduction to map the cell differentiation process expressed in the high-dimensional space onto a manifold in a 2-dimensional space, and constrains the manifold structure of the cell types to be as close as possible to the original lineage tree structure. Centered on the tree structure, this forms a complete bioinformatics analysis framework for systematically studying the splicing epigenetic code and the fate decision mechanisms of stem cell differentiation in complex lineage trees, and provides a theoretical basis for stem cell-based regenerative medicine and its clinical application.
Drawings
FIG. 1 is a schematic diagram of a lineage tree;
FIG. 2 shows the hematopoietic stem cell differentiation lineage tree (A) and the induced pluripotent stem cell differentiation lineage tree (B);
in FIG. 2, C shows the histone modification and alternative splicing types of hematopoietic stem cells, and D shows the histone modification and alternative splicing types of induced pluripotent stem cells;
FIG. 3 shows the raw data and their preprocessing methods:
A, cell differentiation lineage tree; B, RNA-seq data; C, ChIP-seq data; D, RNA sequence data;
FIG. 4 shows the common alternative splicing types (A) and cell fate decisions (B);
FIG. 5 shows REFIT manifold learning (A) and the identification of key regulatory factors for cell fate decisions (B).
Detailed Description
In order that those skilled in the art will better understand the technical scheme of the present invention, the present invention will be further described with reference to specific embodiments and drawings.
In the description of the present invention, unless otherwise specified, all reagents are commercially available and methods are conventional in the art.
Prior-art studies of cell fate decisions are applicable only to predicting the fate of a pair of cells between self-renewal and directed differentiation. For more complex multi-layer cell differentiation lineage trees and massive high-dimensional multi-omics high-throughput sequencing data, the prediction of multi-layer differentiation and fate decisions based on the lineage tree remains to be solved, so that the splicing and epigenetic mechanisms by which stem cells differentiate into different tissue cells can be studied more systematically. The key scientific problem addressed by the present invention is therefore to propose a new computational method for identifying key regulatory factors for branch point cell fate decisions on complex lineage trees. The specific method comprises the following steps:
1. Experimental data and preprocessing
(1) Cell differentiation lineage tree
The individual cell types and their interrelationships during cell differentiation are represented as a tree, called the cell lineage tree. As with the definition of a traditional tree, a lineage tree is a directed acyclic graph (DAG). Referring to FIG. 1, a lineage tree records the cells produced by differentiation at each particular time during development, as well as the precursor cells that produced them; each node of the tree represents a cell type, each bifurcation (fork) of the tree represents a cell division event, each branch of the tree represents a differentiation event, and the terminal leaf nodes of the tree represent the terminally differentiated cells of the adult organism. For the complete cell lineage tree of an organism, the root node represents the fertilized egg (zygote). However, for complex organisms (such as humans and mice), a full view of the entire developmental process and its lineage tree has not yet been obtained; thus, what is commonly called a lineage tree may describe only a small portion or stage of the complete developmental process, i.e., a subtree representing a local developmental process or stage (e.g., the branches in the left-hand dashed box of FIG. 1), whose root node is usually a cell with differentiation potential, such as a stem cell or precursor (progenitor) cell. Furthermore, a lineage tree is not necessarily a classical binary tree, i.e., some precursor cells can divide into more than two cell types (e.g., the lower right branch of FIG. 1).
The lineage tree is one of the inputs of the present invention and is defined as T(V, E), where V denotes the nodes of the tree, i.e., the cell types, and E denotes the edges of the tree, i.e., the differentiation paths. Every node other than a leaf node (a terminally differentiated cell), i.e., every cell in an intermediate differentiation state, has out-degree ≥ 1, where the out-degree is the number of child nodes (next-level nodes) of a node. The root node has in-degree 0 and every other node has in-degree ≥ 1, where the in-degree is the number of parent nodes (previous-level nodes). Thus, some precursor cells may divide into more than two daughter cell types, and some differentiated cells may derive from different parent cell types (as shown in FIG. 1 or FIG. 2).
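The definition of T(V, E) and its degree constraints can be illustrated with a minimal sketch. The function names and the toy edge list below are hypothetical and serve only as an example; the patent does not prescribe a data structure.

```python
from collections import defaultdict

def build_lineage_graph(edges):
    """Build child and parent adjacency maps from directed (parent, child) edges."""
    children = defaultdict(list)
    parents = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        parents[child].append(parent)
        nodes.update((parent, child))
    return nodes, children, parents

def out_degree(node, children):
    """Number of child nodes (next-level cell types)."""
    return len(children.get(node, []))

def in_degree(node, parents):
    """Number of parent nodes (previous-level cell types)."""
    return len(parents.get(node, []))

# Hypothetical toy lineage; the cell-type labels are illustrative only.
EDGES = [("HSC", "MPP"), ("MPP", "CMP"), ("MPP", "CLP"),
         ("CMP", "GMP"), ("CMP", "MEP"), ("CLP", "B_cell")]

nodes, children, parents = build_lineage_graph(EDGES)
roots = sorted(n for n in nodes if in_degree(n, parents) == 0)
leaves = sorted(n for n in nodes if out_degree(n, children) == 0)
# Branch points are nodes with two or more children.
branch_points = sorted(n for n in nodes if out_degree(n, children) >= 2)
```

Because in-degrees greater than 1 are allowed, the same maps also cover the case where a differentiated cell type derives from more than one parent cell type.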
Embodiments of the present invention will use human embryonic stem cell (hESC) differentiation lineage trees, hematopoietic Stem Cell (HSC) differentiation lineage trees, and Induced Pluripotent Stem Cell (iPSC) differentiation lineage trees for research and validation of subsequent computational methods (fig. 2).
(2) Multi-omics data and preprocessing
The present invention uses the transcriptome, epigenome and genome data of all cell types in the lineage tree as the initial data source; after appropriate preprocessing, these data are used for subsequent computational model construction and analysis.
The preprocessing method of each data is summarized as shown in fig. 3, and specifically comprises the following steps:
1) RNA-seq data and alternative splicing assays
RNA-seq data, i.e., BAM/SAM files (including biological replicates), aligned to the transcriptome are obtained for each cell type in the lineage tree. Alternative splicing events are detected for each cell type in the lineage tree using the rMATS software, and the percent spliced in (PSI) of each alternative splicing region (exon or retained intron) is recorded. The alternative splicing events identified in all cell types are pooled so that every cell type in the lineage tree shares one identical list of alternative splicing events (FIG. 3A, 3B). As shown in FIG. 4A, alternative splicing events can be classified into 7 types according to how they arise. For simplicity of description, the scheme is described below taking the skipped exon (SE) type of alternative splicing as an example.
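The pooling step can be sketched as follows. This is a minimal illustration that assumes the PSI values have already been parsed into one dictionary per cell type; parsing of rMATS output is not shown. The function name, the event identifiers and the use of `None` for events not detected in a cell type are assumptions, since the source does not state how missing events are handled.

```python
def pool_events(psi_by_cell_type):
    """Union the event IDs of all cell types and align every cell type to that list.

    psi_by_cell_type: {cell_type: {event_id: psi}}
    Returns (all_events, table) where table[cell_type] is a list of PSI values
    ordered like all_events, with None for events not detected in that cell type.
    """
    all_events = sorted({event for psi in psi_by_cell_type.values() for event in psi})
    table = {
        cell_type: [psi.get(event) for event in all_events]
        for cell_type, psi in psi_by_cell_type.items()
    }
    return all_events, table

# Hypothetical PSI values for two cell types.
PSI_BY_CELL_TYPE = {
    "HSC": {"SE_1": 0.8, "SE_2": 0.1},
    "MPP": {"SE_2": 0.4, "SE_3": 0.9},
}

events, psi_table = pool_events(PSI_BY_CELL_TYPE)
```

After pooling, every cell type has one value slot per event in the shared list, which is what lets the later steps compare the same event across the tree.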
2) Histone modification (ChIP-seq) data processing (epigenetic group data)
Aligned histone modification ChIP-seq data, i.e., BAM/SAM files (including biological replicates), are obtained for each cell type in the lineage tree. Histone modification signal peaks (narrow peaks) are detected for each cell type in the lineage tree using the MACS2 software. For each histone modification, the signal peaks detected in all cell types are merged, resulting in a unified list of histone modification signal peaks. Based on the distance (d) of a signal peak from the 5' splice site (SS) and the peak height (h), we define the intensity of a histone modification in an alternative splicing region as hm = h/d (FIG. 3C). After the ChIP-seq data of each cell type are processed in the same way, one data table is obtained per cell type, in which each row represents an alternative splicing event, each column represents a histone modification type, and each value represents the signal peak intensity of a given histone modification in a given alternative splicing region.
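The definition hm = h/d and the resulting event-by-modification table can be sketched as below. Several details are assumptions made for illustration, not statements of the patent: all coordinates are taken to lie on one chromosome, the nearest peak is used when a modification has several peaks, a modification with no peaks contributes 0, and a `min_distance` floor avoids division by zero when a peak sits exactly on the splice site.

```python
def histone_intensity(peak_height, peak_pos, splice_site_pos, min_distance=1):
    """hm = h / d, with d floored at min_distance (an assumed safeguard)."""
    d = max(abs(peak_pos - splice_site_pos), min_distance)
    return peak_height / d

def build_hm_table(splice_sites, peaks_by_mark):
    """Build a table with one row per splicing event and one column per mark.

    splice_sites: {event_id: position of the 5' splice site}
    peaks_by_mark: {mark: [(peak_position, peak_height), ...]}
    """
    marks = sorted(peaks_by_mark)
    table = {}
    for event, ss in splice_sites.items():
        row = []
        for mark in marks:
            peaks = peaks_by_mark[mark]
            if not peaks:
                row.append(0.0)  # assumed value when no peak was called
                continue
            # Assumption: use the peak closest to the 5' splice site.
            pos, height = min(peaks, key=lambda p: abs(p[0] - ss))
            row.append(histone_intensity(height, pos, ss))
        table[event] = row
    return marks, table

# Hypothetical positions and peak heights.
SPLICE_SITES = {"SE_1": 1000, "SE_2": 5000}
PEAKS = {
    "H3K4me3": [(1200, 50.0), (4000, 10.0)],
    "H3K36me3": [(4900, 20.0)],
}

marks, hm_table = build_hm_table(SPLICE_SITES, PEAKS)
```

In practice the peak positions and heights would come from the peak caller's output for each cell type; how those files are read is outside this sketch.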
3) RNA sequence data
To use genomic sequence information, we select the alternative splice site and the 150 bp intervals upstream and downstream of it, extract the base sequence of this interval from the genome FASTA file, and convert the base sequence into a binary 4×n matrix by one-hot encoding, where n is the length of the sequence (FIG. 3D).
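The one-hot step can be sketched as follows. The row order A, C, G, T, the all-zero column for any other character (such as N), and the 0-based half-open window are assumptions chosen for the example; reading the FASTA file is not shown.

```python
BASES = "ACGT"  # assumed row order of the 4 x n matrix

def flank_window(sequence, site, flank=150):
    """Return the bases in [site - flank, site + flank), clipped to the sequence ends."""
    start = max(site - flank, 0)
    end = min(site + flank, len(sequence))
    return sequence[start:end]

def one_hot(seq):
    """Encode a base sequence as a binary 4 x n matrix (list of 4 rows of length n).

    Characters other than A, C, G, T produce an all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]

matrix = one_hot("ACGTN")
window = flank_window("A" * 400, 200, flank=150)
```

Each column of the matrix has at most one 1, so the encoding keeps the sequence order while giving every base an equal, unordered representation.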
2. Identification of key regulatory factors for cell fate decisions in lineage trees based on manifold learning
Existing research models still have difficulty pinpointing specific, critical regulatory factors (i.e., an alternative splicing event plus a histone modification) for cell fate decisions in lineage trees, and are therefore difficult to use for downstream biological experimental validation. The key scientific problem in meeting this need is dimensionality reduction of the data.
A stem cell differentiation lineage tree differs from the single study object of a traditional study in that it contains multiple interrelated cell types whose relationships form a tree structure, so the dimensionality reduction should be applicable to all cell types in the tree simultaneously. That is, the data of each cell type can be regarded as one of several interrelated sub-datasets within a larger dataset. The dimensionality reduction must preserve both the geometric relationships inside each sub-dataset and the tree-structured relationships among the different sub-datasets. Among the many dimensionality reduction methods, manifold learning has these properties and is therefore widely used for cell differentiation trajectory inference based on single-cell sequencing.
We assume here that the high-dimensional data of multiple cell types related by a tree structure have a certain geometry in the high-dimensional space, i.e., that they are concentrated near a low-dimensional manifold. We therefore consider that, for the alternative splicing and epigenetic modification data of the cell differentiation process, a low-dimensional manifold corresponding to the tree structure exists in 2-dimensional space and can be learned. On this basis, we propose a manifold learning-based computational method, REFIT (Regulatory Factor Identification on Tree using manifold learning), to map the high-dimensional data of the multiple cell types in a tree structure to a low-dimensional space and thereby discover the alternative splicing events and their epigenetic regulatory factors that play a key role in cell fate decisions (FIG. 5).
(1) Definition of REFIT manifold learning
Each cell type in the lineage tree (a node of the tree) is represented as a set of alternative splicing events (data points) distributed in a high-dimensional space composed of histone modifications (i.e., histone modification ChIP-seq data, whose values represent the intensity of a given histone modification at a specific position) and RNA sequences (left side of FIG. 5A). Manifold learning on these high-dimensional data (middle part of FIG. 5A) therefore means finding a mapping from the high-dimensional space to a low-dimensional hidden space such that the points in the low-dimensional space still represent both the internal properties of each cell type (i.e., its alternative splicing pattern and epigenetic modifications) and the relationships between cells (i.e., the parent-child relationships defined by the tree structure) (right side of FIG. 5A).
In particular, the goal of REFIT is to learn a low-dimensional manifold of the high-dimensional input data, i.e., a set of hidden points $Z = \{z_1, \dots, z_N\}$ and an undirected graph $\mathcal{G}$ connecting these points, where $N$ is the number of alternative splicing events. The hidden points $Z$ in the low-dimensional space correspond to the input data $X = \{x_1, \dots, x_N\}$ in the high-dimensional space. The graph $\mathcal{G}$ in the low-dimensional space consists of a set of vertices $V = \{V_1, \dots, V_N\}$ and weighted edges, each vertex $V_i$ corresponding to one point $z_i$ in the low-dimensional space. For the alternative splicing and epigenetic regulation data studied here, $x_i$ is a feature vector holding the histone modification and RNA sequence features of the $i$-th alternative splicing event. Let $b_{ij}$ denote the weight of edge $(V_i, V_j)$: $b_{ij} > 0$ indicates that the graph $\mathcal{G}$ contains an edge between $V_i$ and $V_j$; otherwise the edge is absent. We define $f_{\mathcal{G}}$ as the reverse mapping function that takes a point $z_i$ back to the original high-dimensional space. Learning $\mathcal{G}$ and $f_{\mathcal{G}}$ amounts to optimizing the following objective function, equation (3):
[Equation (3): figure BDA0003821365660000098, image not reproduced]
where $\mathcal{G}_b$ denotes the set of feasible graph structures and $\mathcal{F}$ denotes the set of functions that can map points of the low-dimensional space back to the original high-dimensional space.
The above optimization learns a graph $\mathcal{G}$ in the low-dimensional space, but it cannot guarantee that the hidden points in the low-dimensional space accurately reflect the distribution and relationships of the observations in the original high-dimensional space. For the graph $\mathcal{G}$ learned in the low-dimensional space to reflect the distribution of the original data in the high-dimensional space, REFIT must ensure that the reverse mapping $f_{\mathcal{G}}(z_i)$ of each hidden point $z_i$ is as close as possible to the corresponding raw data point $x_i$. To this end we add a constraint term, so that the optimization is expressed as:
[Equation: figure BDA0003821365660000913, image not reproduced]
where $\lambda$ is a weight parameter that balances the two terms.
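Since the two equation images are not available in this text, the following is only one form that is consistent with the surrounding definitions. It follows the general shape of reversed graph embedding objectives and should be read as an assumption, not as the patent's exact equations. Here $\mathcal{G}$ is the graph over the hidden points, $f_{\mathcal{G}}$ the reverse mapping to the high-dimensional space, $\mathcal{F}$ the set of admissible mappings, and $\lVert\cdot\rVert$ the Euclidean norm; the choice of squared distances and the placement of $\lambda$ on the graph term are also assumptions.

```latex
% Assumed form of the graph-only objective (role of equation (3)):
\min_{\mathcal{G} \in \mathcal{G}_b} \; \min_{f_{\mathcal{G}} \in \mathcal{F}} \; \min_{Z}
  \sum_{i=1}^{N} \sum_{j=1}^{N} b_{ij} \,
  \bigl\lVert f_{\mathcal{G}}(z_i) - f_{\mathcal{G}}(z_j) \bigr\rVert^{2}

% Assumed form after adding the reconstruction constraint term:
\min_{\mathcal{G} \in \mathcal{G}_b} \; \min_{f_{\mathcal{G}} \in \mathcal{F}} \; \min_{Z}
  \sum_{i=1}^{N} \bigl\lVert x_i - f_{\mathcal{G}}(z_i) \bigr\rVert^{2}
  + \lambda \sum_{i=1}^{N} \sum_{j=1}^{N} b_{ij} \,
  \bigl\lVert f_{\mathcal{G}}(z_i) - f_{\mathcal{G}}(z_j) \bigr\rVert^{2}
```

In this reading, the first term keeps each mapped hidden point close to its observation $x_i$, the second term keeps points joined by an edge close to each other, and $\lambda$ trades one against the other.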
(2) Identification of key regulatory factors for cell fate decisions based on REFIT manifold
Through REFIT manifold learning, a tree-shaped manifold graph whose points are alternative splicing events is obtained (left side of FIG. 5B), in which each point corresponds to one alternative splicing event and its spatial position reflects its key epigenetic regulatory factors, i.e., the principal components of the histone modification and RNA sequence features of the original high-dimensional space. With this tree-shaped graph we can find the differentiation path between any parent cell and its child cell through these alternative splicing events (right side of FIG. 5B) and determine the points that play a key role in differentiation fate decisions (i.e., splicing events and epigenetic modifications). As above, we will apply REFIT to the human HSC lineage and to the lineage of iPSCs induced to differentiate into neural cells shown in FIG. 2, ultimately identifying 1 to 2 key regulatory factors for cell fate decisions in each.
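The path-finding idea can be illustrated with a minimal sketch on a learned tree-shaped graph. The source does not say how the path is computed or how key points are selected, so both the breadth-first search and the rule of flagging path vertices with three or more neighbours are assumptions made purely for illustration; the event identifiers are hypothetical.

```python
from collections import deque

def tree_path(adjacency, source, target):
    """Return the vertex path from source to target in an undirected graph, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if target not in previous:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]

def branch_vertices(adjacency, path):
    """Vertices on the path with three or more neighbours (assumed candidate key points)."""
    return [v for v in path if len(adjacency.get(v, ())) >= 3]

# Hypothetical tree over six splicing events; e2 is where the tree forks.
ADJACENCY = {
    "e1": ["e2"],
    "e2": ["e1", "e3", "e5"],
    "e3": ["e2", "e4"],
    "e4": ["e3"],
    "e5": ["e2", "e6"],
    "e6": ["e5"],
}

path = tree_path(ADJACENCY, "e1", "e4")
candidates = branch_vertices(ADJACENCY, path)
```

In a tree the path between two vertices is unique, so the search returns the only route from a parent-side event to a child-side event; the candidate list then marks where that route passes a fork.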
It should be noted that, where numerical ranges are referred to in the present invention, both endpoints of each range and any value between them may be selected; because the steps and methods used are the same as in the embodiments, only the preferred embodiments are described, to avoid redundancy. While preferred embodiments of the present invention have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method of identifying key regulatory factors for branch point cell fate decisions in lineage trees, comprising:
obtaining high-dimensional data of all cell types in the lineage tree, the high-dimensional data being high-throughput sequencing data of cells;
establishing a manifold learning-based computing model REFIT, wherein the REFIT is used for mapping high-dimensional data of various cell types in a tree structure to a low-dimensional space;
inputting high-throughput sequencing data into a REFIT model, and identifying key regulatory factors for determining the cell fate of branch points in a lineage tree;
wherein REFIT is used to obtain a tree manifold graph whose points are alternative splicing events, each point corresponding to one alternative splicing event and its spatial position reflecting a key epigenetic regulatory factor;
the method maps the high-dimensional data of the cell types in the tree structure to a low-dimensional space using the manifold learning-based computing method REFIT, and finds the alternative splicing events, and their epigenetic regulatory factors, that play a key role in cell fate decisions;
REFIT manifold learning is defined as follows:
each cell type in the lineage tree, i.e., each node of the tree, is represented as a set of alternative splicing events; these alternative splicing events, i.e., data points, are distributed in a high-dimensional space composed of histone modification and RNA sequence features; manifold learning on the high-dimensional data therefore seeks a mapping from the high-dimensional space to a low-dimensional latent space such that points in the low-dimensional space represent the alternative splicing pattern of each cell type, its epigenetic modifications, and the relationships among cells, where the relationships among cells are the parent-child relationships defined by the tree structure;
the goal of REFIT is to learn the low-dimensional manifold of the high-dimensional input data;
the method for identifying key regulatory factors for cell fate decisions based on REFIT manifold is as follows:
a tree manifold graph whose points are alternative splicing events is obtained through REFIT manifold learning, wherein each point corresponds to one alternative splicing event and its spatial position reflects the key epigenetic regulatory factors, namely the principal components of the histone modification and RNA sequence features of the original high-dimensional space; based on the tree manifold graph, the differentiation path between any parent cell and its child cell is found through the alternative splicing events, and the points that play a key role in differentiation fate decisions, namely the splicing events and epigenetic modifications, are determined.
2. The method of claim 1, wherein the method comprises finding the differentiation path between any parent cell and its child cell via alternative splicing events based on the tree manifold graph, and determining the key regulatory factors that play a key role in cell differentiation fate decisions.
3. The method of claim 1, wherein the lineage tree is a human embryonic stem cell differentiation lineage tree, a hematopoietic stem cell differentiation lineage tree, or an induced pluripotent stem cell differentiation lineage tree.
4. The method of identifying key regulatory factors for branch point cell fate decisions in lineage trees according to claim 1, wherein the high-throughput sequencing data includes high-throughput sequencing data of genomes, transcriptomes, and epigenomes.
5. The method of identifying key regulatory factors for branch point cell fate decisions in lineage trees according to claim 4, wherein the high dimensional data is high throughput sequencing data of histone modifications and RNA sequences.
6. The method of claim 5, wherein the high-throughput sequencing data is preprocessed prior to establishing the computational model REFIT.
7. The method of claim 6, wherein the high-throughput sequencing data preprocessing is performed as follows: RNA-seq data for each cell type in the lineage tree is obtained, the alternative splicing events of each cell type in the lineage tree are detected using rMATS software, the percent spliced-in (PSI) value of each alternative splicing region is recorded, and the alternative splicing events identified in all cell types are pooled so that every cell type in the lineage tree shares the same list of alternative splicing events.
8. The method of claim 6, wherein the high-throughput sequencing data preprocessing is performed as follows: obtaining histone modification ChIP-seq data, aligned to the genome, for each cell type in the lineage tree, detecting histone modification signal peaks for each cell type in the lineage tree using MACS2 software, and, for each histone modification, merging the signal peaks detected in all cell types, thereby obtaining a unified list of histone modification signal peaks;
based on the distance d between a signal peak and the 5' splice site and the height h of the peak, the intensity of the histone modification in the alternative splicing region is defined as hm = h/d; after the same processing is applied to the ChIP-seq data of each cell type, a data table is obtained in which each row represents an alternative splicing event, each column represents a histone modification type, and each value represents the signal peak intensity of a given histone modification in a given alternative splicing region.
9. The method of claim 6, wherein the high-throughput sequencing data preprocessing is performed as follows: the alternative splice site and the 150 bp intervals upstream and downstream of it are selected, the base sequence of this interval is extracted from the genome FASTA file, and the base sequence is converted into a binary 4×n matrix using one-hot encoding, wherein n represents the length of the sequence.
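The preprocessing steps recited in claims 7 to 9 can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: it assumes that PSI values and peak measurements have already been extracted from rMATS and MACS2 output into plain Python structures, and the function names, the A/C/G/T row order, the all-zero column for unknown bases, and the rejection of non-positive distances are all assumptions introduced here.

```python
BASES = "ACGT"  # row order is an assumption; the claims only require a 4 x n encoding


def pool_events(psi_by_cell_type):
    """Pool splicing events across cell types so all share one event list.

    psi_by_cell_type maps cell type -> {event_id: PSI}. Events missing in a
    cell type are filled with None (an assumption; the claims do not say how
    missing values are handled).
    """
    events = sorted({e for psi in psi_by_cell_type.values() for e in psi})
    table = {cell: [psi.get(e) for e in events]
             for cell, psi in psi_by_cell_type.items()}
    return events, table


def histone_intensity(height, distance):
    """Histone modification intensity hm = h / d, as defined in claim 8."""
    if distance <= 0:
        # Assumption: a peak exactly at the splice site needs separate handling.
        raise ValueError("distance must be positive")
    return height / distance


def one_hot(sequence):
    """Encode a base sequence as a binary 4 x n matrix (list of four rows)."""
    seq = sequence.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]


# Hypothetical toy inputs.
psi = {"HSC": {"ev1": 0.8, "ev2": 0.1}, "CMP": {"ev2": 0.3, "ev3": 0.5}}
events, table = pool_events(psi)
print(events, table)
print(histone_intensity(12.0, 300))
print(one_hot("ACGTN"))
```

Each row of the pooled table can then be joined with the per-event histone intensities and the one-hot sequence matrix to form the high-dimensional input that the claims describe.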
CN202211042461.6A 2022-08-29 2022-08-29 Method for identifying key regulatory factors for branch point cell fate decisions in lineage trees Active CN115359840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211042461.6A CN115359840B (en) 2022-08-29 2022-08-29 Method for identifying key regulatory factors for branch point cell fate decisions in lineage trees

Publications (2)

Publication Number Publication Date
CN115359840A (en) 2022-11-18
CN115359840B (en) 2023-04-21

Family

ID=84004286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211042461.6A Active CN115359840B (en) 2022-08-29 2022-08-29 Method for identifying key regulatory factors for branch point cell fate decisions in lineage trees

Country Status (1)

Country Link
CN (1) CN115359840B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant