CN115588465B - Screening method and system for character related genes - Google Patents
Screening method and system for character related genes
- Publication number
- CN115588465B (CN202211277659.2A)
- Authority
- CN
- China
- Prior art keywords
- pathway
- cell
- gene
- data
- genetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/40—Population genetics; Linkage disequilibrium
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Abstract
The invention discloses a screening method, system, device and computer-readable storage medium for trait-related genes. The screening method comprises the following steps: acquiring single-cell sequencing data; processing the single-cell sequencing data and pathway data using a machine learning method to obtain a pathway activity score (PAS) matrix of the cell pathways and the PAS of each cell pathway; acquiring genetic association data, annotating the SNPs in the genetic association data to the pathways based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway; using a polygenic regression model of the genetic association data, estimating the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data to obtain estimated coefficients; processing the estimated coefficients and the PAS to obtain the genetically related pathway activity score (gPAS) of each cell; and performing correlation analysis between the gPAS and the gene expression level in each cell, ranking the genes by correlation, and screening out N trait-related genes.
Description
Technical Field
The invention relates to the technical field of gene sequencing, and in particular to a screening method and system for trait-related genes.
Background
Using single-cell RNA sequencing (scRNA-seq) to identify key cell subpopulations associated with complex diseases or traits is critical to understanding the mechanisms of complex diseases. However, because of its high cost and low throughput, scRNA-seq cannot be applied to large cohorts; most single-cell studies currently include no more than 20 samples, which limits statistical power and prevents accurate identification of the risk subpopulations associated with a disease or trait. In addition, scRNA-seq data are characterized by high sparsity, technical noise and unstable variance at the gene level.
Genome-wide association studies (GWAS) are widely used to study complex diseases and traits. Linking scRNA-seq data with the phenotype-associated genetic information that GWAS derive from large samples is considered a practical and efficient way to reveal the genetic mechanisms of complex diseases or traits at single-cell resolution.
Existing methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases, such as LDSC-SEG, MAGMA and RolyPoly, require extensive parameter tuning, rely on annotating cell types with known marker genes, and largely ignore the heterogeneity within each cell type. Furthermore, the prior art can identify highly expressed genes, but its focus on such genes may underestimate the functional role of genes whose expression is relatively low yet important in determining cell fate.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a screening method and system for trait-related genes. Using a scoring method based on single-cell pathways, the method combines scRNA-seq data with genetic association data to infer trait-related genes, cells and the like, thereby mining the biological patterns hidden in single-cell data and addressing the related life-science questions.
The application discloses a screening method for trait-related genes, comprising the following steps:
acquiring single-cell sequencing data;
processing the single-cell sequencing data and pathway data using a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
acquiring genetic association data, annotating the SNPs in the genetic association data to the pathways based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway;
using a polygenic regression model of the genetic association data, estimating the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data to obtain estimated coefficients;
processing the estimated coefficients and the PAS to obtain the genetically related pathway activity score gPAS of each cell;
and performing correlation analysis between the gPAS and the gene expression level in each cell, ranking the genes by correlation, and screening out N trait-related genes.
The genetic effect value is obtained by the formula $\hat{\beta} = R\beta + X^{T}\epsilon$, where $\hat{\beta}$ denotes the genetic effect values, $\beta$ denotes the theoretical effect size vector of the m SNPs, $\epsilon$ denotes the random environmental error, $R$ denotes the LD matrix, and $X^{T}$ denotes the standardized genotypes of the SNPs in the genetic association data samples;
where $\tau_{i,j}$ denotes the estimated coefficient of pathway i in cell j, $\tau_{0}$ denotes the intercept term, $\sigma_{i}^{2}$ denotes the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ denotes the weighted PAS;
the step of processing the estimated coefficients and the PAS to obtain the genetically related pathway activity score gPAS of the cell comprises: multiplying the estimated coefficients by the PAS and summing the products to obtain the gPAS of the cell;
where $gP_{j}$ denotes the genetically related pathway activity score gPAS of cell j, and $\hat{\tau}_{i,j}$ denotes the optimized estimated coefficient;
the step of processing the single cell sequencing data and the pathway data by using a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway comprises the following steps:
acquiring single cell sequencing data and pathway data;
carrying out standardization treatment on the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix;
based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;
optionally, optimizing the pathway activity score PAS to obtain the weighted PAS;
where $\tilde{s}_{i,j}$ denotes the weighted PAS, $\tilde{e}_{g,j}$ denotes the optimized normalized expression of gene g in cell j, and $s_{i,j}$ denotes the pathway activity score PAS of pathway i in cell j;
Optionally, the machine learning method includes a Singular Value Decomposition (SVD) method.
Annotating SNPs in the genetic association data into pathway data comprises:
screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each gene to the corresponding pathways based on the pathway data to obtain SNP-annotated pathways;
optionally, the step of obtaining the SNPs of a single gene comprises: after the SNPs of the genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain an assignment result;
where a single SNP is assigned to several genes, treating each of the repeated assignments as an independent SNP-gene pair; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting the SNPs on sex chromosomes; and thereby obtaining the SNPs of the single gene;
the SNPs of the single genes are pooled to obtain the SNPs of all genes.
The method further comprises: calculating a trait-related score TRS for each cell from the N trait-related genes; optionally, the TRS is calculated from the N trait-related genes using a cell scoring method.
Optionally, the correlation analysis and ranking of the gPAS against the gene expression level in each cell comprises: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order.
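The gPAS summation and the Pearson-based ranking described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names (`gpas_per_cell`, `pearson`, `rank_genes`) and the nested-list layout (`tau[i][j]` and `pas[i][j]` for pathway i and cell j) are assumptions made for the example.

```python
import math

def gpas_per_cell(tau, pas):
    """Sum coefficient * PAS over pathways, giving one gPAS value per cell."""
    n_pathways, n_cells = len(pas), len(pas[0])
    return [sum(tau[i][j] * pas[i][j] for i in range(n_pathways))
            for j in range(n_cells)]

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rank_genes(expr_by_gene, scores, top_n=1000):
    """Rank genes by correlation with the per-cell scores, highest first."""
    ranked = sorted(expr_by_gene,
                    key=lambda g: pearson(expr_by_gene[g], scores),
                    reverse=True)
    return ranked[:top_n]
```

Here `top_n=1000` mirrors the optional value given in the text; taking the last 1000 genes instead would correspond to sorting in ascending order.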
An application comprising any one of:
obtaining trait-related cells based on the trait-related score TRS of each cell and the P value at the level of each cell;
optionally, obtaining trait-related cell types or subpopulations based on the block bootstrap method;
optionally, ranking the genetically related pathway activity scores gPAS, and obtaining trait-related pathways from the ranking result and the P value of each pathway at the cell-type level;
a screening apparatus for a trait-related gene, the apparatus comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is used for calling the program instructions, and when the program instructions are executed, the processor performs the above screening method for trait-related genes.
A screening system for a trait-related gene, comprising:
an acquisition unit for acquiring single cell sequencing data;
the first processing unit is used for processing the single-cell sequencing data and pathway data using a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
the second processing unit is used for acquiring genetic association data, annotating the SNPs in the genetic association data to the pathways based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway;
the third processing unit is used for estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data to obtain estimated coefficients;
a fourth processing unit is used for processing the estimated coefficients and the PAS to obtain the genetically related pathway activity score gPAS of each cell;
and a fifth processing unit is used for performing correlation analysis between the gPAS and the gene expression level in each cell, ranking the genes by correlation, and screening out N trait-related genes.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described trait-related gene screening method.
The application has the following beneficial effects:
1. The application discloses a scoring method based on single-cell pathways, which uses a polygenic regression model to reveal trait-related genes and cell subpopulations from pathway-activity-transformed scRNA-seq data and genetic association study data. The method addresses the problem that the small sample sizes and high sparsity of scRNA-seq data hinder the identification of genes and cell subpopulations linked to the polygenic risk of complex diseases, limit statistical power, and prevent accurate identification of the risk subpopulations associated with a disease or trait. The method mines the biological patterns hidden in single-cell sequencing data and analyzes multiple dimensions, such as the relation between population genetic variation and disease and the gene abundance information from single-cell sequencing, thereby improving the accuracy and depth of data analysis.
2. Based on large-scale simulated and real data, the method combines scRNA-seq data with genetic association data and thereby overcomes the drawbacks of the prior art, which requires extensive parameter tuning to annotate cell types with known marker genes and largely ignores the heterogeneity within each cell type. The method does not underestimate the functional role of genes whose expression is relatively low yet important in determining cell fate, as happens when attention is focused on highly expressed genes; by aggregating the effects of genes with low average expression, it helps identify disease-related early developmental events or progenitor cells, for example key transcription factors involved in cell development. It also reduces the sparsity and technical noise of scRNA-seq data and is robust and powerful in identifying trait-related cell types and subpopulations.
3. The application discloses a screening method for trait-related genes based on single-cell pathway scoring, which integrates the functional effects of different genes participating in the same biological pathway to obtain stable cell states and increases statistical power, biological interpretability and reproducibility of results. It overcomes the limitation of relying on known annotated cell types and can discover new genetically related subpopulations as well as the key genes or pathways of cell types, giving it broad applicability and practical value.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an analysis schematic flow chart of a screening method of a trait related gene provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of screening equipment for trait related genes provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a screening system for trait related genes provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram, provided by an embodiment of the invention, of obtaining the gPAS by the scoring method based on single-cell pathways and using the gPAS to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations and trait-related pathways.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments according to the invention without any creative effort, are within the protection scope of the invention.
FIG. 1 is a schematic flow chart of a screening method for trait-related genes provided by an embodiment of the invention. Specifically, the method comprises the following steps:
101: acquiring single cell sequencing data;
In one embodiment, the single-cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from humans (Homo sapiens) and mice (Mus musculus). For blood cells, two scRNA-seq datasets, based on human BMMCs (n = 35,582 cells) and human PBMCs (n = 97,039 cells), were collected to reveal trait-related cell subpopulations or types. For immune- and metabolism-related diseases/traits, the scRNA-seq dataset from the Human Cell Landscape (HCL; n = 513,707 cells in 35 adult tissues) was used to construct a pseudo-bulk expression profile for each tissue and to prioritize the risk tissues related to each disease/trait.
In one embodiment, three single-cell datasets are collected for brain-related diseases: the mouse brain scRNA-seq dataset (n = 160,796 cells), the human brain olfactory cortex snRNA-seq dataset (n = 11,786 cells), and the human brain snRNA-seq dataset covering both the olfactory cortex and the somatosensory cortex (n = 101,906 cells).
In one embodiment, to discover immune cell populations associated with severe COVID-19, a large-scale PBMC scRNA-seq dataset (n = 469,453 cells) was collected, containing 254 peripheral blood samples with varying degrees of COVID-19 severity (mild n = 109 samples, moderate n = 102 samples, severe n = 50 samples) and 16 healthy controls.
102: processing single-cell sequencing data and pathway data by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
in one embodiment, the step of processing single cell sequencing data and pathway data to obtain a PAS scoring matrix of a cell pathway and obtaining PAS of the cell pathway by using a machine learning method comprises the following steps:
acquiring single cell sequencing data and pathway data;
normalizing the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix; specifically, the sparse gene-cell matrix in the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, giving the normalized expression of each single gene in each single cell; the normalization formula is expressed in terms of $c_{g,j}$, the raw expression of gene g in cell j, and $e_{g,j}$, the normalized expression of gene g in cell j;
based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scoring PAS of single cells in a single pathway;
In one embodiment, the pathway data are KEGG pathway data, and the pathways from the KEGG database are the default gene sets for evaluating the PAS. The standardized gene-cell matrix is converted into a pathway-cell matrix using singular value decomposition (SVD). Let $P_i$ denote the gene set of pathway i. For each pathway i, a submatrix $A_i$ is selected from the standardized gene-cell matrix $A$, where the columns of $A_i$ are all N cells and the rows are the $|P_i|$ genes in the pathway gene set $P_i$. SVD gives $A_i = U \Sigma V^{T}$, where $U$ denotes a $|P_i| \times |P_i|$ orthogonal matrix, $\Sigma$ denotes a $|P_i| \times N$ diagonal matrix whose entries are all zero except on the main diagonal, and $V$ denotes an $N \times N$ orthogonal matrix. The t-th column vector $v_t$ of the right orthogonal matrix $V$ represents the t-th principal component, reflecting the co-expression variability of the pathway genes in the single-cell data. Since the first principal component PC1 captures the largest variance, the projection of the features of cell j on PC1 represents the pathway activity score $s_{i,j}$ of pathway i in cell j. For cell j, all expression variances in pathway i are used as weights to adjust the original $s_{i,j}$. For gene g in pathway i, the gene expression $e_{g,j}$ is rescaled using min-max scaling to give the adjusted gene expression $\tilde{e}_{g,j}$.
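The normalization and SVD-based scoring described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the normalization is taken to be the common library-size log-normalization `log1p(count / cell_total * 10,000)`, each gene is centred across cells before the SVD, and the names `log_normalize` and `pathway_activity_scores` are hypothetical.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Assumed normalization: log1p(count / per-cell total * scale_factor)."""
    counts = np.asarray(counts, dtype=float)      # rows = genes, columns = cells
    totals = counts.sum(axis=0, keepdims=True)    # total counts per cell
    totals[totals == 0] = 1.0                     # guard against empty cells
    return np.log1p(counts / totals * scale_factor)

def pathway_activity_scores(expr, gene_names, pathways):
    """Return {pathway: per-cell projection on the first principal component}."""
    row_of = {gene: row for row, gene in enumerate(gene_names)}
    scores = {}
    for name, genes in pathways.items():
        rows = [row_of[g] for g in genes if g in row_of]
        if len(rows) < 2:
            continue                              # too few genes for a component
        a_i = expr[rows, :]
        a_i = a_i - a_i.mean(axis=1, keepdims=True)   # centring is an assumption
        _, s, vt = np.linalg.svd(a_i, full_matrices=False)
        scores[name] = s[0] * vt[0]               # equals u_1^T a_i, one value per cell
    return scores
```

The sign of a singular vector is arbitrary, so scores from a sketch like this are only defined up to a global sign per pathway; a real pipeline would need a convention for orienting PC1.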
In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS;
where $\tilde{s}_{i,j}$ denotes the weighted PAS, $\tilde{e}_{g,j}$ denotes the optimized normalized expression of gene g in cell j, and $s_{i,j}$ denotes the pathway activity score PAS of pathway i in cell j;
in one embodiment, $\tilde{e}_{g,j}$ is obtained as $\tilde{e}_{g,j} = \dfrac{e_{g,j} - e_{i}^{\min}}{e_{i}^{\max} - e_{i}^{\min}}$,
where $e_{g,j}$ denotes the normalized expression of gene g in cell j, $e_{i}^{\max}$ denotes the maximum value of gene expression in pathway i, and $e_{i}^{\min}$ denotes the minimum value of gene expression in pathway i.
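As a small worked illustration of the min-max scaling above, the following sketch rescales the normalized expression values of the genes of one pathway. Taking the minimum and maximum over the pathway genes within a single cell is an assumption, as is the convention of returning zeros when all values are equal; `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros (assumed convention)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Normalized expression of three pathway genes in one cell (made-up numbers).
scaled = min_max_scale([2.0, 4.0, 6.0])
```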
Optionally, the machine learning method includes singular value decomposition (SVD). SVD greatly improves the computational efficiency of analyzing sparse matrices and yields the eigenvalues without computing the covariance matrix. Using SVD, the standardized gene-cell matrix is reduced to a pathway-cell matrix in a low-dimensional space.
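The point that SVD yields the eigenvalues without forming the covariance matrix can be checked numerically. The sketch below uses random toy data and a hypothetical helper name; it relies only on the standard identity that the eigenvalues of the sample covariance matrix equal the squared singular values of the centred data divided by N - 1.

```python
import numpy as np

def covariance_eigenvalues_via_svd(matrix):
    """Eigenvalues of the gene-gene covariance matrix, from singular values only."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)   # centre each gene (row)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values ** 2 / (matrix.shape[1] - 1)

rng = np.random.default_rng(0)
genes_by_cells = rng.normal(size=(5, 40))        # toy data: 5 genes, 40 cells
via_svd = covariance_eigenvalues_via_svd(genes_by_cells)
# Reference values computed the explicit way, sorted in descending order.
direct = np.linalg.eigvalsh(np.cov(genes_by_cells))[::-1]
```

Both arrays agree to floating-point precision, while the SVD route never materializes the covariance matrix.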
103: acquiring genetic association data, annotating SNPs in the genetic association data into the path data based on the path data, and obtaining genetic effect values of all SNPs in single path data;
in one embodiment, the step of annotating the SNPs in the genetic association data into the pathway data comprises:
screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each gene to the corresponding pathways based on the pathway data to obtain SNP-annotated pathways;
optionally, the SNPs of a single gene are obtained by the following steps: after the SNPs of the genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain an assignment result;
where a single SNP is assigned to several genes, treating each of the repeated assignments as an independent SNP-gene pair; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting the SNPs on sex chromosomes; and thereby obtaining the SNPs of the single gene;
the SNPs of the single genes are pooled to obtain the SNPs of all genes. Specifically, the genetic association data are GWAS data, and the SNPs in the GWAS summary statistics are assigned to their related genes using a 20 kb window as the default parameter. The notation $g(k)$ denotes the gene g to which SNP k is assigned; through this assignment of SNP-gene pairs, a single SNP may correspond to several genes. Because the whole procedure infers its parameters from thousands of SNPs, SNPs that correspond to several genes have no appreciable effect on the inference, so the repeated assignments are treated as independent SNP-gene pairs. SNPs with a minor allele frequency (MAF) greater than 0.1 are retained and SNPs on sex chromosomes are deleted, finally giving the SNPs of the related genes;
based on the pathways in the KEGG database, the genes with their assigned SNPs are annotated to the pathways, and $S_i$ denotes the set of SNPs in pathway i. Linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using the phase 3 data of the 1000 Genomes Project. The present scheme also provides functional gene sets such as GO, Reactome and MSigDB as alternative options. In addition, the major histocompatibility complex region (chr6:25-35 Mbp), where extensive LD exists, is deleted.
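The filtering and annotation steps above can be sketched in plain Python. The thresholds (20 kb window, MAF greater than 0.1, removal of sex chromosomes and of chr6:25-35 Mbp) come from the text; everything else is an assumption for illustration, including the `Snp` and `Gene` record layouts, the symmetric window on both sides of the gene, and the helper names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    maf: float

@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int

WINDOW = 20_000                        # 20 kb default window from the text
MAF_MIN = 0.1                          # MAF threshold from the text
MHC = ("6", 25_000_000, 35_000_000)    # chr6:25-35 Mbp region to drop
SEX_CHROMS = {"X", "Y"}

def keep_snp(snp):
    """Apply the MAF, sex-chromosome and MHC filters."""
    if snp.chrom in SEX_CHROMS or snp.maf <= MAF_MIN:
        return False
    chrom, low, high = MHC
    return not (snp.chrom == chrom and low <= snp.pos <= high)

def snp_gene_pairs(snps, genes, window=WINDOW):
    """Assign each kept SNP to every gene whose padded span contains it.

    A SNP near several genes yields several independent pairs.
    """
    return [(snp.rsid, gene.symbol)
            for snp in snps if keep_snp(snp)
            for gene in genes
            if gene.chrom == snp.chrom
            and gene.start - window <= snp.pos <= gene.end + window]

def snps_by_pathway(pairs, pathways):
    """Collect the SNP set S_i for each pathway from its member genes."""
    return {name: sorted({rsid for rsid, gene in pairs if gene in set(members)})
            for name, members in pathways.items()}
```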
In one embodiment, the GWAS data relate to a given phenotype, and the phenotype annotation of the given phenotype includes a dichotomous trait, a continuous dependent trait, or intra-phenotype and center measurements.
104: parameter estimation is carried out on the distribution of the genetic effect values, based on the genetic effect values of each SNP in the PAS and the pathway data, by utilizing a polygenic regression model of the genetic association data, so as to obtain an estimation coefficient;
in one embodiment, the genetic effect value is given by a formula relating the following terms: the theoretical effect size vector of the m SNPs; the random environmental error; R, the LD matrix; and X^T, the standardized genotype of the SNPs in the genetic association data sample;
in one embodiment, S_i denotes the set of all SNPs contained in the genes mapped to pathway i; the polygenic model assumes a priori that the effect sizes of all SNPs in pathway i follow a multivariate normal distribution whose covariance is the variance of the SNP effect sizes in the pathway multiplied by an identity matrix;
wherein the estimation coefficient of pathway i in cell j reflects the effect of the cell-specific PAS on the variance of the GWAS effect sizes, i.e., the effect of inheritance on the response; the remaining terms are an intercept term, the variance of the SNP effect sizes in the pathway, and the weighted PAS;
in one embodiment, based on the foregoing assumptions, the distribution of the genetic effect values is estimated by a formula, and the estimation coefficient is optimized using that formula;
in one embodiment, to optimize the estimation coefficient of each pathway in the polygenic regression model, the model is fitted by a method-of-moments approach, which significantly improves computational efficiency and estimation; the observed and expected squared effects of the SNPs associated with each pathway are then fitted, and the expected values are estimated by a formula in which Tr denotes the matrix trace.
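By way of illustration only, the polygenic model of step 104 may be written out as below. All symbols other than R, X^T, S_i and Tr are assumed notation chosen to match the verbal definitions in the text: \hat{\beta} is the observed effect vector, \beta the theoretical effect vector, \varepsilon the environmental error, n the sample size, \tau_0 the intercept, \tau_{i,j} the estimation coefficient of pathway i in cell j, and \tilde{s}_{i,j} the weighted PAS. The zero mean of the prior and the sample-size normalization are likewise assumptions. The last line is a derivation under those assumptions (independent errors with variance \sigma_\varepsilon^2, LD with SNPs outside S_i ignored) and is not asserted to be the formula of the embodiment.

```latex
\begin{align}
\hat{\beta} &= R\beta + \tfrac{1}{n} X^{T}\varepsilon,
  \qquad R = \tfrac{1}{n} X^{T}X \\
\beta_{S_i} &\sim \mathrm{MVN}\!\left(0,\ \sigma_{i,j}^{2}\, I_{|S_i|}\right) \\
\sigma_{i,j}^{2} &= \tau_0 + \tau_{i,j}\,\tilde{s}_{i,j} \\
\mathbb{E}\!\left[\lVert \hat{\beta}_{S_i} \rVert^{2}\right]
  &= \sigma_{i,j}^{2}\,\mathrm{Tr}\!\left(R_{S_i}^{2}\right)
   + \tfrac{\sigma_{\varepsilon}^{2}}{n}\,\mathrm{Tr}\!\left(R_{S_i}\right)
\end{align}
```

Under this reading, a method-of-moments fit matches the observed sum of squared effects in each pathway to the right-hand side of the last line, which is linear in \tau_0 and \tau_{i,j}.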
105: processing the estimated coefficient and PAS to obtain genetic related pathway activity score gPAS of the cell;
in one embodiment, the step of processing the estimated coefficients and PAS to obtain a genetically related pathway activity score, gPAS, of the cell comprises: multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
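By way of illustration only, the multiply-then-sum step may be sketched as follows. The function name and the pathways-by-cells array layout are assumptions, as is the reading that the sum runs over pathways so that each cell receives one total score.

```python
import numpy as np

def gpas_per_cell(tau, pas):
    """Multiply estimation coefficients by PAS elementwise and sum per cell.

    tau, pas: arrays of shape (n_pathways, n_cells) (assumed layout).
    Returns an array of shape (n_cells,).
    """
    tau = np.asarray(tau, dtype=float)
    pas = np.asarray(pas, dtype=float)
    if tau.shape != pas.shape:
        raise ValueError("tau and pas must have the same shape")
    # Elementwise product gives a per-pathway, per-cell score;
    # summing over axis 0 (pathways) gives the total for each cell.
    return (tau * pas).sum(axis=0)
```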
106: carrying out correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, ranking the genes accordingly, and screening out N trait-related genes.
Optionally, the method for performing the correlation analysis and ranking comprises: determining the correlation between the expression of each single gene and the total gPAS score by the Pearson correlation coefficient (PCC), and ranking the genes by that correlation to obtain the N trait-related genes; specifically, to maximize statistical power, the expression of each gene g is inversely weighted by its gene-specific technical noise level, which is estimated by modeling the mean-variance relationship across genes in the scRNA-seq data;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order; N is not limited to 1000 and may be any natural number.
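By way of illustration only, the Pearson-correlation ranking may be sketched as follows. The function name and the genes-by-cells layout are assumptions, and the inverse weighting by technical noise is omitted for brevity.

```python
import numpy as np

def rank_genes_by_correlation(expr, gpas, gene_names, n_top=1000):
    """Rank genes by the Pearson correlation of their expression with gPAS.

    expr: array of shape (n_genes, n_cells); gpas: array of shape (n_cells,).
    Returns the top n_top (gene, PCC) pairs in descending order of PCC.
    """
    expr = np.asarray(expr, dtype=float)
    gpas = np.asarray(gpas, dtype=float)
    xc = expr - expr.mean(axis=1, keepdims=True)  # center each gene
    yc = gpas - gpas.mean()                       # center the score
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        # Genes with zero variance get a correlation of 0 rather than NaN.
        pcc = np.where(denom > 0, (xc @ yc) / denom, 0.0)
    order = np.argsort(-pcc, kind="stable")[:n_top]
    return [(gene_names[i], float(pcc[i])) for i in order]
```

Taking the last N entries of the full ranking instead of the first N would give the most negatively correlated genes, corresponding to the ascending-order option.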
In one embodiment, the method further comprises: calculating a trait-related score TRS of each cell from the N trait-related genes; optionally, the TRS of the N genes is calculated using the cell scoring method of the AddModuleScore function in Seurat; the TRS of each cell is given by: TRS = average RE(GS) - average RE(CG); wherein average RE(GS) is the average relative expression of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression of a control gene set of the same size randomly drawn from the existing gene pool; RE denotes relative expression, GS the gene set, and CG the control gene set;
in one embodiment, the statistical significance of each individual cell is determined by calculating the rank distribution of the trait-related genes, to further evaluate whether the cell is significantly related to the trait of interest; specifically, the percentile rank of each trait-related gene within the cell is obtained from the expression level of gene g in cell j, where G denotes the number of genes associated with the specified trait; the gene percentile ranks follow a normal distribution, and a statistic is obtained for each cell under the null hypothesis that there is no correlation between the percentile ranks of the genes.
Given the large number of cells in single-cell data, the central limit theorem is used to derive the distribution of this statistic, where N is the total number of cells; a significance test is then performed, yielding a P value for each cell j.
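By way of illustration only, the expression TRS = average RE(GS) - average RE(CG) may be sketched as follows. This is a simplification and not the Seurat implementation: it is assumed here that the control genes are drawn uniformly at random from the genes outside the trait-related set, whereas AddModuleScore selects controls from expression-matched bins. The function name and array layout are hypothetical.

```python
import numpy as np

def trait_related_score(expr, gene_set_idx, seed=0):
    """Per-cell TRS: mean expression of the gene set minus that of a
    same-sized random control set (simplified; no expression binning).

    expr: array of shape (n_genes, n_cells) of relative expression values.
    gene_set_idx: row indices of the N trait-related genes.
    """
    expr = np.asarray(expr, dtype=float)
    gene_set_idx = np.asarray(gene_set_idx)
    pool = np.setdiff1d(np.arange(expr.shape[0]), gene_set_idx)
    if pool.size < gene_set_idx.size:
        raise ValueError("not enough remaining genes for a control set")
    rng = np.random.default_rng(seed)
    control_idx = rng.choice(pool, size=gene_set_idx.size, replace=False)
    return expr[gene_set_idx].mean(axis=0) - expr[control_idx].mean(axis=0)
```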
An application, comprising any one of the following:
obtaining trait-related cells based on the trait-related score TRS of each cell and the cell-level P value (indicating whether an individual cell is related to the trait);
optionally, obtaining a trait-related cell type or subpopulation based on the block bootstrap method (determining whether the cell type to which an individual cell belongs is related to the trait); specifically, a group of cells is treated as a pseudo-bulk transcriptome profile, and the gene expression levels are averaged across the cells within a given cell type; for the associated cell types, the standard error is estimated by the block bootstrap method, and a t statistic with a corresponding P value is calculated for each cell type. Since the goal of the bootstrap is to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to divide the genome into multiple biologically meaningful blocks, and the pathway-based blocks are sampled with replacement. Under the default parameters, 200 block bootstrap iterations are performed for each cell type association analysis; the default parameters may be modified in a specific implementation.
Optionally, ranking the genetic related pathway activity scores gPAS, and obtaining trait-related pathways according to the ranking result (the top-ranked pathways are selected) and the P value of each pathway at the cell-type level; specifically, the gPAS are ranked based on the central limit theorem; for a cell type t, the percentile rank of each pathway is calculated for each cell j within that cell type from the gPAS rank of pathway i in cell j, where M denotes the total number of pathways; similarly, the statistical significance of each pathway i in cell type t is calculated, yielding a P value for each pathway i in cell type t.
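By way of illustration only, the block bootstrap estimate of the standard error used for the cell-type analysis above may be sketched as follows. It is assumed here that each pathway contributes one block-level value for the cell type and that the statistic is an estimator (the mean by default) over blocks; the function name and these choices are hypothetical. Only the resampling of pathway-based blocks with replacement and the default of 200 iterations are taken from the text.

```python
import numpy as np

def block_bootstrap_t(block_values, estimator=np.mean, n_iter=200, seed=0):
    """Resample pathway blocks with replacement and return
    (estimate, standard error, t statistic).

    block_values: one value per pathway block for a given cell type.
    """
    block_values = np.asarray(block_values, dtype=float)
    rng = np.random.default_rng(seed)
    n_blocks = block_values.shape[0]
    estimate = float(estimator(block_values))
    boot = np.empty(n_iter)
    for b in range(n_iter):
        idx = rng.integers(0, n_blocks, size=n_blocks)  # blocks, with replacement
        boot[b] = estimator(block_values[idx])
    se = float(boot.std(ddof=1))
    # The t statistic is undefined when all resamples agree.
    t_stat = estimate / se if se > 0 else float("nan")
    return estimate, se, t_stat
```

A P value would then be read from the reference distribution of the t statistic; that step is left out of the sketch.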
FIG. 2 is a schematic diagram of a screening device for trait-related genes provided by an embodiment of the present invention, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is used for calling the program instructions, and when the program instructions are executed, the above screening method for trait-related genes is performed.
FIG. 3 is a schematic diagram of a screening system for trait-related genes provided by an embodiment of the present invention, the system comprising:
an acquisition unit 301 for acquiring single cell sequencing data;
a first processing unit 302, configured to process single-cell sequencing data and pathway data by using a machine learning method, so as to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
a second processing unit 303, configured to obtain genetic association data, and annotate SNPs in the genetic association data into the pathway data based on the pathway data, so as to obtain genetic effect values of all SNPs in the single pathway data;
a third processing unit 304, configured to perform parameter estimation on the distribution of genetic effect values based on the genetic effect values of each SNP in the PAS and the pathway data by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient;
a fourth processing unit 305 for processing the estimation coefficient and PAS to obtain a genetic related pathway activity score gPAS of the cell;
and a fifth processing unit 306, configured to perform correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, rank the genes accordingly, and screen out N trait-related genes.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described trait-related-gene screening method.
FIG. 4 is an overview diagram, provided by an embodiment of the present invention, of obtaining gPAS by the single-cell pathway-based scoring method and using gPAS to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations, and trait-related pathways;
wherein A represents converting the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 representing the PAS of each pathway; B represents annotating the SNPs in the GWAS data into the corresponding pathways; C represents the polygenic regression model, wherein the top graph shows the estimation coefficients of each pathway obtained with the polygenic regression model, and the gPAS calculated from the estimation coefficients and the corresponding PAS, and the bottom graph shows the Pearson correlation model used to correlate the gPAS of each cell with the expression of the genes in all individual cells in order to rank the trait-related genes; the AddModuleScore function in Seurat is applied to the top N trait-related genes (top 1000 by default) to calculate the trait-related score TRS of each cell; D represents the output, comprising four outputs: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.
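By way of illustration only, panel A (conversion of the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 taken as the PAS) may be sketched as follows. The function name, the dictionary of pathways, the per-gene centering, and the skipping of pathways with fewer than two measured genes are assumptions; the input is assumed to be already normalized.

```python
import numpy as np

def pathway_activity_scores(expr, gene_names, pathways):
    """Return (pathway names, PAS matrix of shape (n_pathways, n_cells)).

    expr: normalized gene-by-cell matrix; pathways: name -> list of genes.
    The PAS of a pathway is the first principal component of its genes.
    """
    expr = np.asarray(expr, dtype=float)
    index = {g: i for i, g in enumerate(gene_names)}
    names, rows = [], []
    for name, members in pathways.items():
        idx = [index[g] for g in members if g in index]
        if len(idx) < 2:  # assumption: skip pathways with too few genes
            continue
        sub = expr[idx]
        sub = sub - sub.mean(axis=1, keepdims=True)  # center each gene
        _, s, vt = np.linalg.svd(sub, full_matrices=False)
        rows.append(s[0] * vt[0])  # PC1 score of every cell
        names.append(name)
    if rows:
        scores = np.vstack(rows)
    else:
        scores = np.empty((0, expr.shape[1]))
    return names, scores
```

The sign of a singular vector is arbitrary, so a practical implementation would need a convention for orienting PC1, for example aligning it with the mean expression of the pathway genes.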
The verification results of this embodiment show that assigning an inherent weight to an indication may moderately improve the performance of the method relative to the default settings.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
The computer device provided by the present invention has been described in detail above; those skilled in the art will appreciate that the foregoing description is not intended to limit the invention, the scope of which is defined by the appended claims.
Claims (16)
1. A method for screening a trait-related gene, comprising:
acquiring single cell sequencing data and pathway data;
carrying out standardization treatment on the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix;
based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;
acquiring genetic association data, annotating SNPs in the genetic association data into the pathway data based on the pathway data, and obtaining genetic effect values of all SNPs in single pathway data; the annotating SNPs in the genetic association data into pathway data comprises: screening the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene into corresponding pathways based on the pathway data to obtain pathways with SNP annotations;
carrying out parameter estimation on the distribution of the genetic effect values based on the genetic effect values of each SNP in the PAS and the pathway data by utilizing a polygenic regression model of the genetic association data to obtain an estimation coefficient; the genetic effect value is given by a formula relating the following terms: the theoretical effect size vector of the m SNPs; the random environmental error; R, the LD matrix; and X^T, the standardized genotype of the SNPs in the genetic association data sample;
processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell;
and carrying out correlation analysis and ranking on the genetic related pathway activity score gPAS and the gene expression level of each cell, and screening out N trait-related genes.
2. The method for screening a gene related to a trait according to claim 1,
3. The method according to claim 1, wherein the step of processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell comprises: multiplying the estimated coefficient by the PAS and then summing to obtain the genetic related pathway activity score gPAS of the cell.
4. The method for screening a gene related to a trait according to claim 3,
5. The method according to claim 1, wherein the step of obtaining a PAS score matrix of the cell pathway by converting the normalized gene-cell matrix into a pathway-cell matrix using a machine learning method, the PAS score matrix comprising pathway activity scores PAS of individual cells in a single pathway comprises optimizing the pathway activity scores PAS to obtain weighted PAS;
6. The method for screening a gene related to a trait according to claim 5, wherein the weighted PAS is acquired in the following manner:
7. The method for screening a gene related to a trait according to claim 1, wherein the method for machine learning comprises a method for Singular Value Decomposition (SVD).
8. The method for screening a gene related to a trait according to claim 1, wherein the step of obtaining SNPs of the single gene comprises: after SNPs of genes in the genetic association data are obtained, SNP-gene pairs are assigned to obtain an assignment result;
where a single SNP is assigned to a plurality of genes, treating each of the resulting duplicated SNP-gene pairs as an independent SNP; retaining SNPs with minor allele frequencies greater than 0.1 in the assignment result; deleting SNPs on sex chromosomes; and obtaining the SNPs of the single gene.
9. The method for screening a gene related to a trait according to claim 1, wherein the method further comprises: calculating the trait-related score TRS of each cell according to the N trait-related genes.
10. The method for screening a gene related to a trait according to claim 9,
calculating the trait related score TRS of the N trait related genes by using a cell scoring method.
11. The method for screening a gene related to a trait according to claim 1,
the method for performing correlation analysis and ranking on the genetic related pathway activity score gPAS and the gene expression level of each cell comprises the following steps: determining the correlation between the expression of the single gene and the gPAS through the Pearson correlation coefficient, and ranking the genes according to the correlation to obtain the N trait-related genes.
12. The method for screening a gene related to a trait according to claim 11,
the N trait-related genes are the first 1000 or the last 1000 trait-related genes ranked by the correlation in descending or ascending order.
13. Use of a method according to any one of claims 1-12, comprising any one of the following:
obtaining trait related cells based on the trait related score TRS for each cell and the level P value for each cell;
obtaining a trait-related cell type or subpopulation based on the block bootstrap method;
and ranking the genetic related pathway activity scores gPAS, and obtaining the trait-related pathway according to the ranking result and the P value of the pathway at the cell-type level.
14. A screening apparatus for a trait-related gene, the apparatus comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is configured to invoke program instructions, which when executed, are configured to perform the screening method of the trait-related gene of any one of claims 1-12.
15. A screening system for a trait-related gene, comprising:
an acquisition unit for acquiring single cell sequencing data and pathway data;
the first processing unit is used for carrying out standardization processing on the gene-cell matrix in the single-cell sequencing data to obtain the standardized gene-cell matrix; based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;
the second processing unit is used for acquiring genetic association data, and annotating SNPs in the genetic association data into the pathway data based on the pathway data to obtain genetic effect values of all SNPs in single pathway data, wherein the step of annotating the SNPs in the genetic association data into the pathway data comprises: screening the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene into corresponding pathways based on the pathway data to obtain pathways with SNP annotations;
the third processing unit is configured to perform parameter estimation on the distribution of the genetic effect values based on the genetic effect value of each SNP in the PAS and the pathway data by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient, wherein the genetic effect value is given by a formula relating the following terms: the theoretical effect size vector of the m SNPs; the random environmental error; R, the LD matrix; and X^T, the standardized genotype of the SNPs in the genetic association data sample;
a fourth processing unit for processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell;
and a fifth processing unit, configured to perform correlation analysis and ranking on the genetic related pathway activity score gPAS and the gene expression level of each cell, and screen out N trait-related genes.
16. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the trait-related gene screening method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211277659.2A CN115588465B (en) | 2022-10-19 | 2022-10-19 | Screening method and system for character related genes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115588465A CN115588465A (en) | 2023-01-10 |
CN115588465B true CN115588465B (en) | 2023-05-23 |
Family
ID=84779173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211277659.2A Active CN115588465B (en) | 2022-10-19 | 2022-10-19 | Screening method and system for character related genes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115588465B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117253549B (en) * | 2023-11-15 | 2024-02-09 | 苏州元脑智能科技有限公司 | Determination method and device of path correlation, storage medium and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109524059A (en) * | 2018-12-28 | 2019-03-26 | 华中农业大学 | A kind of animal individual genomic breeding value appraisal procedure of fast and stable |
WO2020234666A1 (en) * | 2019-05-23 | 2020-11-26 | King Abdullah University Of Science And Technology | Deep learning based system and method for prediction of alternative polyadenylation site |
CN114783613A (en) * | 2022-04-13 | 2022-07-22 | 温州医科大学附属眼视光医院 | Myopia prediction analysis method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170277826A1 (en) * | 2016-03-27 | 2017-09-28 | Insilico Medicine, Inc. | System, method and software for robust transcriptomic data analysis |
US20210071255A1 (en) * | 2019-09-06 | 2021-03-11 | The Broad Institute, Inc. | Methods for identification of genes and genetic variants for complex phenotypes using single cell atlases and uses of the genes and variants thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||