CN115588465B - Screening method and system for trait-related genes - Google Patents

Publication number
CN115588465B
Authority
CN
China
Prior art keywords
pathway
cell
gene
data
genetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211277659.2A
Other languages
Chinese (zh)
Other versions
CN115588465A (en)
Inventor
苏建忠
马云龙
邓春玉
瞿佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Medical University
Original Assignee
Wenzhou Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Medical University
Priority to CN202211277659.2A
Publication of CN115588465A
Application granted
Publication of CN115588465B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Ecology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physiology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a screening method, system, device and computer-readable storage medium for trait-related genes. The screening method comprises: acquiring single-cell sequencing data; processing the single-cell sequencing data and pathway data by a machine learning method to obtain a pathway activity score (PAS) matrix of the cell pathways and the PAS of each cell pathway; acquiring genetic association data, annotating the SNPs in the genetic association data to the pathway data, and obtaining the genetic effect value of each SNP in each pathway; estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients; processing the estimation coefficients and the PAS to obtain the genetically associated pathway activity score (gPAS) of each cell; and performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes accordingly, and screening out N trait-related genes.

Description

Screening method and system for trait-related genes
Technical Field
The invention relates to the technical field of gene sequencing, and in particular to a screening method and a screening system for trait-related genes.
Background
Using single-cell RNA sequencing (scRNA-seq) to identify key cell subpopulations associated with complex diseases or traits is critical to understanding the mechanisms of complex diseases. However, because of its high cost and low throughput, scRNA-seq cannot yet be applied at large scale; most single-cell studies currently include no more than 20 samples, which limits statistical power and prevents accurate identification of the disease- or trait-related risk subsets within cell subpopulations. In addition, scRNA-seq data are characterized by high sparsity, technical noise and unstable variance at the gene level.
Genome-wide association studies (GWAS) are widely used to study complex diseases and traits. Linking scRNA-seq data with the phenotype-associated genetic information that GWAS obtain from large-scale samples is considered a practical and efficient way to reveal the genetic mechanisms of complex diseases or traits at single-cell resolution.
Existing methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases, such as LDSC-SEG, MAGMA and RolyPoly, require extensive parameter tuning in order to annotate cell types with known marker genes, and largely ignore the internal heterogeneity of each cell type. Furthermore, the prior art can identify highly expressed genes, but its excessive focus on such genes risks underestimating the functional role of genes whose expression is relatively low yet important for revealing cell fate.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a screening method and a screening system for trait-related genes. The method combines scRNA-seq data with genetic association data and, through a scoring method based on single-cell pathways, infers trait-related genes, cells and the like, thereby mining the biological regularities hidden in single-cell data and addressing related life-science problems.
The application discloses a screening method for trait-related genes, comprising the following steps:
acquiring single-cell sequencing data;
processing the single-cell sequencing data and pathway data by a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
acquiring genetic association data, annotating the SNPs in the genetic association data to the pathway data based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway;
estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;
processing the estimation coefficients and the PAS to obtain the genetically associated pathway activity score gPAS of each cell;
and performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes accordingly, and screening out N trait-related genes.
The genetic effect value is obtained by the following formula:
[Figure SMS_1]
where [Figure SMS_2] denotes the theoretical effect-size vector of the m SNPs, [Figure SMS_3] denotes the random environmental error, R denotes the LD matrix, and X^T denotes the standardized genotypes of the SNPs in the genetic association data sample;
optionally, the estimation coefficient is obtained by:
[Figure SMS_4]
where [Figure SMS_5] denotes the estimation coefficient of pathway i in cell j, [Figure SMS_6] denotes the intercept term, [Figure SMS_7] denotes the variance of the SNP effect sizes in the pathway, and [Figure SMS_8] denotes the weighted PAS;
the step of processing the estimation coefficients and the PAS to obtain the genetically associated pathway activity score gPAS of the cell comprises: multiplying each estimation coefficient by the corresponding PAS and summing the products to obtain the gPAS of the cell;
the genetically associated pathway activity score gPAS (gP_j) is obtained by:
[Figure SMS_9]
where gP_j denotes the genetically associated pathway activity score gPAS, and [Figure SMS_10] denotes the optimized estimation coefficient;
the step of processing the single-cell sequencing data and the pathway data by a machine learning method to obtain the PAS scoring matrix of the cell pathways and the PAS of each cell pathway comprises:
acquiring single-cell sequencing data and pathway data;
normalizing the gene-cell matrix in the single-cell sequencing data to obtain a normalized gene-cell matrix;
based on the pathway data, converting the normalized gene-cell matrix into a pathway-cell matrix by the machine learning method, and obtaining the PAS scoring matrix of the cell pathways from the pathway-cell matrix, wherein the PAS scoring matrix contains the pathway activity score PAS of each single cell in each pathway;
optionally, optimizing the pathway activity score PAS to obtain the weighted PAS;
the weighted PAS is obtained by:
[Figure SMS_11]
where [Figure SMS_12] denotes the weighted PAS, [Figure SMS_13] denotes the optimized normalized expression of gene g in cell j, and [Figure SMS_14] denotes the pathway activity score PAS of pathway i in cell j;
optionally, [Figure SMS_15] is obtained by:
[Figure SMS_16]
Optionally, the machine learning method includes singular value decomposition (SVD).
Annotating the SNPs in the genetic association data to the pathway data comprises:
screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each single gene to the corresponding pathways based on the pathway data to obtain SNP-annotated pathways;
optionally, the step of obtaining the SNPs of a single gene comprises: after the SNPs of the genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain an assignment result;
where a single SNP corresponds to multiple genes, treating each such repeated pairing as an independent SNP-gene association; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting SNPs on sex chromosomes; and thereby obtaining the SNPs of the single gene;
the SNPs of the single genes are pooled to obtain the SNPs of all genes.
The method further comprises: calculating a trait-related score TRS for each cell according to the N trait-related genes; optionally, the TRS of the N trait-related genes is calculated using a cell scoring method.
Optionally, the correlation analysis and ranking of the gPAS against the gene expression level of each cell comprises: determining the correlation between the expression of each gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order.
An application comprising any one of:
obtaining trait-related cells based on the trait-related score TRS of each cell and the cell-level P value;
optionally, obtaining trait-related cell types or subpopulations based on the block bootstrap method;
optionally, ranking the genetically associated pathway activity scores gPAS, and obtaining trait-related pathways according to the ranking result and the P value of each pathway at the cell-type level;
a screening device for trait-related genes, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is used for calling the program instructions, and when the program instructions are executed, the processor performs the above screening method for trait-related genes.
A screening system for trait-related genes, comprising:
an acquisition unit for acquiring single-cell sequencing data;
a first processing unit for processing the single-cell sequencing data and pathway data by a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
a second processing unit for acquiring genetic association data, annotating the SNPs in the genetic association data to the pathway data based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway;
a third processing unit for estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;
a fourth processing unit for processing the estimation coefficients and the PAS to obtain the genetically associated pathway activity score gPAS of each cell;
and a fifth processing unit for performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes accordingly, and screening out N trait-related genes.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above screening method for trait-related genes.
The application has the following beneficial effects:
1. The application discloses a scoring method based on single-cell pathways, which uses a polygenic regression model together with pathway-activity-transformed scRNA-seq data and genetic association study data to reveal trait-related genes and cell subpopulations. It effectively addresses the problem that the small sample size and high sparsity of scRNA-seq data greatly hinder the identification of genes and cell subpopulations associated with the polygenic risk of complex diseases, which limits statistical power and prevents accurate identification of disease- or trait-related risk subsets within cell subpopulations. The method mines the biological regularities hidden in single-cell sequencing data and analyzes, across multiple dimensions, information such as the relationship between population genetic variation and disease and single-cell gene abundance, greatly improving the accuracy and depth of data analysis.
2. Based on large-scale simulated and real data, the method combines scRNA-seq data with genetic association data and thereby overcomes the prior-art problems that extensive parameter tuning is needed to annotate cell types with known marker genes and that the internal heterogeneity of each cell type is largely ignored. It avoids underestimating, through excessive attention to highly expressed genes, the functional role of genes whose expression is relatively low but important for revealing cell fate; by aggregating the effects of genes with low average expression, it helps to identify disease-related early developmental events or progenitor cells, for example key transcription factors involved in cell development. At the same time, it effectively reduces the sparsity and technical noise of scRNA-seq data and shows good robustness and power in identifying trait-related cell types and subpopulations.
3. The application discloses a screening method for trait-related genes based on single-cell pathway scoring, which fuses the functional effects of different genes participating in the same biological pathway to obtain stable cell states, and markedly increases statistical power, biological interpretability and reproducibility of results. It overcomes the limitation of relying on known annotated cell types and can discover new genetically associated subpopulations as well as the key genes or pathways of cell types, and therefore has wide applicability and strong practicality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the screening method for trait-related genes provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the screening device for trait-related genes provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of the screening system for trait-related genes provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of obtaining the gPAS by the single-cell pathway-based scoring method and using the gPAS to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations and trait-related pathways, according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
Some of the flows described in the specification, the claims and the foregoing figures include a plurality of operations that appear in a particular order, but it should be understood that these operations may be performed out of the order in which they appear or in parallel; operation numbers such as 101 and 102 merely distinguish the operations from one another and do not themselves imply any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
The embodiments described below with reference to the accompanying drawings are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments of the invention without creative effort fall within the protection scope of the invention.
FIG. 1 is a schematic flow chart of the screening method for trait-related genes provided by an embodiment of the invention. Specifically, the method comprises the following steps:
101: acquiring single-cell sequencing data;
In one embodiment, the single-cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from human (Homo sapiens) and mouse (Mus musculus). For blood cells, two scRNA-seq datasets based on human BMMCs (n = 35,582 cells) and human PBMCs (n = 97,039 cells) were collected to reveal trait-related cell subpopulations or types. For immune- and metabolism-related diseases/traits, the scRNA-seq dataset from the Human Cell Landscape (HCL, n = 513,707 cells in 35 adult tissues) was used to construct a pseudo-bulk expression profile for each tissue and to prioritize the risk tissues related to each disease/trait.
In one embodiment, three single-cell datasets are collected for brain-related diseases: the mouse brain scRNA-seq dataset (n = 160,796 cells), the human brain olfactory cortex snRNA-seq dataset (n = 11,786 cells), and the human brain snRNA-seq dataset comprising both the olfactory cortex and the somatosensory cortex (n = 101,906 cells).
In one embodiment, to discover immune cell populations associated with severe COVID-19, a large-scale PBMC scRNA-seq dataset (n = 469,453 cells) was collected, containing 254 peripheral blood samples with varying degrees of COVID-19 severity (mild n = 109 samples, moderate n = 102 samples, severe n = 50 samples) and 16 healthy controls.
102: processing the single-cell sequencing data and pathway data by a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
in one embodiment, the step of processing the single-cell sequencing data and the pathway data by a machine learning method to obtain the PAS scoring matrix of the cell pathways and the PAS of each cell pathway comprises:
acquiring single-cell sequencing data and pathway data;
normalizing the gene-cell matrix in the single-cell sequencing data to obtain a normalized gene-cell matrix; specifically, the sparse gene-cell matrix in the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, giving the normalized expression of each gene in each cell; the normalization formula is:
[Figure SMS_17]
where [Figure SMS_18] denotes the original expression of gene g in cell j, and [Figure SMS_19] denotes the normalized expression of gene g in cell j;
based on the pathway data, converting the normalized gene-cell matrix into a pathway-cell matrix by the machine learning method, and obtaining the PAS scoring matrix of the cell pathways from the pathway-cell matrix, wherein the PAS scoring matrix contains the pathway activity score PAS of each single cell in each pathway;
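The normalization step can be sketched as follows. Because the formula itself survives only as an image reference, this sketch assumes the common log-normalization form log(1 + 10,000 x count / total counts of the cell), which agrees with the stated scale factor of 10,000; the function name `log_normalize` and the toy matrix are hypothetical.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize a gene x cell count matrix (one inner list per gene).

    Assumed form (not confirmed by the source):
        e[g][j] = log(1 + scale_factor * c[g][j] / total_counts[j])
    """
    n_cells = len(counts[0])
    # Total counts per cell, i.e. the column sums of the matrix.
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [math.log1p(scale_factor * row[j] / totals[j]) if totals[j] else 0.0
         for j in range(n_cells)]
        for row in counts
    ]

counts = [[1, 0], [3, 5]]  # toy data: 2 genes x 2 cells
normalized = log_normalize(counts)
```

Zero counts stay at zero after the transform, so the sparsity pattern of the original matrix is preserved.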
in one embodiment, the pathway data is KEGG pathway data, and the pathways from the KEGG database are the default gene sets for evaluating PAS; the normalized gene-cell matrix is converted into a pathway-cell matrix by singular value decomposition (SVD);
[Figure SMS_21] denotes the gene set of pathway i; for each pathway i, a matrix A_i is selected from the normalized gene-cell matrix A, wherein the columns of A_i are all N cells and the rows are the [Figure SMS_31] genes of the pathway gene set [Figure SMS_26]; SVD gives [Figure SMS_23], where U denotes a [Figure SMS_25] orthogonal matrix, [Figure SMS_29] denotes a diagonal matrix whose entries are all zero except those on the main diagonal, and [Figure SMS_33] denotes a [Figure SMS_20] orthogonal matrix;
the t-th column vector [Figure SMS_28] of the right orthogonal matrix [Figure SMS_24] represents the t-th principal component and reflects the co-expression variability of the pathway's genes in the single-cell data; since the first principal component PC1 captures the largest share of the variance, the projection of the features of cell j onto PC1 represents [Figure SMS_32]; for cell j, all expression variances in pathway i are used as weights to adjust the original [Figure SMS_22]; for gene g in pathway i, the gene expression [Figure SMS_27] is rescaled by min-max scaling to give the adjusted gene expression [Figure SMS_30].
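The pathway scoring described above can be sketched in plain Python. Instead of calling a full SVD routine, this sketch obtains only the leading right singular vector of each pathway sub-matrix by power iteration on A_i^T A_i, which is a swapped-in technique that yields the same PC1 direction; it assumes a non-negative expression matrix, applies no centering, and fixes the arbitrary sign so that the scores sum to a non-negative value. All function and variable names are hypothetical.

```python
import math

def first_right_singular_vector(a, n_iter=200, tol=1e-12):
    """Leading right singular vector of a genes x cells matrix (power iteration)."""
    n_genes, n_cells = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cells)] * n_cells  # uniform start vector
    for _ in range(n_iter):
        # u = A v, one value per gene
        u = [sum(a[g][j] * v[j] for j in range(n_cells)) for g in range(n_genes)]
        # w = A^T u, one value per cell
        w = [sum(a[g][j] * u[g] for g in range(n_genes)) for j in range(n_cells)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v  # degenerate (all-zero) sub-matrix
        w = [x / norm for x in w]
        delta = max(abs(x - y) for x, y in zip(w, v))
        v = w
        if delta < tol:
            break
    # The sign of a singular vector is arbitrary; pick a non-negative sum.
    if sum(v) < 0:
        v = [-x for x in v]
    return v

def pathway_activity_scores(expr, gene_names, pathways):
    """Return {pathway: [score per cell]} from a normalized genes x cells matrix."""
    index = {g: r for r, g in enumerate(gene_names)}
    scores = {}
    for name, genes in pathways.items():
        rows = [expr[index[g]] for g in genes if g in index]
        if rows:
            scores[name] = first_right_singular_vector(rows)
    return scores

expr = [[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]]  # toy data: 3 genes x 2 cells
pas = pathway_activity_scores(expr, ["g1", "g2", "g3"], {"pathwayA": ["g1", "g2"]})
```

Each pathway contributes one row of the resulting pathway-cell matrix, so the number of features drops from the number of genes to the number of pathways.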
In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS;
the weighted PAS is obtained by:
[Figure SMS_34]
where [Figure SMS_35] denotes the weighted PAS, [Figure SMS_36] denotes the optimized normalized expression of gene g in cell j, and [Figure SMS_37] denotes the pathway activity score PAS of pathway i in cell j;
in one embodiment, [Figure SMS_38] is obtained by:
[Figure SMS_39]
where [Figure SMS_40] denotes the normalized expression of gene g in cell j, [Figure SMS_41] denotes the maximum gene expression in pathway i, and [Figure SMS_42] denotes the minimum gene expression in pathway i.
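A minimal sketch of the rescaling and weighting follows. The formulas are only available as image references, so two assumptions are made and should be read as such: the minimum and maximum are taken over the genes of pathway i within cell j, and the weight of cell j is the sum of its rescaled pathway expression values, multiplied onto the raw PAS. The names `min_max_scale` and `weighted_pas` are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weighted_pas(pathway_expr, raw_pas):
    """Weight the raw PAS of one pathway, cell by cell.

    pathway_expr: genes x cells normalized expression of the pathway's genes.
    raw_pas: raw PAS of the pathway, one value per cell.
    Assumed weighting: raw PAS times the sum of rescaled pathway expression.
    """
    weighted = []
    for j in range(len(raw_pas)):
        column = [row[j] for row in pathway_expr]  # pathway genes in cell j
        weight = sum(min_max_scale(column))
        weighted.append(weight * raw_pas[j])
    return weighted

pathway_expr = [[0.0, 2.0], [1.0, 2.0], [4.0, 2.0]]  # toy data: 3 genes x 2 cells
result = weighted_pas(pathway_expr, [0.5, 0.8])
```

In the toy data the second cell expresses all pathway genes equally, so its rescaled values are zero and its weighted PAS collapses to zero, which shows how the weighting depends on within-pathway variation under these assumptions.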
Optionally, the machine learning method includes singular value decomposition (SVD); SVD greatly improves the computational efficiency of analyzing sparse matrices and yields the eigenvalues without computing the covariance matrix; by means of SVD, the normalized gene-cell matrix is reduced to a pathway-cell matrix in a low-dimensional space.
103: acquiring genetic association data, annotating the SNPs in the genetic association data to the pathway data based on the pathway data, and obtaining the genetic effect value of each SNP in each pathway;
in one embodiment, the step of annotating the SNPs in the genetic association data to the pathway data comprises:
screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each single gene to the corresponding pathways based on the pathway data to obtain SNP-annotated pathways;
optionally, the SNPs of a single gene are obtained by the following steps: after the SNPs of the genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain an assignment result;
where a single SNP corresponds to multiple genes, treating each such repeated pairing as an independent SNP-gene association; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting SNPs on sex chromosomes; and thereby obtaining the SNPs of the single gene;
the SNPs of the single genes are pooled to obtain the SNPs of all genes. Specifically, the genetic association data is GWAS data, and the SNPs in the GWAS summary statistics are assigned to related genes using 20 kb as the default distance parameter;
the symbol [Figure SMS_43] denotes gene g with SNP k; after the SNP-gene pairs are assigned, a single SNP may correspond to multiple genes; since the whole procedure infers the parameters from thousands of SNPs, SNPs that correspond to multiple genes do not affect the inference, so each such repeated pairing is treated as an independent SNP-gene association; SNPs with a minor allele frequency (MAF) greater than 0.1 are retained and SNPs on sex chromosomes are deleted, finally giving the SNPs of the related genes;
based on the pathways in the KEGG database, the genes with assigned SNPs are annotated to the pathways, with S_i (formula 2) denoting the set of SNPs in pathway i; linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using the phase 3 data of the 1000 Genomes Project; the scheme also provides functional gene sets such as GO, Reactome and MSigDB as alternative options. In addition, the major histocompatibility complex region (chr6: 25-35 Mbp), where extensive LD exists, is removed.
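The filtering and annotation rules above can be sketched as a single function. The record layout (dictionaries with `id`, `chrom`, `pos`, `maf` and `name`, `chrom`, `start`, `end`) and the function name `annotate_snps` are hypothetical; the thresholds (20 kb, MAF greater than 0.1, chr6: 25-35 Mbp) are the ones stated in the text, and reading the 20 kb parameter as a symmetric window around each gene is an assumption.

```python
def annotate_snps(snps, genes, pathways, window=20_000, maf_min=0.1):
    """Map filtered SNPs to genes within `window` bp, then to pathways.

    Returns {pathway: sorted list of SNP ids}. A SNP that falls within the
    window of several genes is kept once per gene, i.e. every SNP-gene pair
    is treated as an independent association.
    """
    mhc_chrom, mhc_start, mhc_end = "6", 25_000_000, 35_000_000
    kept = [
        s for s in snps
        if s["maf"] > maf_min                      # MAF filter
        and s["chrom"] not in ("X", "Y")           # drop sex chromosomes
        and not (s["chrom"] == mhc_chrom
                 and mhc_start <= s["pos"] <= mhc_end)  # drop MHC region
    ]
    gene_to_snps = {}
    for g in genes:
        for s in kept:
            same_chrom = s["chrom"] == g["chrom"]
            if same_chrom and g["start"] - window <= s["pos"] <= g["end"] + window:
                gene_to_snps.setdefault(g["name"], set()).add(s["id"])
    return {
        p: sorted(set().union(*(gene_to_snps.get(g, set()) for g in members)))
        for p, members in pathways.items()
    }

snps = [
    {"id": "rs1", "chrom": "1", "pos": 1_000, "maf": 0.30},
    {"id": "rs2", "chrom": "1", "pos": 6_000, "maf": 0.05},       # low MAF
    {"id": "rs3", "chrom": "X", "pos": 6_000, "maf": 0.40},       # sex chromosome
    {"id": "rs4", "chrom": "6", "pos": 30_000_000, "maf": 0.40},  # MHC region
    {"id": "rs5", "chrom": "1", "pos": 30_500, "maf": 0.20},
]
genes = [
    {"name": "A", "chrom": "1", "start": 5_000, "end": 10_000},
    {"name": "B", "chrom": "1", "start": 40_000, "end": 45_000},
    {"name": "M", "chrom": "6", "start": 29_999_000, "end": 30_001_000},
]
pathways = {"P1": ["A", "B"], "P2": ["M"]}
annotated = annotate_snps(snps, genes, pathways)
```

The nested loop is quadratic in the number of SNPs and genes, which is fine for illustration; a real GWAS-scale run would need an interval index or sorted positions.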
In one embodiment, the GWAS data correspond to a given phenotype, and the annotation of the given phenotype includes binary traits, continuous dependent traits, or intra-phenotype and center measurements.
104: estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;
in one embodiment, the genetic effect value is obtained by the following formula:
[Figure SMS_44]
where [Figure SMS_45] denotes the theoretical effect-size vector of the m SNPs, [Figure SMS_46] denotes the random environmental error, R denotes the LD matrix, and X^T denotes the standardized genotypes of the SNPs in the genetic association data sample;
in one embodiment, [Figure SMS_47] denotes the set of all SNPs contained in the genes mapped to each pathway i; the polygenic model assumes a priori that the effect sizes of all SNPs in pathway i follow a multivariate normal distribution, where [Figure SMS_48] denotes the variance of the SNP effect sizes in the pathway, and [Figure SMS_49] denotes the [Figure SMS_50] identity matrix;
in one embodiment, the estimation coefficient is obtained by:
[Figure SMS_51]
where [Figure SMS_52] denotes the estimation coefficient of pathway i in cell j, which reflects the effect of the cell-specific PAS on the variance of the GWAS effect sizes, i.e., the influence of inheritance on the response; [Figure SMS_53] denotes the intercept term, [Figure SMS_54] denotes the variance of the SNP effect sizes in the pathway, and [Figure SMS_55] denotes the weighted PAS;
in one embodiment, based on the previous assumptions, the distribution of the genetic effect value [Figure SMS_56] is estimated using the following formula:
[Figure SMS_57]
the estimation coefficients are optimized using this formula;
in one embodiment, to optimize the estimation coefficient of each pathway in the polygenic regression model, the model is fitted by the method of moments, which significantly improves the efficiency of computation and estimation; the observed and expected squared effects of the SNPs associated with each pathway are then fitted, and the expected value is estimated by the following formula:
[Figure SMS_58]
where Tr denotes the matrix trace.
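For orientation, one way such a moment condition can arise is worked out below. This is a sketch under stated assumptions, not a transcription of the patent's formulas, which survive only as image references: the notation (marginal estimates \hat{\beta}, true effects \beta, coefficients \tau_0 and \tau_{i,j}, weighted PAS \tilde{s}_{i,j}, GWAS sample size n, error variance \sigma_\epsilon^2) is assumed, genotype columns are taken to be standardized, and true effects are taken to be independent with zero mean.

```latex
% Assumed notation throughout; see the hedges in the lead-in.
\begin{aligned}
\hat{\beta} &= R\,\beta + \tfrac{1}{n}\,X^{T}\epsilon,
  \qquad \beta_k \sim \mathcal{N}\!\left(0,\ \sigma_k^{2}\right) \text{ independently},\\[2pt]
\sigma_k^{2} &= \tau_0 + \tau_{i,j}\,\tilde{s}_{i,j}
  \quad \text{for SNP } k \text{ in pathway } i \text{ and cell } j,\\[2pt]
\mathbb{E}\!\left[\hat{\beta}_k^{2}\right]
  &= \sum_{l} R_{kl}^{2}\,\sigma_l^{2} + \frac{\sigma_\epsilon^{2}}{n}.
\end{aligned}
```

When every SNP in the set shares one variance \sigma^2, summing the first term over k gives \sigma^2 \sum_{k,l} R_{kl}^2 = \sigma^2\,\mathrm{Tr}(R^{T}R), which is where a trace term of the kind mentioned above would enter. Matching the observed squared effects to these expectations is then a linear least-squares problem in \tau_0 and \tau_{i,j}.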
105: processing the estimated coefficient and PAS to obtain genetic related pathway activity score gPAS of the cell;
in one embodiment, the step of processing the estimated coefficients and PAS to obtain a genetically related pathway activity score, gPAS, of the cell comprises: multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
the genetic related pathway activity score gPAS (gP_j) is obtained by:
Figure SMS_59
wherein
Figure SMS_60
denotes the optimized estimation coefficient;
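The multiply-and-sum step that yields gPAS can be sketched as follows. This is only an illustrative sketch: the function name, the array layout (pathways in rows, cells in columns), and the option of passing either one coefficient per pathway or one per pathway-cell pair are assumptions, since the formula image itself is not reproduced in this text.

```python
import numpy as np

def gpas_per_cell(tau_hat, weighted_pas):
    """Sum over pathways of (estimated coefficient x weighted PAS) for each cell.

    tau_hat      : shape (n_pathways,) or (n_pathways, n_cells); assumed layout
    weighted_pas : shape (n_pathways, n_cells); assumed layout
    """
    tau_hat = np.asarray(tau_hat, dtype=float)
    weighted_pas = np.asarray(weighted_pas, dtype=float)
    if tau_hat.ndim == 1:
        # One coefficient per pathway: broadcast it across all cells.
        tau_hat = tau_hat[:, None]
    if tau_hat.shape[0] != weighted_pas.shape[0]:
        raise ValueError("tau_hat and weighted_pas must cover the same pathways")
    # Multiply coefficient by PAS, then sum over the pathway axis.
    return (tau_hat * weighted_pas).sum(axis=0)
```

The result is one score per cell, which is the quantity later correlated with gene expression.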
106: performing correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.
Optionally, performing the correlation analysis and ranking comprises: determining the correlation between the expression of each single gene and the gPAS score by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes; specifically, to maximize statistical power, the expression of each gene g is inversely weighted by its gene-specific technical noise level, which is estimated by modeling the mean-variance relationship across genes in the scRNA-seq data;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order; N is not limited to 1000 and may be any natural number.
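The Pearson-correlation ranking described above can be illustrated with a short sketch. The function name and the matrix layout (genes in rows, cells in columns) are assumptions, and the inverse weighting by technical noise is deliberately left out, so this shows only the unweighted correlation and the descending-order selection of the top N genes.

```python
import numpy as np

def rank_genes_by_pcc(expr, gpas, n_top=1000):
    """Rank genes by Pearson correlation between expression and gPAS.

    expr : (n_genes, n_cells) expression matrix; assumed layout
    gpas : (n_cells,) gPAS score of each cell
    Returns (indices of the n_top genes in descending PCC order, PCC of every gene).
    """
    expr = np.asarray(expr, dtype=float)
    gpas = np.asarray(gpas, dtype=float)
    xc = expr - expr.mean(axis=1, keepdims=True)   # center each gene
    yc = gpas - gpas.mean()                        # center gPAS
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        pcc = (xc @ yc) / denom
    # Genes with constant expression have no defined correlation.
    pcc = np.where(denom > 0, pcc, np.nan)
    # Sort descending, pushing undefined correlations to the end.
    order = np.argsort(-np.nan_to_num(pcc, nan=-np.inf), kind="stable")
    return order[:n_top], pcc
```

Taking the tail of `order` instead of its head would give the ascending-order variant mentioned in the text.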
In one embodiment, the method further comprises: calculating a trait-related score TRS of each cell according to the N trait-related genes; optionally, the TRS of the N trait-related genes is calculated using the cell scoring method of the AddModuleScore function in Seurat; the TRS of each cell is given by: TRS = average RE(GS) - average RE(CG); wherein average RE(GS) is the average relative expression of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression of a control gene set of the same size randomly drawn from the existing gene library; RE denotes relative expression, GS the gene set, and CG the control gene set;
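The TRS expression, average RE(GS) minus average RE(CG), can be sketched in a few lines. This is a simplified illustration and not a reimplementation of Seurat's AddModuleScore: here the control genes are drawn uniformly at random from the genes outside the trait-related set, which is an assumption, whereas Seurat selects controls from expression-matched bins. The function name and matrix layout are also assumptions.

```python
import numpy as np

def trait_related_score(rel_expr, gene_set, seed=0):
    """TRS = average RE(GS) - average RE(CG), computed for every cell.

    rel_expr : (n_genes, n_cells) relative expression matrix; assumed layout
    gene_set : row indices of the N trait-related genes
    """
    rel_expr = np.asarray(rel_expr, dtype=float)
    gene_set = np.asarray(gene_set, dtype=int)
    rng = np.random.default_rng(seed)
    # Candidate control genes: every gene not in the trait-related set (assumption).
    pool = np.setdiff1d(np.arange(rel_expr.shape[0]), gene_set)
    if pool.size < gene_set.size:
        raise ValueError("not enough genes left to draw an equally sized control set")
    control = rng.choice(pool, size=gene_set.size, replace=False)
    return rel_expr[gene_set].mean(axis=0) - rel_expr[control].mean(axis=0)
```

A positive TRS for a cell means the trait-related genes are, on average, more highly expressed in that cell than the randomly drawn control genes.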
in one embodiment, the statistical significance of each individual cell is determined from the rank distribution of the trait-related genes, to further evaluate whether the cell is significantly related to the trait of interest; specifically, the percent rank of each trait-related gene in the cell is obtained as
Figure SMS_61
, wherein
Figure SMS_62
denotes the expression level of gene g in cell j, and G denotes the number of genes associated with the specified trait; the gene percent rank follows the normal distribution
Figure SMS_63
; under the null hypothesis that there is no association between the gene percent ranks, a statistic
Figure SMS_64
is obtained for each cell by the following formula:
Figure SMS_65
Because single-cell data contain a large number of cells, the central limit theorem is used to derive the distribution of
Figure SMS_66
:
Figure SMS_67
Figure SMS_68
Figure SMS_69
wherein N is the total number of cells; the hypothesis for the significance test is:
Figure SMS_70
; the P value for each cell j is:
Figure SMS_71
An application of the method, comprising any one of the following:
obtaining trait-related cells based on the trait-related score TRS of each cell and the cell-level P value of each cell (indicating whether an individual cell is associated with the trait);
alternatively, obtaining a trait-related cell type or subpopulation (determining whether the cell type to which individual cells belong is associated with the trait) based on the block bootstrap method; specifically, a group of cells is treated as a pseudo-bulk transcriptome profile, and gene expression is averaged across the cells within a given cell type; for the cell type association, the standard error is estimated with the block bootstrap method, and a t statistic with its corresponding P value is calculated for each cell type. Because the goal of the block bootstrap is to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to partition the genome into multiple biologically meaningful blocks, and these pathway-based blocks are sampled with replacement. Under default parameters, 200 block bootstrap iterations are performed for each cell type association analysis; the default parameters may be modified in a specific implementation.
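The pathway-block resampling idea can be sketched generically. The text does not state which statistic is recomputed in each iteration, so this sketch assumes a hypothetical per-pathway value for one cell type and uses the mean as a stand-in statistic; the function name and parameters are likewise illustrative. Only the resampling of whole blocks with replacement and the default of 200 iterations come from the description.

```python
import numpy as np

def block_bootstrap_se(block_values, statistic=np.mean, n_iter=200, seed=0):
    """Estimate a standard error by resampling whole blocks with replacement.

    block_values : one value per pathway block for a cell type (hypothetical input)
    statistic    : function applied to the resampled blocks (stand-in: the mean)
    Returns (point estimate, bootstrap standard error, t statistic).
    """
    block_values = np.asarray(block_values, dtype=float)
    rng = np.random.default_rng(seed)
    n_blocks = block_values.shape[0]
    estimate = float(statistic(block_values))
    boot = np.empty(n_iter)
    for b in range(n_iter):
        # Draw n_blocks block indices with replacement; each block stays intact.
        idx = rng.integers(0, n_blocks, size=n_blocks)
        boot[b] = statistic(block_values[idx])
    se = float(boot.std(ddof=1))
    t_stat = estimate / se if se > 0 else float("nan")
    return estimate, se, t_stat
```

Resampling pathways rather than individual genes keeps the genes of one pathway together, which is the structure-preserving property the block bootstrap is chosen for.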
Optionally, the genetic related pathway activity scores gPAS are ranked, and trait-related pathways are obtained according to the ranking result (the top-ranked pathways are selected) and the P value of each pathway at the cell type level; specifically, the gPAS are ranked and tested based on the central limit theorem; using the symbol
Figure SMS_72
to denote cell type t, the percent rank of the pathway is calculated for each cell j within
Figure SMS_75
:
Figure SMS_79
, wherein
Figure SMS_73
denotes the gPAS rank of pathway i in cell j, and M denotes the total number of pathways; similarly, the statistical significance of each pathway i in cell type t is calculated using the following formula:
Figure SMS_76
, wherein
Figure SMS_78
Figure SMS_81
Figure SMS_74
The hypothesis is:
Figure SMS_77
; the P value for each pathway i in cell type t is:
Figure SMS_80
FIG. 2 is a schematic diagram of a screening device for trait-related genes provided by an embodiment of the invention, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is used for calling the program instructions, and when the program instructions are executed, the above screening method for trait-related genes is performed.
FIG. 3 is a schematic diagram of a screening system for trait-related genes provided by an embodiment of the invention, the system comprising:
an acquisition unit 301 for acquiring single cell sequencing data;
a first processing unit 302, configured to process single-cell sequencing data and pathway data by using a machine learning method, so as to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
a second processing unit 303, configured to obtain genetic association data, and annotate SNPs in the genetic association data into the pathway data based on the pathway data, so as to obtain genetic effect values of all SNPs in the single pathway data;
a third processing unit 304, configured to perform parameter estimation on the distribution of the genetic effect values, based on the PAS and the genetic effect value of each SNP in the pathway data, by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient;
a fourth processing unit 305 for processing the estimation coefficient and PAS to obtain a genetic related pathway activity score gPAS of the cell;
and a fifth processing unit 306, configured to perform correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, rank the genes, and screen out N trait-related genes.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described trait-related-gene screening method.
FIG. 4 is an overview diagram provided by an embodiment of the invention, in which gPAS is obtained by the single-cell pathway-based scoring method and is used to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations, and trait-related pathways;
wherein A represents converting the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 representing the PAS of each pathway; B represents annotating the SNPs in the GWAS data to the corresponding pathways; C represents the polygenic regression model, wherein the top panel shows the coefficient of each pathway being estimated with the polygenic regression model and the gPAS being calculated from the estimated coefficients and the corresponding PAS, and the bottom panel shows the Pearson correlation model, which correlates the gPAS of each cell with the expression of each gene across all individual cells to rank the trait-related genes; the AddModuleScore function in Seurat is applied to the top N trait-related genes (top 1000 by default) to calculate a trait-related score TRS for each cell; D represents the output, which comprises four items: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.
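Panel A, in which singular value decomposition turns the gene-cell matrix into a pathway-cell matrix and PC1 serves as the PAS, can be sketched as follows. The function name, the dictionary input, and the centering of each gene across cells before the decomposition are assumptions made for illustration; the sign of a principal component is arbitrary, so a real pipeline would need its own sign convention.

```python
import numpy as np

def pathway_activity_scores(norm_expr, pathways):
    """First principal component score of every cell, per pathway.

    norm_expr : (n_genes, n_cells) normalized gene-cell matrix; assumed layout
    pathways  : dict mapping pathway name -> list of gene row indices (hypothetical input)
    Returns (pathway names, (n_pathways, n_cells) PAS matrix).
    """
    norm_expr = np.asarray(norm_expr, dtype=float)
    names = list(pathways)
    pas = np.empty((len(names), norm_expr.shape[1]))
    for r, name in enumerate(names):
        sub = norm_expr[np.asarray(pathways[name], dtype=int)]
        # Center each gene across cells before the decomposition (assumption).
        sub = sub - sub.mean(axis=1, keepdims=True)
        _, s, vt = np.linalg.svd(sub, full_matrices=False)
        # Coordinates of the cells along the first principal component.
        pas[r] = s[0] * vt[0]
    return names, pas
```

Each row of the returned matrix summarizes one pathway with a single score per cell, which is the pathway-cell matrix that the later steps consume.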
The results of the verification of the present verification embodiment show that assigning an inherent weight to an indication may moderately improve the performance of the present method relative to the default settings.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
The computer device provided by the present invention has been described in detail above; those skilled in the art will appreciate that the foregoing description is not intended to limit the invention, the scope of which is defined by the appended claims.

Claims (16)

1. A method for screening a trait-related gene, comprising:
acquiring single cell sequencing data and pathway data;
carrying out standardization treatment on the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix;
based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;
acquiring genetic association data, annotating SNPs in the genetic association data into the pathway data based on the pathway data, and obtaining genetic effect values of all SNPs in single pathway data; the annotating SNPs in the genetic association data into the pathway data comprises: screening the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene to the corresponding pathways based on the pathway data to obtain pathways with SNP annotations;
performing parameter estimation on the distribution of the genetic effect values, based on the PAS and the genetic effect value of each SNP in the pathway data, by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient; the genetic effect value is obtained by the following formula:
Figure QLYQS_1
; wherein
Figure QLYQS_2
denotes the theoretical effect size vector of m SNPs,
Figure QLYQS_3
denotes the random environmental error, R denotes the LD matrix, and X^T denotes the standardized genotypes of the SNPs in the genetic association data sample;
processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell;
and performing correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.
2. The method for screening a gene related to a trait according to claim 1,
the estimation coefficient is obtained as follows:
Figure QLYQS_4
wherein
Figure QLYQS_5
denotes the estimated coefficient of pathway i in cell j,
Figure QLYQS_6
denotes the intercept term,
Figure QLYQS_7
denotes the variance of the SNP effect sizes in the pathway, and
Figure QLYQS_8
denotes the weighted PAS.
3. The method according to claim 1, wherein the step of processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell comprises: multiplying the estimated coefficient by the PAS and then summing to obtain the genetic related pathway activity score gPAS of the cell.
4. The method for screening a gene related to a trait according to claim 3,
the genetic related pathway activity score gPAS is obtained as follows:
Figure QLYQS_9
wherein gP_j is the genetic related pathway activity score gPAS,
Figure QLYQS_10
is the optimized estimation coefficient, and
Figure QLYQS_11
denotes the weighted PAS.
5. The method according to claim 1, wherein the step of converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method based on the pathway data, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, the PAS scoring matrix comprising pathway activity scores PAS of single cells in a single pathway, further comprises optimizing the pathway activity scores PAS to obtain weighted PAS;
the weighted PAS is obtained as follows:
Figure QLYQS_12
wherein
Figure QLYQS_13
denotes the weighted PAS,
Figure QLYQS_14
denotes the optimized normalized expression of gene g in cell j, and
Figure QLYQS_15
denotes the pathway activity score PAS of pathway i in cell j.
6. The method for screening a gene related to a trait according to claim 5, wherein
Figure QLYQS_16
is obtained as follows:
Figure QLYQS_17
wherein
Figure QLYQS_18
denotes the normalized expression of gene g in cell j,
Figure QLYQS_19
denotes the maximum gene expression value in pathway i of cell j, and
Figure QLYQS_20
denotes the minimum gene expression value in pathway i of cell j.
7. The method for screening a gene related to a trait according to claim 1, wherein the method for machine learning comprises a method for Singular Value Decomposition (SVD).
8. The method for screening a gene related to a trait according to claim 1, wherein the step of obtaining SNPs of the single gene comprises: after the SNPs of genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain an assignment result;
where a single SNP corresponds to multiple genes, treating each resulting SNP-gene pair as an independent pair; retaining SNPs with a minor allele frequency greater than 0.1 in the assignment result; removing SNPs on sex chromosomes; and obtaining the SNPs of the single gene.
9. The method for screening a gene related to a trait according to claim 1, wherein the method further comprises: calculating the character related score TRS of each cell according to the N character related genes.
10. The method for screening a gene related to a trait according to claim 9,
calculating the trait related score TRS of the N trait related genes by using a cell scoring method.
11. The method for screening a gene related to a trait according to claim 1,
performing the correlation analysis and ranking between the genetic related pathway activity score gPAS and the gene expression level of each cell comprises: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient, and ranking the genes by this correlation to obtain the N trait-related genes.
12. The method for screening a gene related to a trait according to claim 11,
the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order.
13. Use of a method according to any one of claims 1-12, comprising any one of the following:
obtaining trait related cells based on the trait related score TRS for each cell and the level P value for each cell;
obtaining a trait-related cell type or subpopulation based on the block bootstrap method;
and ranking the genetic related pathway activity scores gPAS, and obtaining the trait-related pathway according to the ranking result and the P value of the pathway at the cell type level.
14. A screening apparatus for a trait-related gene, the apparatus comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is configured to invoke program instructions, which when executed, are configured to perform the screening method of the trait-related gene of any one of claims 1-12.
15. A screening system for a trait-related gene, comprising:
an acquisition unit for acquiring single cell sequencing data and pathway data;
the first processing unit is used for carrying out standardization processing on the gene-cell matrix in the single-cell sequencing data to obtain the standardized gene-cell matrix; based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;
the second processing unit is used for acquiring genetic association data, and annotating SNPs in the genetic association data into the pathway data based on the pathway data to obtain genetic effect values of all SNPs in single pathway data, wherein annotating the SNPs in the genetic association data into the pathway data comprises: screening the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene to the corresponding pathways based on the pathway data to obtain pathways with SNP annotations;
the third processing unit is configured to perform parameter estimation on the distribution of the genetic effect values, based on the PAS and the genetic effect value of each SNP in the pathway data, by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient, wherein the genetic effect value is obtained by the following formula:
Figure QLYQS_21
; wherein
Figure QLYQS_22
denotes the theoretical effect size vector of m SNPs,
Figure QLYQS_23
denotes the random environmental error, R denotes the LD matrix, and X^T denotes the standardized genotypes of the SNPs in the genetic association data sample;
a fourth processing unit for processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell;
and a fifth processing unit, configured to perform correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, rank the genes, and screen out N trait-related genes.
16. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the trait-related gene screening method of any one of claims 1 to 11.
CN202211277659.2A 2022-10-19 2022-10-19 Screening method and system for character related genes Active CN115588465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211277659.2A CN115588465B (en) 2022-10-19 2022-10-19 Screening method and system for character related genes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211277659.2A CN115588465B (en) 2022-10-19 2022-10-19 Screening method and system for character related genes

Publications (2)

Publication Number Publication Date
CN115588465A CN115588465A (en) 2023-01-10
CN115588465B CN115588465B (en) 2023-05-23

Family

ID=84779173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211277659.2A Active CN115588465B (en) 2022-10-19 2022-10-19 Screening method and system for character related genes

Country Status (1)

Country Link
CN (1) CN115588465B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253549B (en) * 2023-11-15 2024-02-09 苏州元脑智能科技有限公司 Determination method and device of path correlation, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109524059A (en) * 2018-12-28 2019-03-26 华中农业大学 A kind of animal individual genomic breeding value appraisal procedure of fast and stable
WO2020234666A1 (en) * 2019-05-23 2020-11-26 King Abdullah University Of Science And Technology Deep learning based system and method for prediction of alternative polyadenylation site
CN114783613A (en) * 2022-04-13 2022-07-22 温州医科大学附属眼视光医院 Myopia prediction analysis method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170277826A1 (en) * 2016-03-27 2017-09-28 Insilico Medicine, Inc. System, method and software for robust transcriptomic data analysis
US20210071255A1 (en) * 2019-09-06 2021-03-11 The Broad Institute, Inc. Methods for identification of genes and genetic variants for complex phenotypes using single cell atlases and uses of the genes and variants thereof

Also Published As

Publication number Publication date
CN115588465A (en) 2023-01-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant