CN115588465B - Screening method and system for character related genes

Info

Publication number: CN115588465B
Application number: CN202211277659.2A
Authority: CN
Inventors: 苏建忠; 马云龙; 邓春玉; 瞿佳
Original assignee: Wenzhou Medical University
Current assignee: Wenzhou Medical University
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2023-05-23
Anticipated expiration: 2042-10-19
Also published as: CN115588465A

Abstract

The invention discloses a screening method, system, device and computer-readable storage medium for trait-related genes. The screening method comprises the following steps: acquiring single-cell sequencing data; processing the single-cell sequencing data and pathway data by a machine learning method to obtain a pathway activity score (PAS) matrix of the cell pathways and the PAS of each cell pathway; acquiring genetic association data, annotating the SNPs in the genetic association data to the pathways based on the pathway data, and obtaining the genetic effect values of all SNPs in each single pathway; using a polygenic regression model of the genetic association data, estimating the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients; processing the estimation coefficients and the PAS to obtain the genetically related pathway activity score (gPAS) of each cell; and performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.

Description

Screening method and system for character related genes

Technical Field

The invention relates to the technical field of gene sequencing, and in particular to a screening method and a screening system for trait-related genes.

Background

Using single-cell RNA sequencing (scRNA-seq) to identify key cell subpopulations associated with complex diseases or traits is critical to understanding the mechanisms of complex diseases. However, because of its high cost and low throughput, scRNA-seq cannot be applied at large scale: most current single-cell studies include no more than 20 samples, which limits statistical power and prevents accurate identification of the disease- or trait-associated risk subsets within a cell subpopulation. In addition, scRNA-seq data are characterized by high sparsity, technical noise and unstable variance at the gene level.

Genome-wide association studies (GWAS) are widely used to study complex diseases and traits. Linking scRNA-seq data with the phenotype-associated genetic information from large-sample GWAS is considered a practical and efficient way to reveal the genetic mechanisms of complex diseases or traits at single-cell resolution.

Existing methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases, such as LDSC-SEG, MAGMA and RolyPoly, require extensive parameter tuning in order to annotate cell types with known marker genes, and they largely ignore the heterogeneity within each cell type. Furthermore, the prior art can identify highly expressed genes, but this focus on highly expressed genes may underestimate the functional role of genes whose expression is relatively low yet important for determining cell fate.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, the invention provides a screening method and a screening system for trait-related genes. Using a single-cell pathway-based scoring method, the invention combines scRNA-seq data with genetic association data to infer trait-related genes, cells and the like, thereby mining the biological regularities hidden in single-cell data and addressing the related life-science problems.

The application discloses a screening method of a trait related gene, comprising the following steps:

acquiring single cell sequencing data;

processing the single cell sequencing data and the pathway data by adopting a machine learning method to obtain PAS scoring matrix of the cell pathway and PAS of the cell pathway;

acquiring genetic association data, annotating SNPs in the genetic association data into the pathway data based on the pathway data, and obtaining genetic effect values of all SNPs in single pathway data;

using a polygenic regression model of the genetic association data, estimating the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;

processing the estimation coefficients and the PAS to obtain the genetically related pathway activity score gPAS of each cell;

and performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.

The genetic effect value is obtained by a formula (not reproduced in this text) whose terms are: the theoretical effect-size vector of the m SNPs; the random environmental error; R, the LD matrix; and $X^T$, the standardized genotypes of the SNPs in the genetic association data samples;

optionally, the estimation coefficient is obtained by:

$$\sigma_i^2 = \tau_0 + \tau_{i,j}\,\tilde{s}_{i,j}$$

wherein $\tau_{i,j}$ is the estimation coefficient of pathway i in cell j, $\tau_0$ is the intercept term, $\sigma_i^2$ is the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ is the weighted PAS;

the step of processing the estimated coefficient and the PAS to obtain a genetically related pathway activity score, gPAS, of the cell comprises: multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;

the genetically related pathway activity score gPAS ($gP_j$) is obtained by:

$$gP_j = \sum_i \hat{\tau}_{i,j}\, s_{i,j}$$

wherein $gP_j$ is the gPAS of cell j, $s_{i,j}$ is the PAS of pathway i in cell j, and $\hat{\tau}_{i,j}$ is the optimized estimation coefficient;

the step of processing the single cell sequencing data and the pathway data by using a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway comprises the following steps:

acquiring single cell sequencing data and pathway data;

carrying out standardization treatment on the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix;

based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;

optionally, optimizing the pathway activity score PAS to obtain the weighted PAS;

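the weighted PAS is obtained by a formula (not reproduced in this text) relating: the weighted PAS $\tilde{s}_{i,j}$; the rescaled normalized expression $\tilde{e}_{g,j}$ of gene g in cell j; and the pathway activity score $s_{i,j}$ of pathway i in cell j;

optionally, $\tilde{e}_{g,j}$ is obtained by min-max scaling:

$$\tilde{e}_{g,j} = \frac{e_{g,j} - e_i^{\min}}{e_i^{\max} - e_i^{\min}}$$

wherein $e_{g,j}$ is the normalized expression of gene g in cell j, and $e_i^{\max}$ and $e_i^{\min}$ are the maximum and minimum gene expression in pathway i.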

Optionally, the machine learning method includes a Singular Value Decomposition (SVD) method.

Annotating SNPs in the genetic association data into pathway data comprises:

screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each gene to the corresponding pathways based on the pathway data, to obtain pathways annotated with SNPs;

optionally, the step of obtaining the SNPs of a single gene includes: after the SNPs of the genes in the genetic association data are obtained, assigning SNPs to genes to form SNP-gene pairs, giving an assignment result;

where a single SNP is assigned to several genes, treating each of the resulting duplicated SNP-gene pairs as an independent pair; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting SNPs on the sex chromosomes; thereby obtaining the SNPs of the single gene;

the SNPs of the single genes are collected to obtain the SNPs of all genes.

The method further comprises: calculating a trait-related score (TRS) for each cell from the N trait-related genes; optionally, the TRS of the N trait-related genes is calculated using a cell scoring method.

Optionally, the correlation analysis and ranking of the gPAS against the gene expression level of each cell comprises: determining the correlation between the expression of each gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes;

optionally, the N trait-related genes are the first 1000 or the last 1000 genes when the genes are sorted by correlation in descending or ascending order.

An application comprising any one of:

obtaining trait-related cells based on the trait-related score TRS of each cell and the cell-level P value of each cell;

alternatively, obtaining trait-related cell types or subpopulations by the block bootstrap method;

alternatively, ranking the gPAS values and obtaining trait-related pathways from the ranking result and the P value of each pathway at the cell-type level;

a screening apparatus for a trait-related gene, the apparatus comprising: a memory and a processor;

the memory is used for storing program instructions; the processor is used for calling program instructions, and when the program instructions are executed, the processor is used for executing the screening method of the character related genes.

A screening system for a trait-related gene, comprising:

an acquisition unit for acquiring single cell sequencing data;

the first processing unit is used for processing the single-cell sequencing data and the pathway data by adopting a machine learning method to obtain a PAS scoring matrix of the cell pathway and PAS of the cell pathway;

the second processing unit is used for acquiring genetic association data, annotating the SNPs in the genetic association data to the pathways based on the pathway data, and obtaining the genetic effect values of all SNPs in each single pathway;

the third processing unit is used for estimating, with a polygenic regression model of the genetic association data, the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;

a fourth processing unit for processing the estimation coefficients and the PAS to obtain the genetically related pathway activity score gPAS of each cell;

and a fifth processing unit for performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described trait-related gene screening method.

The application has the following beneficial effects:

1. The application discloses a single-cell pathway-based scoring method that applies a polygenic regression model to pathway-activity-transformed scRNA-seq data and genetic association study data to reveal trait-related genes and cell subpopulations. It effectively addresses the problem that small sample sizes and high sparsity in scRNA-seq data greatly hinder the identification of genes and cell subpopulations associated with the polygenic risk of complex diseases, which limits statistical power and prevents accurate identification of disease- or trait-associated risk subsets within cell subpopulations. The method mines the biological regularities hidden in single-cell sequencing data and analyzes multiple dimensions, such as the relationship between population genetic variation and disease and single-cell gene abundance information, thereby greatly improving the accuracy and depth of data analysis.

2. Based on large-scale simulations and real data, the method combines scRNA-seq data with genetic association data and thus overcomes the prior-art problems of requiring extensive parameter tuning to annotate cell types with known marker genes and of largely ignoring the heterogeneity within each cell type. It also avoids underestimating, through an excessive focus on highly expressed genes, the functional role of genes whose expression is relatively low but important for determining cell fate: by aggregating the effects of genes with low average expression, it helps identify disease-related early developmental events or progenitor cells, for example key transcription factors involved in cell development. At the same time, it effectively reduces the sparsity and technical noise of scRNA-seq data and is robust and powerful in identifying trait-related cell types and subpopulations.

3. The application discloses a method for screening trait-related genes based on single-cell pathway scoring. By integrating the functional effects of different genes participating in the same biological pathway, it obtains stable cell states and markedly increases statistical power, biological interpretability and reproducibility of results; it overcomes the limitation of relying on known annotated cell types and can discover new genetically related subpopulations as well as key genes or pathways of cell types, and therefore has wide applicability and strong practicability.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is an analysis schematic flow chart of a screening method of a trait related gene provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of screening equipment for trait related genes provided by an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a screening system for trait related genes provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram, provided by an embodiment of the invention, of obtaining the gPAS by the single-cell pathway-based scoring method and using the gPAS to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations and trait-related pathways.

Detailed Description

In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.

In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments according to the invention without any creative effort, are within the protection scope of the invention.

Fig. 1 is a schematic flow chart of a screening method of a trait related gene provided by the embodiment of the invention, specifically, the method comprises the following steps:

101: acquiring single cell sequencing data;

in one embodiment, the single-cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from human (Homo sapiens) and mouse (Mus musculus). For blood cells, two scRNA-seq datasets, from human BMMCs (n = 35,582 cells) and human PBMCs (n = 97,039 cells), were collected to reveal trait-related cell subpopulations or types. For immune- or metabolism-related diseases/traits, the scRNA-seq dataset from the Human Cell Landscape (HCL, n = 513,707 cells in 35 adult tissues) was used to construct a pseudo-bulk expression profile for each tissue and to prioritize the risk tissues associated with each disease/trait.

In one embodiment, three single-cell datasets are collected for brain-related diseases: the mouse brain scRNA-seq dataset (n = 160,796 cells), the human brain olfactory cortex snRNA-seq dataset (n = 11,786 cells), and the human brain snRNA-seq dataset covering both the olfactory cortex and the somatosensory cortex (n = 101,906 cells).

In one example, to discover immune cell populations associated with severe COVID-19, a large-scale PBMC scRNA-seq dataset (n = 469,453 cells) was collected, containing 254 peripheral blood samples with varying degrees of COVID-19 severity (mild n = 109 samples, moderate n = 102 samples, severe n = 50 samples) and 16 healthy controls.

102: processing single-cell sequencing data and pathway data by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;

in one embodiment, the step of processing single cell sequencing data and pathway data to obtain a PAS scoring matrix of a cell pathway and obtaining PAS of the cell pathway by using a machine learning method comprises the following steps:

acquiring single cell sequencing data and pathway data;

carrying out standardization treatment on the gene-cell matrix in the single-cell sequencing data to obtain a standardized gene-cell matrix; specifically, the sparse gene-cell matrix in the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, yielding the normalized expression of each gene in each cell; the normalization formula (not reproduced in this text) relates the raw expression of gene g in cell j to the normalized expression of gene g in cell j;

based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scoring PAS of single cells in a single pathway;

in one embodiment, the pathway data are KEGG pathway data, and the pathways from the KEGG database are the default gene sets for evaluating the PAS; the standardized gene-cell matrix is converted into a pathway-cell matrix by singular value decomposition (SVD). Let $G_i$ denote the gene set of pathway i. For each pathway i, a submatrix $A_i$ is selected from the standardized gene-cell matrix, whose columns are all N cells and whose rows are the $|G_i|$ genes of the pathway gene set $G_i$. SVD gives

$$A_i = U \Sigma V^T$$

wherein $U$ is a $|G_i| \times |G_i|$ orthogonal matrix, $\Sigma$ is a diagonal matrix whose entries are all zero except on the main diagonal, and $V^T$ is an $N \times N$ orthogonal matrix; the t-th column vector $v_t$ of the right orthogonal matrix $V$ represents the t-th principal component and reflects the co-expression variability of the pathway's genes in the single-cell data. Since the first principal component PC1 captures the largest variance, the projection of the features of cell j on PC1 represents the pathway activity score $s_{i,j}$ of pathway i in cell j. For cell j, all expression variances in pathway i are used as weights to adjust the original $s_{i,j}$. For gene g in pathway i, the gene expression $e_{g,j}$ is rescaled by min-max scaling to give the adjusted gene expression $\tilde{e}_{g,j}$.
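The normalization and SVD steps of step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the normalization formula is not reproduced in this text, so the sketch assumes the common form log(1 + count / cell total x 10,000); the function names `log_normalize` and `pathway_activity_scores`, the per-gene centering before SVD, and the scaling of the PC1 scores by the first singular value are likewise assumptions made for the example.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Library-size normalize a genes x cells count matrix, then take log1p.

    Assumed form (the patent's own formula is not reproduced):
    log(1 + count / cell_total * scale_factor).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * scale_factor)

def pathway_activity_scores(expr, gene_names, pathways):
    """Return {pathway name: PC1 score per cell} for a genes x cells matrix."""
    index = {gene: row for row, gene in enumerate(gene_names)}
    scores = {}
    for name, genes in pathways.items():
        rows = [index[g] for g in genes if g in index]
        if not rows:
            continue  # none of the pathway's genes were measured
        a_i = expr[rows, :]                          # A_i: pathway genes x N cells
        a_i = a_i - a_i.mean(axis=1, keepdims=True)  # center each gene (assumption)
        _, sing, vt = np.linalg.svd(a_i, full_matrices=False)
        scores[name] = sing[0] * vt[0]               # cell coordinates on PC1
    return scores

# Toy data: 3 genes x 3 cells, one hypothetical pathway containing g1 and g2.
counts = np.array([[1, 2, 3], [2, 4, 6], [7, 4, 1]])
expr = log_normalize(counts)
pas = pathway_activity_scores(expr, ["g1", "g2", "g3"], {"toy_pathway": ["g1", "g2"]})
```

The sign of a singular vector is arbitrary, so PC1 scores are only defined up to a global sign per pathway; a real pipeline would need a convention for orienting them.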

In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS;

the weighted PAS is obtained by a formula (not reproduced in this text) relating: the weighted PAS $\tilde{s}_{i,j}$; the rescaled normalized expression $\tilde{e}_{g,j}$ of gene g in cell j; and the pathway activity score $s_{i,j}$ of pathway i in cell j;

in one embodiment, $\tilde{e}_{g,j}$ is obtained by min-max scaling:

$$\tilde{e}_{g,j} = \frac{e_{g,j} - e_i^{\min}}{e_i^{\max} - e_i^{\min}}$$

wherein $e_{g,j}$ is the normalized expression of gene g in cell j, $e_i^{\max}$ is the maximum gene expression in pathway i, and $e_i^{\min}$ is the minimum gene expression in pathway i.
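The min-max rescaling of gene expression within a pathway, described above, can be illustrated with a short sketch. The function name `min_max_rescale` and the choice to return zeros when all values are equal are assumptions for the example; the text does not say how a zero range is handled.

```python
def min_max_rescale(values):
    """Rescale the expression values of one pathway's genes in one cell to [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Degenerate case (assumption): no spread, so every gene gets 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Normalized expression of three pathway genes in a single cell (toy values).
rescaled = min_max_rescale([2.0, 4.0, 6.0])
```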

Optionally, the machine learning method includes singular value decomposition (SVD). SVD greatly improves the computational efficiency of analyzing sparse matrices and can obtain the eigenvalues without computing the covariance matrix; singular value decomposition reduces the standardized gene-cell matrix to a pathway-cell matrix in a low-dimensional space.

103: acquiring genetic association data, annotating SNPs in the genetic association data into the path data based on the path data, and obtaining genetic effect values of all SNPs in single path data;

in one embodiment, the step of annotating the SNPs in the genetic association data into the pathway data comprises:

screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each gene to the corresponding pathways based on the pathway data, to obtain pathways annotated with SNPs;

optionally, the SNPs of a single gene are obtained by the following steps: after the SNPs of the genes in the genetic association data are obtained, assigning SNPs to genes to form SNP-gene pairs, giving an assignment result;

where a single SNP is assigned to several genes, treating each of the resulting duplicated SNP-gene pairs as an independent pair; retaining the SNPs in the assignment result whose minor allele frequency (MAF) is greater than 0.1; deleting SNPs on the sex chromosomes; thereby obtaining the SNPs of a single gene;

the SNPs of the single genes are collected to obtain the SNPs of all genes. Specifically, the genetic association data are GWAS data, and SNPs in the GWAS summary statistics are assigned to their related genes using a 20 kb window as the default parameter; a symbol (not reproduced in this text) denotes gene g with SNP k.

Through the assignment of SNP-gene pairs, a single SNP may correspond to several genes; since the whole process infers the parameters from thousands of SNPs, SNPs that correspond to several genes have no effect on the inference, so these duplicated assignments are treated as independent SNP-gene pairs; SNPs with a minor allele frequency (MAF) greater than 0.1 are retained and SNPs on the sex chromosomes are deleted, finally giving the SNPs of the related genes;

based on the pathways in the KEGG database, genes with their associated SNPs are annotated to the pathways, and the set of SNPs in pathway i is denoted $S_i$ (its defining formula is not reproduced in this text); linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using Phase 3 data of the 1000 Genomes Project; functional gene sets such as GO, Reactome and MSigDB are provided as alternative options. In addition, the major histocompatibility complex region (chr6: 25-35 Mbp), which has extensive LD, is deleted.
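The SNP filtering and annotation of step 103 can be sketched as below. The thresholds (20 kb window, MAF greater than 0.1, removal of sex chromosomes and of chr6: 25-35 Mbp) come from the text; the record types `Snp` and `Gene`, the function names and the toy coordinates are hypothetical, and the point at which the MHC region is removed relative to the other filters is an assumption.

```python
from collections import namedtuple

Snp = namedtuple("Snp", "rsid chrom pos maf")
Gene = namedtuple("Gene", "name chrom start end")

WINDOW = 20_000                      # 20 kb default window around each gene
MAF_MIN = 0.1                        # keep SNPs with MAF > 0.1
MHC = ("6", 25_000_000, 35_000_000)  # chr6: 25-35 Mbp, removed for extensive LD
SEX_CHROMS = {"X", "Y"}

def keep_snp(snp):
    """Apply the MAF, sex-chromosome and MHC filters to one SNP."""
    if snp.chrom in SEX_CHROMS or snp.maf <= MAF_MIN:
        return False
    in_mhc = snp.chrom == MHC[0] and MHC[1] <= snp.pos <= MHC[2]
    return not in_mhc

def snp_gene_pairs(snps, genes):
    """Pair each retained SNP with every gene whose window contains it.

    A SNP near several genes yields several pairs, each kept as an
    independent SNP-gene pair.
    """
    pairs = []
    for snp in filter(keep_snp, snps):
        for gene in genes:
            near = gene.start - WINDOW <= snp.pos <= gene.end + WINDOW
            if gene.chrom == snp.chrom and near:
                pairs.append((snp.rsid, gene.name))
    return pairs

def annotate_pathways(pairs, pathways):
    """Map each pathway (name -> set of genes) to its sorted SNP set S_i."""
    return {
        name: sorted({rsid for rsid, gene in pairs if gene in members})
        for name, members in pathways.items()
    }

genes = [Gene("A", "1", 100_000, 110_000), Gene("B", "1", 125_000, 130_000),
         Gene("C", "6", 30_000_000, 30_010_000)]
snps = [Snp("rs1", "1", 115_000, 0.3), Snp("rs2", "1", 105_000, 0.05),
        Snp("rs3", "X", 5, 0.4), Snp("rs4", "6", 30_005_000, 0.4),
        Snp("rs5", "1", 200_000, 0.2)]
pairs = snp_gene_pairs(snps, genes)
snp_sets = annotate_pathways(pairs, {"p1": {"A"}, "p2": {"B", "C"}})
```

In the toy data only rs1 survives with gene assignments: it lies within 20 kb of both A and B, so it contributes two independent pairs.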

In one embodiment, GWAS data has given a phenotype, and the phenotypic annotation of the given phenotype includes dichotomy, continuous dependency characteristics, or intra-phenotype and center measurements.

104: using a polygenic regression model of the genetic association data, estimating the parameters of the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data, to obtain estimation coefficients;

in one embodiment, the genetic effect value is obtained by a formula (not reproduced in this text), one term of which is the theoretical effect-size vector of the m SNPs;

in one embodiment, $S_i$ denotes the set of all SNPs located in the genes of pathway i; the polygenic model assumes a priori that the effect sizes of all SNPs in pathway i follow a multivariate normal distribution, $\beta_{S_i} \sim \mathrm{MVN}(0, \sigma_i^2 I_{|S_i|})$, wherein $\sigma_i^2$ is the variance of the SNP effect sizes in the pathway and $I_{|S_i|}$ is the $|S_i| \times |S_i|$ identity matrix;

in one embodiment, the estimation coefficient is obtained by:

$$\sigma_i^2 = \tau_0 + \tau_{i,j}\,\tilde{s}_{i,j}$$

wherein $\tau_{i,j}$ is the estimation coefficient of pathway i in cell j, which reflects the effect of the cell-specific PAS on the variance of the GWAS effect sizes, i.e. the genetic contribution; $\tau_0$ is the intercept term, $\sigma_i^2$ is the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ is the weighted PAS;

in one embodiment, based on the preceding assumptions, the distribution of the genetic effect value is estimated using a formula (not reproduced in this text); the estimation coefficients are optimized using this formula;

in one embodiment, to optimize the estimation coefficient of each pathway in the polygenic regression model, the model is optimized by a method-of-moments approach, which significantly improves computational efficiency and estimation; the observed and expected squared effects of the SNPs associated with each pathway are then fitted, and the expected value is estimated by a formula (not reproduced in this text) in which Tr denotes the trace of a matrix.

105: processing the estimated coefficient and PAS to obtain genetic related pathway activity score gPAS of the cell;

in one embodiment, the step of processing the estimation coefficients and the PAS to obtain the genetically related pathway activity score gPAS of the cell comprises: multiplying the estimation coefficients by the PAS and then summing, to obtain the gPAS of the cell;

the genetically related pathway activity score gPAS ($gP_j$) is obtained by:

$$gP_j = \sum_i \hat{\tau}_{i,j}\, s_{i,j}$$

wherein $gP_j$ is the gPAS of cell j, $s_{i,j}$ is the PAS of pathway i in cell j, and $\hat{\tau}_{i,j}$ is the optimized estimation coefficient;
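As a small illustration of step 105, the sketch below multiplies each pathway's estimation coefficient by its PAS and sums over pathways for every cell. The function name `gpas_per_cell` and the pathways-by-cells list layout are assumptions for the example; whether the raw or the weighted PAS enters the product is not settled by this text, so the caller passes whichever is intended.

```python
def gpas_per_cell(tau, pas):
    """Sum coefficient * PAS over pathways for each cell.

    tau and pas are pathways x cells nested lists of equal shape.
    """
    n_pathways, n_cells = len(pas), len(pas[0])
    return [
        sum(tau[i][j] * pas[i][j] for i in range(n_pathways))
        for j in range(n_cells)
    ]

# Two pathways (rows) and two cells (columns), toy values.
tau = [[1.0, 2.0], [0.5, 0.0]]
pas_matrix = [[2.0, 1.0], [4.0, 3.0]]
gpas = gpas_per_cell(tau, pas_matrix)
```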

106: performing correlation analysis between the gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes.

Optionally, the correlation analysis and ranking of the gPAS against the gene expression level of each cell comprises: determining the correlation between the expression of each gene and the total gPAS score by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes; specifically, to maximize power, the expression of each gene g is inversely weighted by its gene-specific technical noise level, which is estimated by modeling the mean-variance relationship across genes in the scRNA-seq data;

optionally, the N trait-related genes are the first 1000 or the last 1000 genes when the genes are sorted by correlation in descending or ascending order; N is not limited to 1000 and may be any natural number.
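The ranking in step 106 can be sketched as follows: compute the Pearson correlation between each gene's expression across cells and the per-cell gPAS, then keep the top N genes. The function names and toy data are hypothetical, and the inverse weighting by technical noise mentioned above is omitted from this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as 0 (assumption).
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

def top_trait_genes(expr, gpas, n_top=1000):
    """Rank genes by correlation with gPAS (descending) and keep the first n_top."""
    ranked = sorted(expr, key=lambda gene: pearson(expr[gene], gpas), reverse=True)
    return ranked[:n_top]

cell_gpas = [1.0, 2.0, 3.0, 4.0]
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],   # perfectly correlated
    "G2": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
    "G3": [1.0, 1.0, 1.0, 1.0],   # constant
    "G4": [1.0, 3.0, 2.0, 4.0],   # correlation 0.8
}
top_genes = top_trait_genes(expression, cell_gpas, n_top=2)
```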

In one embodiment, the method further comprises: calculating a trait-related score (TRS) for each cell from the N trait-related genes; optionally, the TRS of the N genes is calculated using the cell scoring method of the AddModuleScore function in Seurat; the TRS of each cell is obtained as: TRS = average RE(GS) - average RE(CG); wherein average RE(GS) is the average relative expression of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression of a control gene set of the same size drawn at random from the existing gene library; RE is relative expression; GS is the gene set; CG is the control gene set;
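The TRS expression above, TRS = average RE(GS) - average RE(CG), can be illustrated for a single cell. This sketch is a simplification and not Seurat's AddModuleScore: it draws the control genes uniformly at random from the remaining genes, whereas Seurat selects controls from expression-matched bins. The function names and the `seed` parameter are assumptions.

```python
import random

def mean_expression(cell_expr, genes):
    """Average relative expression of the given genes in one cell."""
    return sum(cell_expr[g] for g in genes) / len(genes)

def trait_relevance_score(cell_expr, trait_genes, seed=0):
    """TRS = average RE(GS) - average RE(CG) for one cell.

    cell_expr maps every gene to its relative expression in the cell.
    Control genes are sampled uniformly from the non-trait genes (assumption).
    """
    trait = set(trait_genes)
    pool = sorted(g for g in cell_expr if g not in trait)
    control = random.Random(seed).sample(pool, len(trait))
    return mean_expression(cell_expr, trait) - mean_expression(cell_expr, control)

# Toy cell: trait genes at 3.0, all background genes at 1.0.
cell = {"T1": 3.0, "T2": 3.0, "B1": 1.0, "B2": 1.0, "B3": 1.0}
trs = trait_relevance_score(cell, ["T1", "T2"])
```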

in one embodiment, the statistical significance of each individual cell is determined by computing the rank distribution of the trait-related genes, to further evaluate whether the cell is significantly associated with the trait of interest; specifically, the percent rank of each trait-related gene in the cell is obtained by a formula (not reproduced in this text) that involves the expression level of gene g in cell j and G, the number of genes associated with the specified trait; the gene percent ranks follow a normal distribution (its parameters are not reproduced in this text); under the null hypothesis that the gene percent ranks show no association, a test statistic is obtained for each cell (formula not reproduced in this text).

Given the large number of cells in single-cell data, the central limit theorem is used to derive the distribution of the test statistic (formula not reproduced in this text), wherein N is the total number of cells; the hypotheses of the significance test and the P value of each cell j are likewise given by formulas not reproduced in this text.

an application, the application comprising any one of:

obtaining trait-related cells based on the trait-related score TRS of each cell and the cell-level P value of each cell (indicating whether the individual cell is associated with the trait);

alternatively, obtaining trait-related cell types or subpopulations by the block bootstrap method (determining whether the cell type to which individual cells belong is associated with the trait); specifically, the cells of a given cell type are treated as a pseudo-bulk transcriptome profile, and gene expression is averaged across the cells within that cell type; for the cell-type association, the standard error is estimated with the block bootstrap method and a t statistic is calculated for each cell type to obtain the corresponding P value. Because the bootstrap aims to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to divide the genome into multiple biologically meaningful blocks, and these pathway-based blocks are sampled with replacement. By default, 200 block bootstrap iterations are performed for each cell-type association analysis; this default can be changed in a specific implementation.

Optionally, the genetic related pathway activity scores gPAS are sorted, and a trait-related pathway is obtained from the sorting result (the top-ranked pathways are selected) and the P value of the pathway at the cell-type level. Specifically, the gPAS are ranked based on the central limit theorem. Using the symbol [symbol] to denote cell type t, the percent pathway rank of each cell j within [symbol] is calculated as: [formula], where [symbol] is the gPAS rank of pathway i in cell j and M is the total number of pathways. Similarly, the statistical significance of each pathway i in cell type t is calculated using the following formula: [formula], where [formula]. The hypothesis is: [formula]. The P value for each pathway i in cell type t is: [formula].
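The ranking and significance test for pathways can be illustrated with the following Python sketch. The exact formulas are not available in this text, so the null model here is an assumption: under the null hypothesis, the percent rank of a pathway in a cell is treated as uniform on {1/M, ..., M/M} and independent across cells, so that its mean over the n cells of a cell type is approximately normal by the central limit theorem. The function name `pathway_pvalues` and the one-sided test are likewise assumptions.

```python
import math
import numpy as np

def pathway_pvalues(gpas, cell_mask):
    """Percent-rank pathways within each cell and test each pathway in one cell type.

    gpas: pathways x cells matrix of gPAS values
    cell_mask: boolean vector selecting the cells of cell type t
    """
    gpas = np.asarray(gpas, dtype=float)
    M = gpas.shape[0]                                  # total number of pathways
    # Rank pathways within each cell (1 = lowest gPAS); ties are broken arbitrarily.
    ranks = gpas.argsort(axis=0).argsort(axis=0) + 1
    pct = ranks / M                                    # percent rank in (0, 1]

    sub = pct[:, cell_mask]
    n = sub.shape[1]                                   # number of cells in type t
    mu0 = (M + 1) / (2 * M)                            # null mean of a percent rank
    var0 = (M ** 2 - 1) / (12 * M ** 2)                # null variance of a percent rank
    z = (sub.mean(axis=1) - mu0) / math.sqrt(var0 / n)
    # One-sided upper-tail P value from the standard normal distribution.
    p = np.array([0.5 * math.erfc(v / math.sqrt(2)) for v in z])
    return pct, z, p

gpas = np.array([[3.0, 3.0, 3.0, 3.0],
                 [2.0, 1.0, 2.0, 1.0],
                 [1.0, 2.0, 1.0, 2.0]])
pct, z, p = pathway_pvalues(gpas, np.array([True, True, True, True]))
```

In this toy input the first pathway has the highest gPAS in every cell, so it receives the smallest P value.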

FIG. 2 is a schematic diagram of a screening device for trait-related genes provided by an embodiment of the invention, the device comprising: a memory and a processor;

the memory is used for storing program instructions; the processor is used for calling program instructions, and when the program instructions are executed, the screening method of the character related genes is executed.

FIG. 3 is a schematic diagram of a screening system for trait-related genes provided by an embodiment of the invention, the system comprising:

an acquisition unit 301 for acquiring single cell sequencing data;

a first processing unit 302, configured to process single-cell sequencing data and pathway data by using a machine learning method, so as to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;

a second processing unit 303, configured to obtain genetic association data, and annotate SNPs in the genetic association data into the pathway data based on the pathway data, so as to obtain genetic effect values of all SNPs in the single pathway data;

a third processing unit 304, configured to perform parameter estimation on the distribution of genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient;

a fourth processing unit 305 for processing the estimation coefficient and PAS to obtain a genetic related pathway activity score gPAS of the cell;

and a fifth processing unit 306, configured to perform correlation analysis between the genetic related pathway activity score gPAS and the gene expression level of each cell, sort the results, and screen out N trait-related genes.
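The work of the fifth processing unit can be sketched in Python as a Pearson correlation between each gene's expression and the per-cell gPAS, followed by sorting. This is a minimal sketch under stated assumptions: the function name `top_trait_genes`, the genes x cells layout of `expr`, and the choice to assign a correlation of 0 to genes with constant expression are all illustrative and are not taken from the patent.

```python
import numpy as np

def top_trait_genes(expr, gpas, n_top=1000):
    """Rank genes by Pearson correlation between expression and gPAS.

    expr: genes x cells expression matrix
    gpas: vector with one gPAS value per cell
    n_top: number of genes to keep (1000 is the default named in the text)
    """
    expr = np.asarray(expr, dtype=float)
    gpas = np.asarray(gpas, dtype=float)

    xc = expr - expr.mean(axis=1, keepdims=True)   # center each gene across cells
    yc = gpas - gpas.mean()                        # center gPAS across cells
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (xc @ yc) / denom
    r = np.nan_to_num(r)                           # constant genes get r = 0

    order = np.argsort(-r, kind="stable")[:n_top]  # descending correlation
    return order, r

expr = [[1, 2, 3, 4],    # rises with gPAS
        [4, 3, 2, 1],    # falls with gPAS
        [1, 1, 1, 1]]    # constant
order, r = top_trait_genes(expr, [0.1, 0.2, 0.3, 0.4], n_top=2)
```

Sorting in ascending order instead would select the most negatively correlated genes.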

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described trait-related-gene screening method.

FIG. 4 is a schematic overview of the single-cell pathway-based scoring method provided by an embodiment of the invention, in which gPAS is obtained and used to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations, and trait-related pathways;

wherein A represents converting the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 representing the PAS of each pathway; B represents annotating SNPs in the GWAS data into the corresponding pathways; C represents the polygenic regression model, wherein the top panel shows the coefficients estimated for each pathway using the polygenic regression model, with the gPAS calculated from the estimated coefficients and the corresponding PAS, and the bottom panel shows the Pearson correlation model, which correlates the gPAS of each cell with the expression of genes across all individual cells to rank the trait-related genes; the top N trait-related genes (top 1000 by default) are then scored with the AddModuleScore function in Seurat to calculate the trait-related score TRS of each cell; D represents the output, comprising four outputs: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.
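Panels A and C can be illustrated with a short Python sketch: the first principal component of the genes in one pathway gives one PAS per cell, and the gPAS of a cell is the sum over pathways of the estimated coefficient multiplied by the PAS. The function names, the per-gene centering, and the matrix layouts are assumptions for illustration; the estimated coefficients `tau` are taken as given rather than fitted here. Note that the sign of a principal component is arbitrary.

```python
import numpy as np

def pathway_activity(expr, pathway_gene_idx):
    """PC1 score of each cell for the genes of one pathway (one PAS per cell).

    expr: genes x cells normalized expression matrix
    pathway_gene_idx: row indices of the genes belonging to the pathway
    """
    sub = np.asarray(expr, dtype=float)[pathway_gene_idx, :]
    sub = sub - sub.mean(axis=1, keepdims=True)   # center each gene across cells
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    return s[0] * vt[0]                           # cell scores on the first component

def gpas_per_cell(tau, pas):
    """gPAS of cell j as the sum over pathways i of tau[i, j] * pas[i, j].

    tau: pathways x cells estimated coefficients (assumed already estimated)
    pas: pathways x cells pathway activity scores
    """
    return (np.asarray(tau, dtype=float) * np.asarray(pas, dtype=float)).sum(axis=0)

expr = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0]])
pas_one_pathway = pathway_activity(expr, [0, 1])            # three values, one per cell
gpas = gpas_per_cell([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Applying `pathway_activity` to every pathway and stacking the results row by row yields the pathway-cell matrix that `gpas_per_cell` expects.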

The results of this verification embodiment show that assigning an inherent weight to an indication may moderately improve the performance of the method relative to the default settings.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.

Although the computer device provided by the present invention has been described in detail above, those skilled in the art will appreciate that the foregoing description is not intended to limit the invention; the scope of the invention is defined by the appended claims.

Claims

1. A method for screening a trait-related gene, comprising:

acquiring single cell sequencing data and pathway data;

acquiring genetic association data, annotating SNPs in the genetic association data into the pathway data based on the pathway data, and obtaining genetic effect values of all SNPs in single pathway data; the annotating SNPs in the genetic association data into pathway data comprises: screening from the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene into corresponding pathways based on the pathway data to obtain pathways with SNP annotation;

carrying out parameter estimation on the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data by utilizing a polygenic regression model of the genetic association data to obtain an estimation coefficient; the genetic effect value is obtained by the following formula:

[formula]; wherein [symbol] represents the theoretical effect size vector of m SNPs, [...]

2. The method for screening a gene related to a trait according to claim 1,

wherein the estimation coefficient is obtained by the following formula:

[formula]

wherein [symbol] represents the estimated coefficient of pathway i in cell j, [symbol] represents the intercept term, [symbol] represents the variance of the SNP effect sizes in the pathway, and [symbol] represents the weighted PAS.

3. The method according to claim 1, wherein the step of processing the estimation coefficient and the PAS to obtain a genetic related pathway activity score gPAS of the cell comprises: multiplying the estimated coefficient by the PAS and then summing to obtain the genetic related pathway activity score gPAS of the cell.

4. The method for screening a gene related to a trait according to claim 3,

wherein the genetic related pathway activity score gPAS is obtained by the following formula:

[formula]

wherein gP_j is the genetic related pathway activity score gPAS, [symbol] is the optimized estimation coefficient, and [symbol] represents the weighted PAS.

5. The method according to claim 1, wherein the step of converting the normalized gene-cell matrix into a pathway-cell matrix using a machine learning method and obtaining the PAS scoring matrix of the cell pathway, the PAS scoring matrix comprising the pathway activity scores PAS of individual cells in a single pathway, comprises: optimizing the pathway activity scores PAS to obtain weighted PAS;

the weighted PAS is obtained by the following formula:

[formula]

wherein [symbol] represents the weighted PAS, [symbol] represents the optimized normalized expression of gene g in cell j, and [symbol] represents the pathway activity score PAS of pathway i in cell j.

6. The method for screening a gene related to a trait according to claim 5, wherein [symbol] is obtained by the following formula:

[formula]

wherein [symbol] represents the normalized expression of gene g in cell j, [symbol] represents the maximum value of gene expression in pathway i of cell j, and [symbol] represents the minimum value of gene expression in pathway i of cell j.

7. The method for screening a gene related to a trait according to claim 1, wherein the method for machine learning comprises a method for Singular Value Decomposition (SVD).

8. The method for screening a gene related to a trait according to claim 1, wherein the step of obtaining SNPs of the single gene comprises: after the SNPs of genes in the genetic association data are obtained, assigning the SNPs to genes as SNP-gene pairs to obtain an assignment result;

where a single SNP corresponds to a plurality of genes, treating each of the resulting SNP-gene pairs as an independent SNP-gene association; retaining SNPs with a minor allele frequency greater than 0.1 in the assignment result; deleting SNPs on the sex chromosomes; and thereby obtaining the SNPs of the single gene.

9. The method for screening a gene related to a trait according to claim 1, wherein the method further comprises: calculating the trait-related score TRS of each cell according to the N trait-related genes.

10. The method for screening a gene related to a trait according to claim 9,

calculating the trait related score TRS of the N trait related genes by using a cell scoring method.

11. The method for screening a gene related to a trait according to claim 1,

the correlation analysis and sorting of the genetic related pathway activity score gPAS and the gene expression level of each cell comprises: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient, and ranking the genes according to the correlation to obtain the N trait-related genes.

12. The method for screening a gene related to a trait according to claim 11,

the N trait-related genes are the first 1000 or the last 1000 trait-related genes ranked in descending or ascending order of correlation.

13. Use of a method according to any one of claims 1-12, comprising any one of the following:

obtaining a trait-related cell type or subpopulation based on the block bootstrap method;

sorting the genetic related pathway activity scores gPAS, and obtaining the trait-related pathway according to the sorting result and the P value of the pathway at the cell-type level.

14. A screening apparatus for a trait-related gene, the apparatus comprising: a memory and a processor;

the memory is used for storing program instructions; the processor is configured to invoke program instructions, which when executed, are configured to perform the screening method of the trait-related gene of any one of claims 1-12.

15. A screening system for a trait-related gene, comprising:

an acquisition unit for acquiring single cell sequencing data and pathway data;

the first processing unit is used for carrying out standardization processing on the gene-cell matrix in the single-cell sequencing data to obtain the standardized gene-cell matrix; based on the pathway data, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scores PAS of single cells in a single pathway;

the second processing unit is used for acquiring genetic association data, annotating SNPs in the genetic association data into the pathway data based on the pathway data to obtain genetic effect values of all SNPs in single pathway data, wherein the step of annotating the SNPs in the genetic association data into the pathway data comprises: screening from the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene into corresponding pathways based on the pathway data to obtain pathways with SNP annotation;

the third processing unit is configured to perform parameter estimation on the distribution of the genetic effect values based on the PAS and the genetic effect value of each SNP in the pathway data by using a polygenic regression model of the genetic association data, to obtain an estimation coefficient, wherein the genetic effect value is obtained by the following formula:

[formula]; wherein [symbol] represents the theoretical effect size vector of m SNPs, [...]

16. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the trait-related gene screening method of any one of claims 1 to 11.