CN116453597A - Space variable gene identification method and system for space transcriptome data

Info

Publication number: CN116453597A
Application number: CN202310369928.6A
Authority: CN
Inventors: 俞章盛; 袁欣; 马嫣然
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2023-04-07
Filing date: 2023-04-07
Publication date: 2023-07-18

Abstract

The invention relates to a spatially variable gene identification method for spatial transcriptome data, comprising: performing data conversion and feature extraction on the original data by a half-pooling method; performing a stability test on the output data obtained by the half-pooling treatment; and combining the results of the stability tests to identify spatially variable genes. Compared with the prior art, the method offers high identification accuracy and high calculation speed.

Description

Space variable gene identification method and system for space transcriptome data

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a spatially variable gene identification method and system for spatial transcriptome data.

Background

The rapid development of spatial transcriptomics technology has driven research on tissue structure reconstruction, development and disease, and large-scale spatial transcriptomics studies are becoming common. One important problem unique to spatial transcriptomics analysis is the identification of spatially variable genes. A spatially variable gene is a gene whose expression shows a spatial pattern across the tissue. In terms of the data, the expression counts of a spatially variable gene depend on spatial position.

Traditional spatial statistical models often fail when faced with spatial transcriptome data, which are large in volume, complex in structure, high-dimensional and sparse. A spatially variable gene identification method adapted to the characteristics of spatial transcriptome data therefore needs to be developed.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a space variable gene identification method and a system for space transcriptome data, which have high identification accuracy and high calculation speed.

The aim of the invention can be achieved by the following technical scheme:

According to a first aspect of the present invention there is provided a spatially variable gene identification method for spatial transcriptomics data, the method comprising the steps of:

S1, performing half-pooling treatment on the original gene expression data of each gene;

S2, performing a stability test on the output data after half-pooling treatment;

S3, performing a combined test on the plurality of stability test results;

S4, judging whether the gene is a spatially variable gene according to the combined test result.

Preferably, the half-pooling treatment in the step S1 specifically includes: respectively carrying out average value calculation on the space transcriptome data according to the given K groups of half-pooling parameters, and rearranging the obtained output data into a one-dimensional sequence according to the space position; the half-pooling parameters comprise a direction parameter and a step size parameter.

Preferably, the half-pooling process includes four different sets of half-pooling parameters, respectively:

1) The direction is: row direction, step size: $n_{row}$;

2) The direction is: row direction, step size:

3) The direction is: column direction, step size: $n_{col}$;

4) The direction is: column direction, step size:

wherein $n_{col}$ is the number of columns contained in the spatial transcriptome data, $n_{row}$ is the number of rows contained in the spatial transcriptome data, and $[\cdot]$ denotes rounding to an integer.
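The half-pooling step can be illustrated with a minimal Python sketch. It is only one possible reading of the description, not the patented implementation: the function name `half_pool`, the use of `None` for grid positions without a spot, the non-overlapping windows, the skipping of windows that contain no spot, and the line-by-line ordering of the output are all assumptions made for illustration.

```python
from statistics import fmean


def half_pool(grid, direction, step):
    """Average non-overlapping windows of `step` spots along one axis.

    grid: list of rows, each a list of expression values (None = no spot).
    direction: "row" pools along the row index (down each column);
               "column" pools along the column index (across each row).
    Returns the window means as one flat list, ordered line by line.
    """
    if direction == "row":
        # Transpose so that each line is one column of the grid.
        lines = [list(col) for col in zip(*grid)]
    elif direction == "column":
        lines = [list(row) for row in grid]
    else:
        raise ValueError("direction must be 'row' or 'column'")

    pooled = []
    for line in lines:
        for start in range(0, len(line), step):
            window = [v for v in line[start:start + step] if v is not None]
            if window:  # assumption: windows without any spot are skipped
                pooled.append(fmean(window))
    return pooled


# Toy 4 x 6 grid whose entry at (r, c) is r * 6 + c.
toy = [[r * 6 + c for c in range(6)] for r in range(4)]
row_full = half_pool(toy, "row", 4)        # one mean per column
col_full = half_pool(toy, "column", 6)     # one mean per row
row_half = half_pool(toy, "row", 2)        # two means per column
```

Under these assumptions, a 78 x 128 grid pooled with `half_pool(grid, "row", 78)` yields 128 values and with `half_pool(grid, "column", 128)` yields 78 values, which agrees with the first and third sequence lengths reported in the embodiment.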

Preferably, the stability test in step S2 is a Box-Pierce test, applied separately to the output data obtained with each set of half-pooling parameters.

Preferably, the parameter setting in the Box-Pierce test includes: maximum lag order $m = [\ln(T)]$, wherein $T$ is the length of the output data after half-pooling treatment, and $[\cdot]$ denotes rounding to an integer.

Preferably, the combined test in step S3 adopts the Stouffer combination method, calculated as follows:

$$z_{stouffer} = \frac{\sum_{k=1}^{K} \Phi^{-1}(1 - p_k)}{\sqrt{K}} \sim N(0, 1)$$

wherein $p_k$ is the P value of the $k$-th stability test, $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, $K$ is the number of sets of half-pooling parameters, and $N(0, 1)$ is the standard normal distribution.

Preferably, step S4 further includes correcting the combined test result by the Holm method.

According to a second aspect of the present invention there is provided a spatially variable gene recognition system based on spatial transcriptomics data, the system comprising:

the half-pooling processing module is used for performing half-pooling processing on the original gene expression data of each gene;

the stability checking module is used for performing stability checking on the output data after half-pooling treatment;

the combined test module is used for carrying out combined test on a plurality of stability test results;

and the spatially variable gene judging module is used for judging whether each gene is a spatially variable gene according to the combined test result.
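How the four modules fit together can be sketched as a small Python function that chains them. This is a hypothetical skeleton, not the system's actual code: the function name, the callable parameters `pool`, `stability_test`, `combine` and `adjust`, and the 0.05 threshold default are illustrative assumptions, and the stand-in callables in the example exist only to show the data flow.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def identify_spatially_variable_genes(
    expression: Dict[str, List[List[float]]],
    pooling_params: Sequence[Tuple[str, int]],
    pool: Callable,            # half-pooling processing module
    stability_test: Callable,  # stability test module, returns a P value
    combine: Callable,         # combined test module, returns one P value
    adjust: Callable,          # multiple-testing correction over all genes
    alpha: float = 0.05,
) -> Dict[str, bool]:
    """Run the four modules in order and flag genes below `alpha`."""
    genes = list(expression)
    combined = []
    for gene in genes:
        # One stability-test P value per set of half-pooling parameters.
        p_values = [
            stability_test(pool(expression[gene], direction, step))
            for direction, step in pooling_params
        ]
        combined.append(combine(p_values))
    adjusted = adjust(combined)
    return {gene: p < alpha for gene, p in zip(genes, adjusted)}


# Stand-in modules, used only to exercise the data flow.
flags = identify_spatially_variable_genes(
    expression={"g1": [[1, 2], [3, 4]], "g2": [[1, 1], [1, 1]]},
    pooling_params=[("row", 2), ("column", 2)],
    pool=lambda grid, direction, step: [v for row in grid for v in row],
    stability_test=lambda seq: 0.001 if len(set(seq)) > 1 else 0.9,
    combine=max,
    adjust=lambda ps: [min(1.0, p * len(ps)) for p in ps],
)
```

Passing the modules as callables keeps each stage replaceable, which mirrors the modular structure of the second aspect.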

According to a third aspect of the present invention there is provided an electronic device comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method of any one of the above when executing the program.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any one of the above.

Compared with the prior art, the invention has the following advantages:

1) According to the invention, the data conversion and the feature extraction are carried out on the original data by a half-pooling method, the stability test is carried out on the output data obtained by the half-pooling treatment, and the combination test is carried out on the stability test result, so that the space variable genes are identified, and the method has the advantages of high identification accuracy and high calculation speed;

2) The invention adopts a half-pooling method with direction and step size parameters to perform data conversion and feature extraction, which makes it suitable for large-scale spatial transcriptome data that are large in volume, complex in structure, high-dimensional and sparse;

3) The stability test is carried out on the output data after half-pooling treatment by using Box-Pierce test, so that the accuracy is high;

4) A Stouffer combination method is adopted to carry out combination test on a plurality of stability test results, so that the accuracy of the test results is improved;

5) The P value of the combined test is corrected by the Holm method, so that the false positive rate can be effectively controlled and the identification accuracy is improved.

Drawings

FIG. 1 is a flow chart of a spatially variable gene recognition method of the present invention.

FIG. 2 is a schematic diagram showing the implementation of the half-pooling process step of the present invention.

FIG. 3 is a partial schematic of the results of an embodiment of the present invention, showing the 20 top-ranked spatially variable genes identified in the embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.

Examples

The invention relates to a spatially variable gene identification method for large-scale spatial transcriptome data, which aims to identify spatially variable genes in such data and comprises the following steps:

S1, performing half-pooling treatment on the original gene expression data of each gene;

S2, performing a stability test on the output data after half-pooling treatment;

S3, performing a combined test on the plurality of stability test results;

S4, judging whether the gene is a spatially variable gene according to the combined test result.

The specific implementation of the method of this embodiment is described in detail below.

This example uses spatial transcriptome data of a colorectal cancer tissue, which is a freely available public dataset (http://www.cancerdiversity.asia/scclm/).

1. Dataset preprocessing

Genes that were not expressed or were expressed at low levels were filtered out of the original dataset. The filtering criterion used in this example was: genes expressed in fewer than 1% of all spots were removed. The filtered dataset included 15427 genes and 4124 spots, arranged in 78 rows and 128 columns.
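The filtering criterion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `filter_low_expression`, the dictionary input format, and the reading of "expressed" as a count greater than zero are not specified in the text.

```python
def filter_low_expression(counts, min_fraction=0.01):
    """Keep genes expressed in at least `min_fraction` of the spots.

    counts: dict mapping gene name -> list of per-spot expression counts.
    A spot is treated as expressing the gene when its count is > 0
    (an assumption; the text does not define the expression threshold).
    """
    kept = {}
    for gene, values in counts.items():
        expressed = sum(1 for v in values if v > 0)
        if values and expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Toy data with 200 spots per gene.
toy_counts = {
    "never": [0] * 200,              # expressed in 0% of spots
    "rare": [0] * 199 + [1],         # expressed in 0.5% of spots
    "some": [1] * 3 + [0] * 197,     # expressed in 1.5% of spots
    "always": [5] * 200,             # expressed in 100% of spots
}
filtered = filter_low_expression(toy_counts)
```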

2. Half-pooling process

The average value of the expression data of each gene in space is calculated according to the given direction and step size parameters. The four specific sets of parameters are as follows:

1) The direction is: row direction, step size: 78;

2) The direction is: row direction, step size:

3) The direction is: column direction, step size: 128;

4) The direction is: column direction, step size:

wherein $[\cdot]$ denotes rounding to an integer; a schematic diagram of the half-pooling treatment is shown in FIG. 2.

3. Stability test

According to the four given sets of half-pooling parameters, half-pooling each gene yields four new output sequences, with lengths 128, 584, 78 and 390, respectively. Each half-pooled output sequence $r = (r_1, \ldots, r_t, \ldots, r_T)^{\top}$ is subjected to a stability test, namely the Box-Pierce test, which tests for autocorrelation in sequence data. The test statistic $Q_m$ follows a $\chi^2$ distribution with $m$ degrees of freedom and is calculated by:

$$Q_m = T \sum_{k=1}^{m} \hat{\rho}_k^2, \qquad \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0}, \qquad \hat{\gamma}_k = \frac{1}{T} \sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})$$

wherein $\hat{\rho}_k$ represents the autocorrelation coefficient at lag $k$, $\hat{\gamma}_k$ represents the autocovariance at lag $k$, $r = (r_1, \ldots, r_t, \ldots, r_T)^{\top}$ is the data output after half-pooling treatment, $\bar{r}$ is the mean value of $r$, $m = [\ln(T)]$, $T$ is the length of the output sequence after half-pooling treatment, and $[\cdot]$ denotes rounding to an integer.

The degrees of freedom of the Box-Pierce tests for the four output sequences are 4, 5, 6 and 6. A stability test is performed on each of the four half-pooled output sequences of each gene to obtain the corresponding P value.
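The Box-Pierce test described in this section can be sketched with the Python standard library. This is an illustrative sketch, not the embodiment's code: the function names `chi2_sf` and `box_pierce` are hypothetical, `[·]` is assumed to mean rounding to the nearest integer, and the chi-square tail probability uses the standard closed form for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{j=0}^{df/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_j (x/2)^(j-1/2) / Gamma(j+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return total


def box_pierce(series, max_lag=None):
    """Return (Q_m, P value, m) for the Box-Pierce test on `series`."""
    T = len(series)
    if max_lag is None:
        # Assumption: [ln(T)] is rounding to the nearest integer.
        max_lag = max(1, round(math.log(T)))
    mean = sum(series) / T
    dev = [v - mean for v in series]
    gamma0 = sum(d * d for d in dev) / T
    if gamma0 == 0:
        return 0.0, 1.0, max_lag  # constant series: no autocorrelation signal
    q = 0.0
    for k in range(1, max_lag + 1):
        gamma_k = sum(dev[t] * dev[t - k] for t in range(k, T)) / T
        q += (gamma_k / gamma0) ** 2
    q *= T
    return q, chi2_sf(q, max_lag), max_lag


# A smooth, strongly autocorrelated toy sequence of length 128.
smooth = [math.sin(t / 10) for t in range(128)]
q_stat, p_value, lag = box_pierce(smooth)
lags = [round(math.log(T)) for T in (128, 584, 78, 390)]
```

With nearest-integer rounding, the sequence lengths 128, 584, 78 and 390 give lag orders 5, 6, 4 and 6, which is the same set of values as the degrees of freedom listed in this section.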

4. Combination test

The P values of the 4 stability tests of each gene are combined using the Stouffer combination method, which converts the P values of a plurality of independent hypothesis tests into a single P value. Assuming a total of $h$ P values, the combination is calculated as follows:

$$z_{stouffer} = \frac{\sum_{i=1}^{h} \Phi^{-1}(1 - p_i)}{\sqrt{h}} \sim N(0, 1)$$

wherein $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, in the specific form:

$$\Phi^{-1}(x) = \sqrt{2}\, \mathrm{erf}^{-1}(2x - 1)$$

wherein $\mathrm{erf}^{-1}(x)$ is the inverse of the error function, defined as the number $y$ such that $\mathrm{erf}(y) = x$. The inverse of the error function has no simple analytical formula and is typically calculated numerically.

The P value of the combined test is calculated as follows:

$$p_c = 1 - \Phi(z_{stouffer})$$
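The combined test can be sketched in Python using `statistics.NormalDist` for the inverse normal CDF. This is a hedged illustration of the standard unweighted Stouffer method: the function name `stouffer_combine` and the clipping bounds for P values of exactly 0 or 1 are assumptions added so that the inverse CDF stays finite.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def stouffer_combine(p_values, floor=1e-300):
    """Combine P values; returns (z_stouffer, combined P value)."""
    z_scores = []
    for p in p_values:
        # Assumption: clip to the open interval (0, 1) before inverting.
        p = min(max(p, floor), 1.0 - 1e-16)
        # By symmetry, Phi^-1(1 - p) = -Phi^-1(p); this form keeps
        # precision for very small p, where 1 - p would round to 1.
        z_scores.append(-_STD_NORMAL.inv_cdf(p))
    z = sum(z_scores) / math.sqrt(len(z_scores))
    # 1 - Phi(z), written with erfc for accuracy in the upper tail.
    p_combined = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_combined


z_null, p_null = stouffer_combine([0.5, 0.5, 0.5, 0.5])
z_mod, p_mod = stouffer_combine([0.05, 0.05, 0.05, 0.05])
z_ext, p_ext = stouffer_combine([1e-17, 1e-17, 1e-17, 1e-17])
```

When all four input P values are far below machine precision, the combined P value becomes so small that it may be reported as 0 in double-precision arithmetic.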

5. P value correction

To control the false positive rate, the Holm method is used to correct the P values of the combined test. Genes whose Holm-corrected P value is less than 0.05 are considered spatially variable genes. The Holm method is a commonly used multiple comparison correction method for controlling the error rate. Specifically, all P values are first sorted from smallest to largest: $p_{(1)}, p_{(2)}, \ldots, p_{(rank)}, \ldots, p_{(P)}$, and a correction factor is calculated for each sorted P value: correction factor$_{(rank)}$ = P - rank + 1. The corrected P value is:

In this example, 8020 spatially variable genes were identified. The top 20 genes are shown in FIG. 3. These genes visibly show clear spatial expression patterns, indicating that this example can effectively identify genes with spatial expression patterns.

Taking gene B2M as an example, half-pooling its spatial expression data according to the four given sets of parameters yields four new output sequences. A stability test on each output sequence gives P values of <2.2e-16, <2.2e-16, <2.2e-16 and <2.2e-16, respectively. The P value of the combined test is 0 and the corrected P value is 0, so B2M is a spatially variable gene. As can also be seen in FIG. 3, the expression of B2M has a distinct spatial pattern.
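The Holm correction of step 5 can be sketched in Python as below. The correction factor (number of P values minus rank plus one) is taken from the text; the running maximum that keeps corrected values monotone and the cap at 1 follow the standard Holm step-down procedure and are assumptions here. The function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-corrected P values in the original input order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        factor = n - rank + 1  # correction factor for this rank
        # Assumption: enforce monotonicity with a running maximum.
        running_max = max(running_max, factor * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
corrected = holm_adjust(raw)
# Genes whose corrected P value is below 0.05 would be called spatially variable.
significant = [p < 0.05 for p in corrected]
```

For the toy input, the sorted P values 0.005, 0.01, 0.03 and 0.04 are multiplied by 4, 3, 2 and 1, giving 0.02, 0.03, 0.06 and 0.04; the running maximum raises the last value to 0.06.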

The electronic device of the present invention includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) or computer program instructions loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device can also be stored. The CPU, ROM and RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.

A plurality of components in a device are connected to an I/O interface, comprising: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communication unit allows the device to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processing unit performs the respective methods and processes described above, for example, the methods S1 to S4. For example, in some embodiments, methods S1-S4 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of the methods S1 to S4 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform methods S1-S4 by any other suitable means (e.g., by means of firmware).

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), etc.

Program code for carrying out the methods of the present invention may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various equivalent changes and substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

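1. A method for spatially variable gene identification for spatial transcriptomics data, the method comprising the steps of:

S1, performing half-pooling treatment on the original gene expression data of each gene;

S2, performing a stability test on the output data after half-pooling treatment;

S3, performing a combined test on the plurality of stability test results;

S4, judging whether the gene is a spatially variable gene according to the combined test result.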

2. The method for spatially-variable gene identification of spatial transcriptomics data according to claim 1, wherein the half-pooling process in step S1 is specifically: respectively carrying out average value calculation on the space transcriptome data according to the given K groups of half-pooling parameters, and rearranging the obtained output data into a one-dimensional sequence according to the space position; the half-pooling parameters comprise a direction parameter and a step size parameter.

3. The method for spatially-variable gene identification of spatial transcriptomics data of claim 2, wherein the half-pooling process comprises four different sets of half-pooling parameters, respectively:

1) The direction is: row direction, step size: $n_{row}$;

2) The direction is: row direction, step size:

3) The direction is: column direction, step size: $n_{col}$;

4) The direction is: column direction, step size:

wherein $n_{col}$ is the number of columns contained in the spatial transcriptome data, $n_{row}$ is the number of rows contained in the spatial transcriptome data, and $[\cdot]$ denotes rounding to an integer.

4. The method for spatially variable gene identification of spatial transcriptomics data according to claim 2, wherein the stability test in step S2 is a Box-Pierce test for separately performing stability tests on the output data processed by the different half-pooling parameters.

5. The method for spatially-variable gene identification of spatial transcriptomics data of claim 4, wherein the parameter settings in the Box-Pierce test comprise: maximum lag order $m = [\ln(T)]$, wherein $T$ is the length of the output data after half-pooling treatment, and $[\cdot]$ denotes rounding to an integer.

6. The method for identifying spatially-variable genes for spatially-transcriptomic data according to claim 2, wherein the combined test in step S3 adopts the Stouffer combination method, calculated as follows:

$$z_{stouffer} = \frac{\sum_{k=1}^{K} \Phi^{-1}(1 - p_k)}{\sqrt{K}} \sim N(0, 1)$$

wherein $p_k$ is the P value of the $k$-th stability test, $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, $K$ is the number of sets of half-pooling parameters, and $N(0, 1)$ is the standard normal distribution.

7. The method for spatially-variable gene identification of spatial transcriptomics data of claim 1, wherein step S4 further comprises correcting the combined test result by the Holm method.

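8. A spatially-variable gene recognition system for spatial transcriptomics data, using the method of claim 1, the system comprising:

a half-pooling processing module, used for performing half-pooling treatment on the original gene expression data of each gene;

a stability test module, used for performing a stability test on the output data after half-pooling treatment;

a combined test module, used for performing a combined test on the plurality of stability test results;

and a spatially variable gene judging module, used for judging whether each gene is a spatially variable gene according to the combined test result.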

9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method according to any of claims 1-7.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-7.