CN116453597A - Space variable gene identification method and system for space transcriptome data - Google Patents

Space variable gene identification method and system for space transcriptome data

Info

Publication number
CN116453597A
Authority
CN
China
Prior art keywords
pooling
data
spatially
variable gene
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310369928.6A
Other languages
Chinese (zh)
Inventor
俞章盛
袁欣
马嫣然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310369928.6A priority Critical patent/CN116453597A/en
Publication of CN116453597A publication Critical patent/CN116453597A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a spatially variable gene identification method for spatial transcriptomics data, comprising: performing data conversion and feature extraction on the original data by a half-pooling method; performing a stability test on the output data obtained by the half-pooling processing; and combining the results of the stability tests to identify spatially variable genes. Compared with the prior art, the method has the advantages of high identification accuracy and high calculation speed.

Description

Space variable gene identification method and system for space transcriptome data
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a spatially variable gene identification method and system for spatial transcriptome data.
Background
The rapid development of spatial transcriptomics technology has driven research on tissue structure reconstruction, development, disease and related topics, and large-scale spatial transcriptomics studies are becoming common. One important problem that is unique to spatial transcriptomics analysis is the identification of spatially variable genes. A spatially variable gene is a gene whose expression shows a spatial pattern across the tissue. In terms of the data, the expression count of a spatially variable gene has a specific relationship with spatial position.
Traditional spatial statistical models often fail when faced with spatial transcriptome data that are large in volume, complex in structure, high-dimensional and sparse, so a spatially variable gene identification method adapted to the characteristics of spatial transcriptome data needs to be developed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a space variable gene identification method and a system for space transcriptome data, which have high identification accuracy and high calculation speed.
The aim of the invention can be achieved by the following technical scheme:
According to a first aspect of the present invention, there is provided a spatially variable gene identification method for spatial transcriptomics data, the method comprising the following steps:
S1, performing half-pooling processing on the original gene expression data of each gene;
S2, performing a stability test on the output data of the half-pooling processing;
S3, performing a combination test on the multiple stability test results;
S4, determining whether the gene is a spatially variable gene according to the combination test result.
Preferably, the half-pooling processing in step S1 is specifically: for each of K given groups of half-pooling parameters, computing averages over the spatial transcriptome data and rearranging the resulting output data into a one-dimensional sequence according to spatial position; each group of half-pooling parameters comprises a direction parameter and a step size parameter.
Preferably, the half-pooling processing uses four different groups of half-pooling parameters, respectively:
1) The direction is: row direction, step size: $n_{row}$
2) The direction is: row direction, step size:
3) The direction is: column direction, step size: $n_{col}$
4) The direction is: column direction, step size:
where $n_{col}$ is the number of columns contained in the spatial transcriptome data, $n_{row}$ is the number of rows contained in the spatial transcriptome data, and [·] denotes rounding to an integer.
Preferably, the stability test in step S2 is a Box-Pierce test, applied separately to the output data obtained with each group of half-pooling parameters.
Preferably, the parameter setting in the Box-Pierce test includes: maximum lag order m = [ln(T)], where T is the length of the output data after half-pooling processing and [·] denotes rounding to an integer.
Preferably, the combination test in step S3 uses the Stouffer combination method, calculated as follows:
$$z_{stouffer} = \frac{\sum_{k=1}^{K} \Phi^{-1}(1 - p_k)}{\sqrt{K}} \sim N(0, 1)$$
where $p_k$ is the P value of the k-th stability test, $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, K is the number of groups of half-pooling parameters, and N(0, 1) is the standard normal distribution.
Preferably, step S4 further comprises applying the Holm correction to the combination test results.
According to a second aspect of the present invention there is provided a spatially variable gene recognition system based on spatial transcriptomics data, the system comprising:
a half-pooling processing module, used for performing half-pooling processing on the original gene expression data of each gene;
a stability test module, used for performing a stability test on the output data of the half-pooling processing;
a combination test module, used for performing a combination test on the multiple stability test results;
and a spatially variable gene determination module, used for determining whether each gene is a spatially variable gene according to the combination test result.
According to a third aspect of the present invention there is provided an electronic device comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method of any one of the above when executing the program.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any one of the above.
Compared with the prior art, the invention has the following advantages:
1) The invention performs data conversion and feature extraction on the original data by a half-pooling method, performs a stability test on the output data of the half-pooling processing, and performs a combination test on the stability test results to identify spatially variable genes, giving high identification accuracy and high calculation speed;
2) The invention uses a half-pooling method with direction and step size parameters for data conversion and feature extraction, which suits large-scale spatial transcriptome data that are large in volume, complex in structure, high-dimensional and sparse;
3) The Box-Pierce test is used for the stability test on the output data of the half-pooling processing, giving high accuracy;
4) The Stouffer combination method is used to combine the multiple stability test results, which improves the accuracy of the test result;
5) The P value of the combination test is corrected by the Holm method, which effectively controls the false positive rate and improves identification accuracy.
Drawings
FIG. 1 is a flow chart of a spatially variable gene recognition method of the present invention.
FIG. 2 is a schematic diagram showing the implementation of the half-pooling process step of the present invention.
FIG. 3 is a partial schematic of the results of an embodiment of the present invention, showing the 20 top-ranked spatially variable genes identified in the embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Examples
This embodiment provides a spatially variable gene identification method for large-scale spatial transcriptome data, comprising the following steps:
S1, performing half-pooling processing on the original gene expression data of each gene;
S2, performing a stability test on the output data of the half-pooling processing;
S3, performing a combination test on the multiple stability test results;
S4, determining whether the gene is a spatially variable gene according to the combination test result.
The specific implementation of the method in this embodiment is described in detail below.
This example uses spatial transcriptome data from colorectal cancer tissue, a freely available public dataset (http://www.cancerdiversity.asia/scclm/).
1. Dataset preprocessing
The original dataset was filtered to remove genes that were not expressed or were lowly expressed. The filtering criterion used in this example was: genes with an expression ratio below 1% across all spots were filtered out. The filtered dataset included 15427 genes and 4124 spots, arranged in 78 rows and 128 columns.
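The filtering rule above can be sketched as follows. This is only an illustration under an assumption: the "expression ratio" of a gene is read here as the fraction of spots with a nonzero count, which the source does not state explicitly. The function name `filter_genes` and the toy data are hypothetical.

```python
def filter_genes(counts, min_ratio=0.01):
    """Keep genes expressed in at least `min_ratio` of the spots.

    counts: dict mapping gene name -> list of per-spot expression counts.
    Assumption: a spot "expresses" a gene when its count is greater than zero.
    """
    kept = {}
    for gene, values in counts.items():
        expressed = sum(1 for v in values if v > 0)
        if values and expressed / len(values) >= min_ratio:
            kept[gene] = values
    return kept


# Toy data with 200 spots per gene (hypothetical values).
toy = {
    "geneA": [0] * 199 + [3],        # expressed in 0.5% of spots: dropped
    "geneB": [0] * 190 + [1] * 10,   # expressed in 5% of spots: kept
    "geneC": [0] * 200,              # never expressed: dropped
}
filtered = filter_genes(toy)
```

Other readings of the criterion (for example, a threshold on total counts) would need a different predicate inside the loop.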
2. Half-pooling processing
The average of the expression data of each gene in space is calculated according to the given direction and step size parameters. The four specific groups of parameters are as follows:
1) The direction is: row direction, step size: 78;
2) The direction is: row direction, step size:
3) The direction is: column direction, step size: 128;
4) The direction is: column direction, step size:
where [·] denotes rounding to an integer. A schematic diagram of the half-pooling processing is shown in FIG. 2.
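The half-pooling step can be illustrated with a short sketch. It rests on several assumptions that the source does not spell out: windows are non-overlapping, "row direction" means that each window spans consecutive rows within one column (so a step size equal to the number of rows yields one mean per column), grid positions without a spot are marked `None` and ignored, and the output is ordered window by window. The function name `half_pool` is hypothetical.

```python
def half_pool(grid, direction, step):
    """Average non-overlapping windows of `step` spots along one direction.

    grid: list of rows; each entry is a number, or None where no spot exists.
    direction: "row" pools consecutive rows within each column;
               "column" pools consecutive columns within each row.
    Returns the window means as a one-dimensional list. Windows that contain
    no observed spot are skipped (an assumption of this sketch).
    """
    if direction == "column":
        # Pooling along columns equals pooling along rows of the transpose.
        grid = [list(col) for col in zip(*grid)]
    elif direction != "row":
        raise ValueError("direction must be 'row' or 'column'")
    n_lines = len(grid)
    n_other = len(grid[0])
    sequence = []
    for start in range(0, n_lines, step):
        stop = min(start + step, n_lines)
        for j in range(n_other):
            window = [grid[i][j] for i in range(start, stop)
                      if grid[i][j] is not None]
            if window:
                sequence.append(sum(window) / len(window))
    return sequence


# Toy 3 x 4 grid for one gene (hypothetical values).
toy_grid = [[0, 1, 2, 3],
            [4, 5, 6, 7],
            [8, 9, 10, 11]]
col_means = half_pool(toy_grid, "row", 3)      # one mean per column
row_means = half_pool(toy_grid, "column", 4)   # one mean per row
```

On a fully occupied 78 x 128 grid this sketch returns 128 values for the row direction with step size 78 and 78 values for the column direction with step size 128, which agrees with the first and third sequence lengths reported in the example.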
3. Stability test
According to the four given groups of half-pooling parameters, half-pooling each gene yields four new output sequences, whose lengths are 128, 584, 78 and 390, respectively. Each half-pooled output sequence $r = (r_1, \ldots, r_t, \ldots, r_T)^T$ is subjected to a stability test, namely the Box-Pierce test, which is a method for testing the autocorrelation of sequence data. The test statistic $Q_m$ follows a $\chi^2$ distribution with m degrees of freedom, and the Box-Pierce test statistic is calculated as:
$$Q_m = T \sum_{k=1}^{m} \hat{\rho}_k^2, \qquad \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0}, \qquad \hat{\gamma}_k = \frac{1}{T} \sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})$$
where $\hat{\rho}_k$ denotes the lag-k autocorrelation coefficient, $\hat{\gamma}_k$ denotes the lag-k autocovariance, $r = (r_1, \ldots, r_t, \ldots, r_T)^T$ is the data output by the half-pooling processing, $\bar{r}$ is the mean of r, m = [ln(T)], T is the length of the output sequence after half-pooling processing, and [·] denotes rounding to an integer.
The degrees of freedom of the Box-Pierce test for the four output sequences are 4, 5, 6 and 6, respectively. The stability test is performed on each of the four half-pooled output sequences of every gene to obtain the corresponding P values.
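A minimal sketch of the Box-Pierce test described above, using only the Python standard library. Two points are assumptions of this sketch rather than statements from the source: [·] in m = [ln(T)] is read as rounding to the nearest integer, and the chi-square tail probability is computed by a small helper `chi2_sf` (in practice `scipy.stats.chi2.sf` would be the usual choice). Both function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 or df = 1, then apply the
    # recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def box_pierce(r, m=None):
    """Return (Q_m, P value) of the Box-Pierce test for the sequence r."""
    T = len(r)
    if m is None:
        # Assumption: [.] means rounding to the nearest integer.
        m = max(1, round(math.log(T)))
    mean = sum(r) / T
    d = [v - mean for v in r]
    gamma0 = sum(v * v for v in d) / T
    if gamma0 == 0:
        return 0.0, 1.0  # constant sequence: no autocorrelation to detect
    q = 0.0
    for k in range(1, m + 1):
        gamma_k = sum(d[t] * d[t - k] for t in range(k, T)) / T
        q += (gamma_k / gamma0) ** 2
    q *= T
    return q, chi2_sf(q, m)


# A sequence with a strong trend is highly autocorrelated, so the
# test rejects the null hypothesis of no autocorrelation.
q_stat, p_value = box_pierce(list(range(100)))
```

A small P value indicates that the half-pooled sequence is autocorrelated, which is the signal the method uses as evidence of a spatial expression pattern.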
4. Combination test
For the P values of the 4 stability tests of each gene, a combination test is performed using the Stouffer combination method, which converts the P values of multiple independent hypothesis tests into a single P value. Assuming a total of h P values $p_1, \ldots, p_h$, the combination is calculated as:
$$z_{stouffer} = \frac{\sum_{i=1}^{h} \Phi^{-1}(1 - p_i)}{\sqrt{h}}$$
where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, whose specific form is:
$$\Phi^{-1}(p) = \sqrt{2}\, \mathrm{erf}^{-1}(2p - 1)$$
where $\mathrm{erf}^{-1}(x)$ is the inverse of the error function, defined as the number y such that erf(y) = x. The inverse error function has no simple analytical formula and is typically computed numerically.
The P value of the combination test is calculated as:
$$p_c = 1 - \Phi(z_{stouffer})$$
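The combination step can be sketched with the standard library's `statistics.NormalDist`. The sketch assumes the usual one-sided Stouffer form, in which each P value is mapped to $\Phi^{-1}(1 - p)$ before summing; the function name `stouffer` and the clipping constants are hypothetical choices made for numerical safety, not part of the source.

```python
import math
from statistics import NormalDist


def stouffer(p_values, eps=1e-300):
    """Combine P values into (z_stouffer, combined P value)."""
    std_normal = NormalDist()
    z_scores = []
    for p in p_values:
        # Clip so that inv_cdf never receives exactly 0 or 1.
        p = min(max(p, eps), 1.0 - 1e-16)
        # Phi^{-1}(1 - p) = -Phi^{-1}(p); this form keeps precision for tiny p.
        z_scores.append(-std_normal.inv_cdf(p))
    z = sum(z_scores) / math.sqrt(len(z_scores))
    # 1 - Phi(z), written with erfc to avoid cancellation for large z.
    p_combined = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_combined


# Four moderately small P values combine into a much smaller one.
z_demo, p_demo = stouffer([0.05, 0.05, 0.05, 0.05])
```

When all input P values are extremely small, the combined P value can underflow towards zero in floating-point arithmetic, which is consistent with a reported combined P value of 0.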
5. P value correction
To control the false positive rate, the Holm method is used to correct the P values of the combination test. Genes whose Holm-corrected P value is less than 0.05 are considered spatially variable genes. The Holm method is a commonly used multiple-comparison correction method for controlling the error rate. Specifically, all P values are first sorted in ascending order: $p_{(1)}, p_{(2)}, \ldots, p_{(rank)}, \ldots, p_{(P)}$, where P is the number of P values, and a correction factor is calculated for each sorted P value: factor(rank) = P − rank + 1. The corrected P value is the sorted P value multiplied by its correction factor:
$$\tilde{p}_{(rank)} = (P - rank + 1)\, p_{(rank)}$$
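The Holm correction (sorting the P values, then scaling each by P − rank + 1) can be sketched as below. Note one assumption: the sketch follows the standard form of the Holm adjustment, which additionally enforces that adjusted values never decrease with rank and caps them at 1; the source text does not say whether these two refinements are applied. The function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted P values in the original input order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        factor = n - rank + 1  # correction factor for this rank
        # Standard-form refinements (assumptions relative to the source):
        # keep adjusted values monotone in rank and cap them at 1.
        running_max = max(running_max, factor * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Toy combined P values for three genes (hypothetical values).
adjusted_demo = holm_adjust([0.01, 0.04, 0.03])
significant = [p < 0.05 for p in adjusted_demo]
```

With the 0.05 threshold used in this example, only the first toy gene would be called a spatially variable gene.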
In this example, 8020 spatially variable genes were identified. The top 20 genes are shown in FIG. 3; they visibly have obvious spatial expression patterns, indicating that this example can effectively identify genes with spatial expression patterns.
Taking gene B2M as an example, its spatial expression data are half-pooled with the four given groups of parameters to obtain four new output sequences, and a stability test is performed on each output sequence. The resulting P values are <2.2e-16, <2.2e-16, <2.2e-16 and <2.2e-16. The P value of the combination test is 0 and the corrected P value is 0, so B2M is a spatially variable gene. As can also be seen in FIG. 3, the expression of B2M has a distinct spatial pattern.
The electronic device of the present invention includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) or computer program instructions loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device can also be stored. The CPU, ROM and RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
A plurality of components in a device are connected to an I/O interface, comprising: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communication unit allows the device to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processing unit performs the respective methods and processes described above, for example, the methods S1 to S4. For example, in some embodiments, methods S1-S4 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of the methods S1 to S4 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform methods S1-S4 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), etc.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various equivalent modifications and substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A spatially variable gene identification method for spatial transcriptomics data, the method comprising the following steps:
S1, performing half-pooling processing on the original gene expression data of each gene;
S2, performing a stability test on the output data of the half-pooling processing;
S3, performing a combination test on the multiple stability test results;
S4, determining whether the gene is a spatially variable gene according to the combination test result.
2. The spatially variable gene identification method for spatial transcriptomics data according to claim 1, wherein the half-pooling processing in step S1 is specifically: for each of K given groups of half-pooling parameters, computing averages over the spatial transcriptome data and rearranging the resulting output data into a one-dimensional sequence according to spatial position; each group of half-pooling parameters comprises a direction parameter and a step size parameter.
3. The spatially variable gene identification method for spatial transcriptomics data according to claim 2, wherein the half-pooling processing uses four different groups of half-pooling parameters, respectively:
1) The direction is: row direction, step size: $n_{row}$
2) The direction is: row direction, step size:
3) The direction is: column direction, step size: $n_{col}$
4) The direction is: column direction, step size:
where $n_{col}$ is the number of columns contained in the spatial transcriptome data, $n_{row}$ is the number of rows contained in the spatial transcriptome data, and [·] denotes rounding to an integer.
4. The spatially variable gene identification method for spatial transcriptomics data according to claim 2, wherein the stability test in step S2 is a Box-Pierce test, applied separately to the output data obtained with each group of half-pooling parameters.
5. The spatially variable gene identification method for spatial transcriptomics data according to claim 4, wherein the parameter setting in the Box-Pierce test includes: maximum lag order m = [ln(T)], where T is the length of the output data after half-pooling processing and [·] denotes rounding to an integer.
6. The spatially variable gene identification method for spatial transcriptomics data according to claim 2, wherein the combination test in step S3 uses the Stouffer combination method, calculated as follows:
$$z_{stouffer} = \frac{\sum_{k=1}^{K} \Phi^{-1}(1 - p_k)}{\sqrt{K}} \sim N(0, 1)$$
where $p_k$ is the P value of the k-th stability test, $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution, K is the number of groups of half-pooling parameters, and N(0, 1) is the standard normal distribution.
7. The spatially variable gene identification method for spatial transcriptomics data according to claim 1, wherein step S4 further comprises applying the Holm correction to the combination test results.
8. A spatially variable gene identification system for spatial transcriptomics data, using the method of claim 1, the system comprising:
a half-pooling processing module, used for performing half-pooling processing on the original gene expression data of each gene;
a stability test module, used for performing a stability test on the output data of the half-pooling processing;
a combination test module, used for performing a combination test on the multiple stability test results;
and a spatially variable gene determination module, used for determining whether each gene is a spatially variable gene according to the combination test result.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method according to any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-7.
CN202310369928.6A 2023-04-07 2023-04-07 Space variable gene identification method and system for space transcriptome data Pending CN116453597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310369928.6A CN116453597A (en) 2023-04-07 2023-04-07 Space variable gene identification method and system for space transcriptome data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310369928.6A CN116453597A (en) 2023-04-07 2023-04-07 Space variable gene identification method and system for space transcriptome data

Publications (1)

Publication Number Publication Date
CN116453597A (en) 2023-07-18

Family

ID=87121415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310369928.6A Pending CN116453597A (en) 2023-04-07 2023-04-07 Space variable gene identification method and system for space transcriptome data

Country Status (1)

Country Link
CN (1) CN116453597A (en)

Similar Documents

Publication Publication Date Title
CN110348412B (en) Key point positioning method and device, electronic equipment and storage medium
Huang et al. Evaluation of variant detection software for pooled next-generation sequence data
CN100435138C (en) Sparse convolution of multiple vectors in a digital signal processor
CN110891000B (en) GPU bandwidth performance detection method, system and related device
CN113113150A (en) Lymph node metastasis prediction model construction and training method, device, equipment and medium
CN114168318A (en) Training method of storage release model, storage release method and equipment
Orlando et al. Manipulating large-scale Arabidopsis microarray expression data: identifying dominant expression patterns and biological process enrichment
US8768499B2 (en) Production index information generating device, program therefore, and production information generating method
CN116453597A (en) Space variable gene identification method and system for space transcriptome data
CN112634991A (en) Genotyping method, genotyping device, electronic device, and storage medium
Li et al. sRNAminer: A multifunctional toolkit for next-generation sequencing small RNA data mining in plants
US20130289890A1 (en) Rank Normalization for Differential Expression Analysis of Transcriptome Sequencing Data
WO2020124275A1 (en) Method, system, and computing device for optimizing computing operations of gene sequencing system
CN103942403B (en) A kind of method and apparatus screened to magnanimity variable
WO2022061974A1 (en) Data processing method for rapid quantitative expression of transcriptome, device and storage medium
CN115511262A (en) Transformer quality detection method and device
CN114496068A (en) Protein secondary structure prediction method, device, equipment and storage medium
CN113779926A (en) Circuit detection method and device, electronic equipment and readable storage medium
CN108269004B (en) Product life analysis method and terminal equipment
CN113627611A (en) Model training method and device, electronic equipment and storage medium
CN111950778A (en) Hardware development workload estimation method, device, terminal and storage medium
CN112698877A (en) Data processing method and system
CN110797082A (en) Method and system for storing and reading gene sequencing data
CN115938353B (en) Voice sample distributed sampling method, system, storage medium and electronic equipment
CN116380149B (en) Method and system for testing rotation of instrument code wheel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination