CN116705158A

CN116705158A - Clustering method, device, equipment and storage medium for spatial transcriptome data

Info

Publication number: CN116705158A
Application number: CN202310694736.2A
Authority: CN
Inventors: 刘沛文; 宫月; 侯睿; 杨家亮; 田埂
Original assignee: Beijing Yuanma Medical Laboratory Co ltd
Current assignee: Beijing Yuanma Medical Laboratory Co ltd
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2023-09-05

Abstract

The application provides a clustering method, device, equipment and storage medium for spatial transcriptome data, and relates to the technical field of bioinformatics. The method comprises: acquiring spatial transcriptome data of a preset biological sample, wherein the spatial transcriptome data comprises gene expression data of the preset biological sample at a plurality of spatial sampling sites; constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites; constructing a simple undirected graph according to each hyperedge in the hypergraph model; and clustering the points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster. With the clustering method provided by the application, tissue regions with specific biological functions, formed by the one or more cells contained in the biological sample at each sampling site, can be obtained, so that the analysis result of the clustering algorithm better matches the structure of the spatial transcriptome data.

Description

Clustering method, device, equipment and storage medium for spatial transcriptome data

Technical Field

The application relates to the technical field of bioinformatics, and in particular to a clustering method, device, equipment and storage medium for spatial transcriptome data.

Background

A spatial transcriptome is an expression profile that analyzes and describes a particular cell type in a spatial dimension, and provides not only gene expression data at different spatial locations in a biological sample, but also spatial location information corresponding to the gene expression data.

To use spatial transcriptome data to obtain information about a biological sample, a clustering algorithm must perform cluster analysis on the plurality of sampling sites of the spatial transcriptome data. However, existing clustering algorithms assume that one sampling site belongs to only one cell type and one class. This does not match the data structure of the spatial transcriptome and can cause misjudgment of the information in the biological sample.

Disclosure of Invention

The present application aims to solve the above-mentioned drawbacks of the prior art, and provides a method, a device and a storage medium for clustering spatial transcriptome data, so as to solve the problems of the prior art.

In order to achieve the above purpose, the technical scheme adopted by the embodiment of the application is as follows:

In a first aspect, an embodiment of the present application provides a method for clustering spatial transcriptome data, including:

acquiring spatial transcriptome data of a preset biological sample; wherein the spatial transcriptome data includes gene expression data of the preset biological sample at a plurality of spatial sampling sites, the gene expression data of each spatial sampling site including: expression data of a plurality of genes at that spatial sampling site;

constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites, wherein the hypergraph model comprises: at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of those sampling sites;

constructing a simple undirected graph according to each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to one hyperedge, and the weight of the undirected edge between two points indicates information about the spatial sampling sites shared by the corresponding hyperedges;

and clustering the points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster.

In an embodiment, before the constructing the hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites, the method further includes:

normalizing the gene expression data of the plurality of spatial sampling sites so that the total expression amounts of all genes in the plurality of spatial sampling sites are the same;

calculating the variance of the expression amount of each gene across the plurality of spatial sampling sites according to the normalized gene expression data;

selecting a preset number of target genes from the genes according to the variance of the expression amount of each gene across the plurality of spatial sampling sites;

the constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites comprises the following step:

constructing the hypergraph model according to the expression data of the target genes at the plurality of spatial sampling sites.

In an embodiment, the constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites includes:

binarizing the gene expression data of the plurality of spatial sampling sites to obtain binarized data of the plurality of spatial sampling sites;

and constructing the hypergraph model according to the binarization data of the plurality of spatial sampling sites.

In an embodiment, before the simple undirected graph is constructed according to each hyperedge in the hypergraph model, the method further includes:

determining, according to the position information of the plurality of spatial sampling sites, the continuity of the spatial sampling sites corresponding to the at least one hyperedge in the two-dimensional space where the hypergraph model is located;

splitting each hyperedge whose continuity in the two-dimensional space does not meet a preset condition, so that the continuity of the spatial sampling sites corresponding to each resulting hyperedge meets the preset condition, thereby obtaining an optimized hypergraph model;

the constructing of the simple undirected graph according to each hyperedge in the hypergraph model comprises the following step:

constructing the simple undirected graph according to each hyperedge in the optimized hypergraph model.
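The splitting step can be illustrated with a minimal sketch. The preset continuity condition is not specified at this point in the text, so the sketch assumes, purely for illustration, that sampling sites lie on an integer grid, that two sites are contiguous when they are direct horizontal or vertical neighbours, and that a hyperedge meets the condition when all of its sites form one connected group. The function name `split_by_continuity` and the example coordinates are hypothetical.

```python
from collections import deque

def split_by_continuity(sites, coords):
    """Split one hyperedge (a set of site names) into spatially connected parts.

    Assumption (not from the source): sites sit on an integer grid and two
    sites are contiguous when their Manhattan distance is exactly 1.
    """
    remaining = set(sites)
    parts = []
    while remaining:
        start = min(remaining)          # deterministic starting site
        remaining.remove(start)
        component, queue = {start}, deque([start])
        while queue:                    # breadth-first search over neighbours
            x, y = coords[queue.popleft()]
            neighbours = {
                s for s in remaining
                if abs(coords[s][0] - x) + abs(coords[s][1] - y) == 1
            }
            remaining -= neighbours
            component |= neighbours
            queue.extend(neighbours)
        parts.append(component)
    return parts

# Hypothetical coordinates: two groups of sites far apart on the slide.
coords = {"s1": (0, 0), "s2": (0, 1), "s3": (5, 5), "s4": (5, 6)}
parts = split_by_continuity({"s1", "s2", "s3", "s4"}, coords)
```

Under these assumptions the single hyperedge is split into two hyperedges, one per spatially connected group. A different neighbourhood definition (for example, a hexagonal layout) would only change the adjacency test.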

In an embodiment, the constructing a simple undirected graph according to each hyperedge in the hypergraph model includes:

setting each hyperedge in the hypergraph model as a point in the simple undirected graph;

and setting the weight of the undirected edge between the two points corresponding to two hyperedges in the simple undirected graph according to the number of spatial sampling sites shared by the two hyperedges in the hypergraph model.

In an embodiment, the method further comprises:

and determining the marker gene of each point cluster and the cell type corresponding to each point cluster according to the gene expression clustering result of each point cluster in the at least one point cluster.

In one embodiment, the determining the marker gene of each point cluster and the cell type corresponding to each point cluster according to the gene expression clustering result of each point cluster includes:

obtaining the marker genes of each point cluster by applying a preset differential analysis algorithm to the gene expression clustering result of each point cluster;

and determining the cell type corresponding to each point cluster according to the marker genes of each point cluster and a preset correspondence between marker genes and cell types.

In a second aspect, an embodiment of the present application provides a spatial transcriptome data clustering apparatus, including:

the acquisition module is used for acquiring spatial transcriptome data of a preset biological sample; wherein the spatial transcriptome data includes gene expression data of the preset biological sample at a plurality of spatial sampling sites, the gene expression data of each spatial sampling site including: expression data of a plurality of genes at that spatial sampling site;

the hypergraph model construction module is used for constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites, wherein the hypergraph model comprises: at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site;

the undirected graph construction module is used for constructing a simple undirected graph according to each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to one hyperedge, and the weight of the undirected edge between two points indicates information about the spatial sampling sites shared by the corresponding hyperedges;

and the clustering module is used for clustering the points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster.

In a third aspect, an embodiment of the present application provides a computer device, including: a processor, a storage medium and a bus, wherein the storage medium stores program instructions executable by the processor; when the computer device runs, the processor and the storage medium communicate through the bus, and the processor executes the program instructions to perform the steps of the clustering method of spatial transcriptome data according to the above embodiments.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for clustering spatial transcriptome data as described in the above embodiments.

The beneficial effects of the application are as follows. The application provides a clustering method, device, equipment and storage medium for spatial transcriptome data. In the method, spatial transcriptome data of a preset biological sample is first acquired, wherein the spatial transcriptome data comprises gene expression data of the preset biological sample at a plurality of spatial sampling sites, and the gene expression data of each spatial sampling site comprises expression data of a plurality of genes at that spatial sampling site. Next, a hypergraph model of the preset biological sample is constructed according to the gene expression data of the plurality of spatial sampling sites, wherein the hypergraph model comprises at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of those sampling sites. Then, a simple undirected graph is constructed according to each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to one hyperedge, and the weight of the undirected edge between two points indicates information about the spatial sampling sites shared by the corresponding hyperedges. Finally, the points in the simple undirected graph are clustered to obtain a gene expression clustering result of at least one point cluster.

With the clustering method of spatial transcriptome data provided by the application, the same sampling site of the preset biological sample can belong to a plurality of hyperedges at the same time. Therefore, after the hyperedges are clustered to obtain the gene expression clustering results, some sampling sites can carry several different gene expression clustering results simultaneously, that is, the same sampling site can contain a plurality of different cell types. This realizes a detailed division of the types of the plurality of cells at the same sampling site in the spatial transcriptome data, and tissue regions with specific biological functions, formed by the one or more cells contained in the biological sample at each sampling site, can be obtained. As a result, the analysis result of the clustering algorithm better matches the structure of the spatial transcriptome data, and misjudgment of information in the preset biological sample caused by a mismatch is avoided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for clustering spatial transcriptome data according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a constructed hypergraph model according to one embodiment of the present application;

FIG. 3 is a flowchart illustrating a method for spatial transcriptome data preprocessing according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for building a hypergraph model according to an embodiment of the present application;

FIG. 5 is a flowchart of a method for optimizing a hypergraph model according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the result of optimizing a hypergraph model according to an embodiment of the present application;

FIG. 7 is a flow chart of a method for constructing a simple undirected graph according to one embodiment of the present application;

FIG. 8 is a schematic diagram of a simple undirected graph provided by one embodiment of the present application;

FIG. 9 is a flowchart of a method for obtaining cell types according to a clustering result according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a spatial transcriptome data clustering apparatus according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application.

Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Furthermore, the terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.

One key issue in biological analysis is the classification of cells in biological samples. Current research relies primarily on gene expression data provided by single-cell sequencing techniques to classify cell functions, but single-cell sequencing techniques cannot obtain the spatial location information of cells. The emergence of spatial transcriptome technology provides not only gene expression data at different spatial locations in a biological sample but also the spatial location information corresponding to that gene expression data. If spatial transcriptome data can be used to obtain the information in a biological sample, the spatial location information of the cells can be obtained together with their classification, which is of great significance for biological analysis.

However, existing clustering algorithms for cluster analysis of spatial transcriptome data have a defect: they cannot fully integrate spatial position information with gene expression data when classifying the cells of a biological sample. This shows mainly in the following. Spatial transcriptome data is obtained by collecting data at a plurality of sampling sites of a biological sample, and information from a plurality of cells exists in the same sampling site (generally, a cell has a diameter of 10-15 μm, whereas one sampling site of the spatial transcriptome has a diameter of 55 μm, so one sampling site actually contains 2-10 cells). After an existing clustering algorithm performs cluster analysis on the spatial transcriptome data, the plurality of cells at each sampling site can only be assigned to the same class, which does not conform to the data structure of the spatial transcriptome. Therefore, the application provides a clustering method for spatial transcriptome data that can divide in detail the types of the plurality of cells at the same sampling site, so that the analysis result of the clustering algorithm better matches the structure of the spatial transcriptome data.

The clustering method of the spatial transcriptome data provided by the application is specifically illustrated by a plurality of examples with reference to the accompanying drawings.

Firstly, it should be noted that the embodiment of the present application provides a spatial transcriptome data clustering method, and the clustering may be performed by any computer device in which a preset spatial transcriptome data clustering algorithm is integrated. The computer device may be, for example, a terminal-oriented computer device or a back-end server.

Fig. 1 is a flowchart of a method for clustering spatial transcriptome data according to an embodiment of the present application. As shown in fig. 1, the method includes:

S101, acquiring spatial transcriptome data of a preset biological sample.

In this embodiment, the spatial transcriptome of the biological sample is subjected to cluster analysis, so that spatial transcriptome data of a predetermined biological sample needs to be acquired before the cluster analysis.

The spatial transcriptome data comprises gene expression data of a preset biological sample at a plurality of spatial sampling sites, and the gene expression data of each spatial sampling site comprises: expression data of a plurality of genes at the spatial sampling site. The preset biological sample can be any biological sample to be identified, the number of the spatial sampling sites can be adjusted according to actual requirements, and the type of the preset biological sample and the number of the spatial sampling sites are not limited in the embodiment.

The spatial transcriptome data of the preset biological sample may be as shown in Table 1, for example. In Table 1, each row corresponds to a sampling site and each column corresponds to a gene, and each value is the expression level of the column's gene detected at the row's sampling site; that is, each row gives the expression level of every gene at that sampling site. Because Table 1 is an extremely sparse matrix, the expression level of most genes in Table 1 is 0.

Table 1. Spatial transcriptome data of the preset biological sample

S102, constructing a hypergraph model of a preset biological sample according to gene expression data of a plurality of spatial sampling sites.

After the spatial transcriptome data of the preset biological sample is obtained, a hypergraph model of the preset biological sample can be constructed according to the gene expression data of the plurality of spatial sampling sites, and specific steps for constructing the hypergraph model can be described in the following embodiments, which are not repeated in the present embodiment.

In the hypergraph model established in this embodiment, there is at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of those sampling sites (that is, a hyperedge in the present application includes the expression data and the spatial continuity information of two or more sampling sites). FIG. 2 is a schematic diagram of a constructed hypergraph model provided by one embodiment of the present application. As shown in FIG. 2, each dot represents one sampling site and each closed circle represents one hyperedge, so there are 3 hyperedges in the hypergraph model shown in FIG. 2, and each hyperedge represents the expression of the same gene at different sampling sites. (The hypergraph model in FIG. 2 does not correspond to the data in Table 1; FIG. 2 is only an example. In actual operation, the hypergraph model constructed from the spatial transcriptome data corresponds strictly to that data.)

As can be seen from FIG. 2, the same sampling site may belong to multiple hyperedges at the same time, that is, multiple different genes may be expressed at the same sampling site. This matches the structure of the spatial transcriptome data, so the result obtained by performing cluster analysis on the spatial transcriptome data according to the method of this embodiment matches the structure of the spatial transcriptome data.

S103, constructing a simple undirected graph according to each hyperedge in the hypergraph model.

After the hypergraph model of the preset biological sample is built, a simple undirected graph can be built according to each hyperedge in the hypergraph model, and the cluster analysis of the spatial transcriptome data can be completed through the simple undirected graph. The specific steps for constructing the simple undirected graph are described in the following embodiments and are not repeated here.

In this embodiment, each point in the constructed simple undirected graph corresponds to one hyperedge (for example, there are 3 hyperedges in the hypergraph model shown in FIG. 2, so there are 3 points in the simple undirected graph constructed from it), and the weight of the undirected edge connecting two points indicates information about the sampling sites shared by the two hyperedges corresponding to those points.
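As an illustration of this step, the following minimal sketch turns a set of hyperedges into a weighted simple undirected graph. It uses the number of shared sampling sites as the edge weight, as stated in the embodiment of the disclosure that sets the weight according to the number of common sampling sites. Leaving out edges between hyperedges with no shared site is an assumption of the sketch, and the names `build_simple_graph`, `e1`, `e2`, `e3` and the site names are hypothetical.

```python
from itertools import combinations

def build_simple_graph(hyperedges):
    """Map each hyperedge to a point; weight each edge by shared site count.

    `hyperedges` maps a hyperedge name to the set of sampling sites it covers.
    Assumption: two points are joined only if their hyperedges share a site.
    """
    points = sorted(hyperedges)
    edges = {}
    for a, b in combinations(points, 2):
        shared = len(hyperedges[a] & hyperedges[b])
        if shared > 0:
            edges[(a, b)] = shared
    return points, edges

# Hypothetical hypergraph with three hyperedges over five sampling sites.
hyperedges = {
    "e1": {"s1", "s2", "s3"},
    "e2": {"s2", "s3", "s4"},
    "e3": {"s4", "s5"},
}
points, edges = build_simple_graph(hyperedges)
```

Here `e1` and `e2` share two sites, `e2` and `e3` share one, and `e1` and `e3` share none, so the graph has three points and two weighted edges.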

S104, clustering points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster.

Finally, after the simple undirected graph is constructed according to each hyperedge in the hypergraph model, the points in the simple undirected graph are clustered using a preset clustering algorithm to obtain a gene expression clustering result of at least one point cluster. The gene expression clustering result of a point cluster can be used to obtain the cell type corresponding to that point cluster, where one point cluster refers to the plurality of sampling sites corresponding to the same hyperedge. The preset clustering algorithm may be, for example, a graph-based unsupervised algorithm such as the Louvain or Leiden algorithm. Specific methods for obtaining cell types from the gene expression clustering results are described in the following embodiments of the present application.

In summary, this embodiment provides a clustering method for spatial transcriptome data. After a hypergraph model is constructed from the spatial transcriptome data of a preset biological sample using this method, the same sampling site of the preset biological sample can belong to multiple hyperedges at the same time. Therefore, after the hyperedges are clustered to obtain the gene expression clustering results, some sampling sites can carry several different gene expression clustering results simultaneously, that is, the same sampling site can contain multiple different cell types. This realizes a detailed division of the types of the multiple cells at the same sampling site in the spatial transcriptome data, and tissue regions with specific biological functions, formed by the one or more cells contained in the biological sample at each sampling site, can be obtained. As a result, the analysis result of the clustering algorithm better matches the structure of the spatial transcriptome data, and misjudgment of information in the preset biological sample caused by a mismatch is avoided.
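The way one sampling site can end up with several clustering results can be shown with a small sketch. It assumes that cluster labels for the hyperedges are already available (for example, from a graph-based algorithm such as Louvain or Leiden, which is not implemented here). The labels, the names `site_memberships` and `cluster_of`, and the example data are hypothetical.

```python
def site_memberships(hyperedges, cluster_of):
    """Collect, for every sampling site, the clusters of all hyperedges covering it.

    `hyperedges` maps a hyperedge name to its set of sampling sites;
    `cluster_of` maps a hyperedge name to an (assumed, precomputed) cluster label.
    """
    memberships = {}
    for edge, sites in hyperedges.items():
        for site in sites:
            memberships.setdefault(site, set()).add(cluster_of[edge])
    return memberships

# Hypothetical hyperedges and hypothetical cluster labels for them.
hyperedges = {"e1": {"s1", "s2"}, "e2": {"s2", "s3"}, "e3": {"s3", "s4"}}
cluster_of = {"e1": 0, "e2": 0, "e3": 1}
memberships = site_memberships(hyperedges, cluster_of)
```

In this example site `s3` is covered by hyperedges from both clusters, so it carries two labels, while `s1`, `s2` and `s4` each carry one. A clustering that assigns labels to sites directly could not express this.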

An embodiment of the present application further provides a method for preprocessing spatial transcriptome data. FIG. 3 is a schematic flow chart of the method for preprocessing spatial transcriptome data provided by an embodiment of the present application. As shown in FIG. 3, before the hypergraph model of the preset biological sample is constructed according to the gene expression data of the plurality of spatial sampling sites in step S102 of the foregoing embodiment, the method may further include:

S301, normalizing the gene expression data of the plurality of spatial sampling sites so that the total expression amount of all genes is the same at each of the plurality of spatial sampling sites.

Before the hypergraph model of the preset biological sample is constructed according to the gene expression data of the plurality of spatial sampling sites, the gene expression data of the spatial sampling sites can be preprocessed, so that the hypergraph model constructed from the preprocessed data fits the data more closely, the accuracy of the constructed hypergraph model is improved, and subsequent processing based on the hypergraph model is more convenient. The preprocessing method is set forth below.

First, after the spatial transcriptome data of the preset biological sample is obtained, the gene expression data of the plurality of spatial sampling sites may be normalized so that the total expression amount of all genes is the same at each spatial sampling site (i.e., the total gene expression in each row of Table 1 is identical, while the relative expression amounts between the genes in each row are unchanged). For example, the total gene expression of each row is normalized to 1000000 (see formula (1)).

The normalization formula is shown as formula (1) below, where $G_{i,j}$ is the normalized expression level of the j-th gene at the i-th sampling site, $C_{i,j}$ is the expression level of the j-th gene at the i-th sampling site before normalization, and N is the number of all genes, i.e., the number of columns in Table 1; for example, if Table 1 has 5 genes, N is 5.

$$G_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N} C_{i,k}} \times 1000000 \quad (1)$$
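Formula (1) can be illustrated with a short sketch in plain Python. The expression matrix below is made up for illustration (rows are sampling sites, columns are genes), the function name `normalize_rows` is hypothetical, and leaving an all-zero row unchanged is an assumption, since the text does not say how an empty sampling site is handled.

```python
def normalize_rows(counts, target=1_000_000):
    """Scale each row (sampling site) so that its values sum to `target`.

    Implements G[i][j] = C[i][j] / sum_k C[i][k] * target, as in formula (1).
    """
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a site with no detected expression stays all zeros.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([c / total * target for c in row])
    return normalized

# Made-up counts: 3 sampling sites (rows) by 5 genes (columns).
counts = [
    [2, 0, 6, 0, 2],
    [0, 1, 0, 3, 0],
    [5, 5, 0, 0, 0],
]
normalized = normalize_rows(counts)
```

After normalization every row sums to 1000000, while the ratios between genes within a row are preserved.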

S302, calculating the variance of the expression amount of each gene across the plurality of spatial sampling sites according to the normalized gene expression data.

After normalizing the gene expression data of the plurality of spatial sampling sites, the variance of the expression amount of each normalized gene in the plurality of spatial sampling sites, that is, the variance among a plurality of numerical values in each column after normalization in table 1, may be calculated, and the gene expression data may be further preprocessed through the variance.

S303, selecting a preset number of target genes from the genes according to the variance of the expression amounts of the genes at the spatial sampling sites.

After the variance of the expression amount of each gene across the plurality of spatial sampling sites is obtained, a preset number of genes can be selected from the normalized genes as target genes according to the variance, which completes the further preprocessing of the gene expression data. The preset number may be 3000, for example. In actual operation the preset number is not limited to 3000 and may be determined according to actual requirements; this embodiment does not limit its specific value.
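Steps S302 and S303 can be sketched as follows. The text says only that target genes are selected "according to the variance"; the sketch assumes that the genes with the largest variance are kept and that the population variance is used, both of which are assumptions. The function name `select_target_genes`, the gene names and the values are hypothetical.

```python
from statistics import pvariance

def select_target_genes(normalized, gene_names, n_top):
    """Return the `n_top` genes with the largest variance across sampling sites.

    Assumption: "according to the variance" means keeping the most variable
    genes; population variance is used for simplicity.
    """
    columns = list(zip(*normalized))            # one tuple of values per gene
    variances = {g: pvariance(col) for g, col in zip(gene_names, columns)}
    ranked = sorted(gene_names, key=lambda g: variances[g], reverse=True)
    return ranked[:n_top]

# Made-up normalized values: 3 sampling sites (rows) by 3 genes (columns).
normalized = [
    [10.0, 5.0, 0.0],
    [0.0, 5.0, 1.0],
    [20.0, 5.0, 0.0],
]
targets = select_target_genes(normalized, ["geneA", "geneB", "geneC"], n_top=2)
```

In this example `geneB` has the same value at every site, so its variance is zero and it is not selected.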

Then, in step S102, a hypergraph model of a preset biological sample is constructed according to the gene expression data of the plurality of spatial sampling sites, which may include:

S304, constructing a hypergraph model according to the expression data of the target genes at the plurality of spatial sampling sites.

Step S102 may be: after the preset number of target genes are obtained, a hypergraph model is constructed according to the expression data of the target genes at a plurality of space sampling sites.

In actual operation, RNA from one sampling site may flow to nearby sampling sites, which contaminates the data. Therefore, in an alternative embodiment, the data of each sampling site may be subjected to noise reduction before the gene expression data of the plurality of spatial sampling sites is normalized, so that transcript data that flowed out of a sampling site is recorded only at its original sampling site, thereby reducing the contamination. The noise reduction may be performed, for example, by the SpotClean or Sprod algorithm. The data structure before and after noise reduction is still the expression matrix of Table 1, but the expression amounts of some genes at some positions are changed.

In summary, by preprocessing the spatial transcriptome data with normalization, variance calculation and the like in the method of this embodiment, the total expression amount of all genes can be made the same at each of the plurality of spatial sampling sites, so that the hypergraph model constructed from the preprocessed data fits the spatial transcriptome data more closely and the accuracy of the constructed hypergraph model is improved.

An embodiment of the present application provides a possible implementation manner of constructing a hypergraph model, and fig. 4 is a schematic flow chart of a method for constructing a hypergraph model according to an embodiment of the present application, as shown in fig. 4, in step S102, constructing a hypergraph model of a preset biological sample according to gene expression data of a plurality of spatial sampling sites may include:

S401, binarizing the gene expression data of the plurality of spatial sampling sites to obtain binarized data of the plurality of spatial sampling sites.

In this embodiment, when the hypergraph model of the preset biological sample is constructed according to the gene expression data of the plurality of spatial sampling sites, the gene expression data may first be binarized, and the hypergraph model is then constructed from the binarized data.

The binarization operation specifically comprises: presetting a suitable threshold; comparing the normalized expression amount of each gene at each sampling site with the threshold; marking the gene as 1 at that site if its expression amount is greater than or equal to the threshold, and as 0 if its expression amount is smaller than the threshold; and traversing all genes to complete the binarization of the gene expression data of the plurality of spatial sampling sites. In the resulting binarized data, the expression of each gene at each sampling site is marked as 0 or 1.
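The thresholding rule above can be illustrated with a short sketch. The data layout (site mapped to gene mapped to normalized expression) and the function name `binarize` are assumptions for the example; the threshold value itself is left to the caller, as in the description.

```python
def binarize(normalized, threshold):
    """Mark each gene at each site as 1 if expression >= threshold, otherwise 0.

    normalized maps site -> {gene: normalized expression} (hypothetical layout).
    """
    return {
        site: {gene: 1 if value >= threshold else 0 for gene, value in genes.items()}
        for site, genes in normalized.items()
    }
```

Note that a value exactly equal to the threshold is marked as 1, matching the "greater than or equal to" condition in the text.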

S402, constructing a hypergraph model according to binarization data of a plurality of space sampling sites.

After the binarized data of the plurality of spatial sampling sites is obtained, the hypergraph model can be constructed from it: the sampling sites at which the same gene is marked as 1 are connected in sequence to obtain the hyperedge of that gene in the hypergraph model. For example, in fig. 2, the sampling sites inside the same closed circle are those at which the same gene has an expression amount reaching the threshold and is therefore marked as 1.

Optionally, if, after the sampling sites marked as 1 are connected in sequence, one of the resulting hyperedges contains only one sampling site, that hyperedge is deleted. Because the probability that a gene is expressed at only one sampling site is very small, deleting such hyperedges avoids misidentification caused by non-genetic factors when the hyperedges are built, and further improves the accuracy and reliability of the constructed hypergraph model.
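A minimal sketch of the hyperedge construction and the optional deletion of single-site hyperedges is given below. It assumes binarized data in the form site mapped to gene mapped to 0 or 1; the function name `build_hyperedges` and the parameter `min_sites` are hypothetical names introduced for illustration.

```python
def build_hyperedges(binary, min_sites=2):
    """Collect, for each gene, the set of sites where it is marked as 1.

    binary maps site -> {gene: 0 or 1} (hypothetical layout).
    Hyperedges with fewer than min_sites sites are dropped, which with the
    default value removes hyperedges that contain only one sampling site.
    """
    hyperedges = {}
    for site, genes in binary.items():
        for gene, flag in genes.items():
            if flag == 1:
                hyperedges.setdefault(gene, set()).add(site)
    return {gene: sites for gene, sites in hyperedges.items() if len(sites) >= min_sites}
```

Representing a hyperedge as a plain set of sites keeps the later steps (continuity checks and counting common sites) simple set operations.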

In this embodiment, the gene expression data is binarized and the hypergraph model is constructed from the resulting binarized data. This reduces the amount of data that the computer device must process to construct the hypergraph model, which increases the construction speed, and the binarized data can be identified more accurately by the computer device, which improves the accuracy of the constructed hypergraph model.

An embodiment of the present application further provides a possible implementation manner of optimizing a constructed hypergraph model, and fig. 5 is a schematic flow chart of a method for optimizing a hypergraph model according to an embodiment of the present application. As shown in fig. 5, before the simple undirected graph is constructed according to each hyperedge in the hypergraph model in step S103, the method may further include:

S501, determining, according to the position information of the plurality of spatial sampling sites, the continuity of the spatial sampling sites corresponding to at least one hyperedge in the two-dimensional space where the hypergraph model is located.

In the above embodiment, the spatial transcriptome data may further include position information of the plurality of spatial sampling sites. Therefore, after the hypergraph model is built, the continuity, in the two-dimensional space where the hypergraph model is located, of the spatial sampling sites corresponding to at least one of the plurality of hyperedges may be determined according to the position information of the plurality of spatial sampling sites.

The position information of the plurality of spatial sampling sites comprises row and column position information of the sampling sites, namely coordinate position information of the plurality of sampling sites in a two-dimensional space.

S502, dividing each hyperedge whose continuity in the two-dimensional space does not meet a preset condition, so that the continuity of the spatial sampling sites corresponding to each divided hyperedge meets the preset condition, and obtaining an optimized hypergraph model.

When it is determined that the spatial sampling sites corresponding to a certain hyperedge are discontinuous in the two-dimensional space where the hypergraph model is located, the hyperedge can be divided into a plurality of hyperedges, so that the continuity in the two-dimensional space of the spatial sampling sites corresponding to each divided hyperedge meets the preset condition, and the optimized hypergraph model is obtained. The number of hyperedges obtained by the division is determined by the continuity of the hyperedge before division.

Fig. 6 is a schematic diagram of a result of optimizing a hypergraph model according to an embodiment of the present application. Taking fig. 2 and fig. 6 as examples, the hypergraph model shown in fig. 2 contains a hyperedge whose corresponding spatial sampling sites are discontinuous in the two-dimensional space where the hypergraph model is located (the hyperedge contains a region of sampling sites that do not belong to it, which may be regarded as a region in which the gene is not expressed; that is, the hyperedge is essentially two different hyperedges of the same gene, which belong to the same gene but are not the same hyperedge). After this hyperedge is optimized by the optimization method provided in this embodiment, the hypergraph model shown in fig. 6 is obtained, in which the continuity of the spatial sampling sites corresponding to each divided hyperedge satisfies the preset condition, which completes the optimization of the hypergraph model.

Optionally, if, after the hypergraph model is optimized, one of the hyperedges of the optimized hypergraph model contains only one sampling site, that hyperedge is deleted. This avoids misidentification caused by non-genetic factors when the hyperedges are built, and further improves the accuracy and reliability of the constructed hypergraph model.
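The division of a discontinuous hyperedge can be sketched as a search for connected groups of sampling sites. The description does not fix the preset continuity condition, so the sketch below assumes one possible condition: two sites are contiguous when their (row, column) positions are adjacent horizontally or vertically. The function name `split_by_continuity` and the parameters `neighbors` and `min_sites` are hypothetical.

```python
from collections import deque

# Assumed adjacency rule: horizontal and vertical neighbours on a row/column grid.
FOUR_NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def split_by_continuity(sites, neighbors=FOUR_NEIGHBORS, min_sites=2):
    """Split one hyperedge into spatially contiguous hyperedges.

    sites is a set of (row, col) positions. Groups with fewer than min_sites
    sites are dropped, mirroring the optional deletion of single-site hyperedges.
    """
    remaining = set(sites)
    parts = []
    while remaining:
        start = min(remaining)  # deterministic starting point
        remaining.remove(start)
        component = {start}
        queue = deque([start])
        while queue:
            row, col = queue.popleft()
            for d_row, d_col in neighbors:
                candidate = (row + d_row, col + d_col)
                if candidate in remaining:
                    remaining.remove(candidate)
                    component.add(candidate)
                    queue.append(candidate)
        if len(component) >= min_sites:
            parts.append(component)
    return parts
```

A different sampling layout (for example a hexagonal arrangement of sites) would only require passing a different `neighbors` tuple; the number of resulting hyperedges follows from how many contiguous groups the original hyperedge contains.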

On this basis, constructing a simple undirected graph according to each hyperedge in the hypergraph model as described in the above embodiment may include:

S503, constructing a simple undirected graph according to each hyperedge in the optimized hypergraph model.

On the basis of the optimization method provided in this embodiment, the simple undirected graph may be constructed according to each hyperedge in the optimized hypergraph model, so that the constructed simple undirected graph fits the spatial transcriptome data more closely and the clustering result obtained from the simple undirected graph is more accurate.

In this embodiment, the optimized hypergraph model is obtained by dividing the hyperedges whose continuity in the two-dimensional space does not meet the preset condition, and the simple undirected graph is constructed according to the optimized hypergraph model. As a result, both the hypergraph model and the simple undirected graph fit the spatial transcriptome data more closely, the clustering result obtained from them is more accurate, and the accuracy of clustering the spatial transcriptome data is improved.

An embodiment of the present application further provides a possible implementation manner of constructing the simple undirected graph, and fig. 7 is a schematic flow chart of a method for constructing the simple undirected graph according to an embodiment of the present application. As shown in fig. 7, in S103 of the foregoing embodiment, constructing the simple undirected graph according to each hyperedge in the hypergraph model may include:

S701, setting each hyperedge in the hypergraph model as a point in the simple undirected graph.

After the hypergraph model of the preset biological sample is constructed according to step S102, each hyperedge in the hypergraph model may be set to one point in the simple undirected graph.

Fig. 8 is a schematic diagram of a simple undirected graph provided in an embodiment of the present application. The simple undirected graph shown in fig. 8 corresponds to the optimized hypergraph model shown in fig. 6: the hyperedge containing 3 sampling sites in the upper left corner of fig. 6 corresponds to the point in the upper left corner of fig. 8, and the two hyperedges containing 2 sampling sites in the lower right corner and 4 sampling sites at the middle position of fig. 6 correspond to the points at the middle position and in the lower left corner of fig. 8, respectively.

It should be noted that the number of sampling sites included in a hyperedge of the hypergraph model of fig. 6 has no bearing on the number of points in the simple undirected graph of fig. 8: one hyperedge in fig. 6 corresponds to exactly one point in fig. 8, that is, the hyperedges in the hypergraph model correspond one-to-one to the points in the simple undirected graph.

S702, setting the weight of the undirected edge between the two points corresponding to two hyperedges in the simple undirected graph according to the number of common spatial sampling sites between the two hyperedges in the hypergraph model.

After the points of the simple undirected graph are obtained, the weight of the undirected edge between the two points corresponding to two hyperedges can be set according to the number of common spatial sampling sites between the two hyperedges in the hypergraph model.

Taking fig. 6 and fig. 8 as an example, in fig. 6 the hyperedge containing 3 sampling sites in the upper left corner and a hyperedge at the middle position have one common sampling site (a common sampling site is a coincident sampling site, that is, a sampling site located in two different hyperedges at the same time), and the hyperedges corresponding to the middle and lower left points of fig. 8 also have one common sampling site. Therefore, in the simple undirected graph of fig. 8 constructed according to the hypergraph model of fig. 6, the weight of the undirected edge between the upper left point and the middle point is 1, and the weight of the undirected edge between the middle point and the lower left point is also 1 (the weight is determined by the number of common sampling sites; if the number of common sampling sites were 2, the weight would be 2).

It should be noted that if two hyperedges shown in fig. 6 have no common sampling site, the two corresponding points of the simple undirected graph in fig. 8 obtained from fig. 6 are not connected (for example, in fig. 8 the upper right point is connected to neither the middle point nor the lower left point).
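Steps S701 and S702 can be sketched as follows: every hyperedge becomes one point, and two points are joined by an undirected edge whose weight is the number of sampling sites the two hyperedges share. The function name `hyperedges_to_graph`, the dictionary-based graph representation, and the coordinates in the usage lines are hypothetical and only loosely mirror the situation described for fig. 6 and fig. 8.

```python
from itertools import combinations


def hyperedges_to_graph(hyperedges):
    """Build a weighted simple undirected graph from hyperedges.

    hyperedges maps a hyperedge name -> set of sampling sites.
    Returns (points, weights), where weights maps an ordered pair of point
    names to the number of common sampling sites; pairs of hyperedges with
    no common site get no edge.
    """
    points = sorted(hyperedges)
    weights = {}
    for first, second in combinations(points, 2):
        shared = len(hyperedges[first] & hyperedges[second])
        if shared > 0:
            weights[(first, second)] = shared
    return points, weights


# Hypothetical hyperedges: made-up coordinates, not taken from the figures.
example = {
    "top_left": {(0, 0), (0, 1), (1, 1)},
    "middle": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "lower_left": {(2, 1), (3, 0), (3, 1), (4, 0)},
    "isolated": {(0, 5), (0, 6)},
}
points, weights = hyperedges_to_graph(example)
```

In this example the point for `isolated` has no incident edge, because its hyperedge shares no sampling site with any other hyperedge.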

In this embodiment, the hypergraph model is converted into the simple undirected graph, and the points in the simple undirected graph are clustered to obtain the gene expression clustering result of the point clusters. Compared with the hypergraph model, the simple undirected graph contains less data, so performing the cluster analysis on the simple undirected graph reduces the amount of data the computer device must process and increases the speed of the cluster analysis; it also allows the computer device to identify the data more accurately, which improves the accuracy of the cluster analysis.

The embodiment of the application also provides a possible implementation method for obtaining the cell type according to the clustering result, and based on the spatial transcriptome clustering method provided by the embodiment, the spatial transcriptome clustering method provided by the application further comprises the following steps: after the gene expression clustering result of the point clusters is obtained, the marker gene of each point cluster and the cell type corresponding to each point cluster are determined according to the gene expression clustering result of each point cluster, and the method for determining the marker gene and the cell type is specifically shown in fig. 9.

Fig. 9 is a flowchart of a method for obtaining a cell type according to a clustering result according to an embodiment of the present application, where, as shown in fig. 9, the method specifically includes:

S901, obtaining the marker genes of each point cluster by adopting a preset difference analysis algorithm according to the gene expression clustering result of each point cluster.

After the gene expression clustering result of at least one point cluster is obtained according to step S104, a preset difference analysis algorithm may be adopted according to the gene expression clustering result of each point cluster to obtain a marker gene of each point cluster, and the marker gene is used to determine the cell type of the corresponding point cluster.

S902, determining the cell type corresponding to each point cluster according to the marker gene of each point cluster and the corresponding relation between the preset marker gene and the cell type.

In this embodiment, the correspondence between marker genes and cell types may be obtained in advance. After the marker genes of each point cluster are obtained according to step S901, the cell type corresponding to each point cluster may be determined according to the marker genes of that point cluster and the correspondence between marker genes and cell types.

For example, a point cluster whose marker genes include CD3 and CD8A is identified as CD8+ T cells; a point cluster whose marker genes include CD68 is identified as macrophages; and a point cluster whose marker genes include FGF7 and MME is identified as fibroblasts.
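The lookup from marker genes to cell types in step S902 can be sketched with a small table built from the three examples above. The matching rule used here, that every marker listed for a cell type must be present among the marker genes of the point cluster, is an assumption of this sketch, as are the names `CELL_TYPE_MARKERS` and `annotate_cluster`.

```python
# Preset correspondence between marker genes and cell types, taken from the
# examples in the text; a real table would be obtained in advance.
CELL_TYPE_MARKERS = {
    "CD8+ T cell": {"CD3", "CD8A"},
    "macrophage": {"CD68"},
    "fibroblast": {"FGF7", "MME"},
}


def annotate_cluster(marker_genes, table=CELL_TYPE_MARKERS):
    """Return the cell types whose listed markers are all present in marker_genes.

    Assumed rule: a cell type matches only if all of its markers are found.
    """
    present = set(marker_genes)
    return [cell_type for cell_type, required in table.items() if required <= present]
```

Returning a list rather than a single label leaves room for point clusters that match several cell types or none; how such cases are resolved is not specified in the description.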

In this embodiment, the marker genes of each point cluster are determined, and the cell type corresponding to each point cluster is determined according to the marker genes. Compared with the gene expression clustering result alone, the cell type makes the clustering result clearer and makes it convenient for a worker to carry out subsequent processing based on the cell type; for example, the niche of the preset biological sample corresponding to the spatial transcriptome data can be accurately identified through the cell type.

In an alternative embodiment, after obtaining the cell type corresponding to each point cluster, the computer device may further automatically annotate the cell type in the gene expression clustering result, so that a worker may conveniently view the cell type of the space transcriptome data.

The device, equipment and storage medium for executing the spatial transcriptome data clustering method provided in any of the foregoing embodiments of the present application are described below. Their specific implementation processes and technical effects are the same as those of the corresponding method embodiments; for brevity, reference is made to the corresponding content of the method embodiments for anything not described in this part.

An embodiment of the present application further provides a spatial transcriptome data clustering apparatus, and fig. 10 is a schematic structural diagram of the spatial transcriptome data clustering apparatus according to an embodiment of the present application, as shown in fig. 10, where the apparatus includes:

an acquisition module 100 for acquiring spatial transcriptome data of a preset biological sample; wherein the spatial transcriptome data comprises gene expression data of a predetermined biological sample at a plurality of spatial sampling sites, the gene expression data of each spatial sampling site comprising: expression data of multiple genes per spatial sampling site.

The hypergraph model construction module 200 is configured to construct a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites, where the hypergraph model has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and to the spatial continuity information of those sampling sites.

The undirected graph construction module 300 is configured to construct a simple undirected graph according to the hyperedges in the hypergraph model, where each point in the simple undirected graph corresponds to one hyperedge, and the weight of the undirected edge between two points indicates information on the common spatial sampling sites between the corresponding hyperedges.

And the clustering module 400 is used for clustering the points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster.

In one embodiment, the clustering device of the spatial transcriptome data further comprises a data processing module, configured to normalize gene expression data of the plurality of spatial sampling sites such that total expression amounts of all genes in the plurality of spatial sampling sites are the same; calculating the variance of the expression quantity of each gene at a plurality of space sampling sites according to the normalized gene expression data; and selecting a preset number of target genes from the genes according to the variance of the expression quantity of the genes at the spatial sampling sites.

The hypergraph model construction module 200 is further configured to construct a hypergraph model according to expression data of the target gene at a plurality of spatial sampling sites.

In an embodiment, the hypergraph model building module 200 is further configured to binarize the gene expression data of the plurality of spatial sampling sites to obtain binarized data of the plurality of spatial sampling sites; and constructing a hypergraph model according to the binarized data of the plurality of spatial sampling sites.

In an embodiment, the data processing module is further configured to determine, according to the position information of the plurality of spatial sampling sites, the continuity of the spatial sampling sites corresponding to at least one hyperedge in the two-dimensional space where the hypergraph model is located; and to divide each hyperedge whose continuity in the two-dimensional space does not meet the preset condition, so that the continuity of the spatial sampling sites corresponding to each divided hyperedge meets the preset condition, and the optimized hypergraph model is obtained.

The undirected graph construction module 300 is further configured to construct a simple undirected graph according to each hyperedge in the optimized hypergraph model.

In an embodiment, the undirected graph construction module 300 is further configured to set each hyperedge in the hypergraph model as a point in the simple undirected graph, and to set the weight of the undirected edge between the two points corresponding to two hyperedges in the simple undirected graph according to the number of common spatial sampling sites between the two hyperedges in the hypergraph model.

In one embodiment, the clustering device of the spatial transcriptome data further includes a determining module, configured to determine a marker gene of each point cluster and a cell type corresponding to each point cluster according to the gene expression clustering result of each point cluster.

In an embodiment, the determining module is further configured to obtain a marker gene of each point cluster by adopting a preset difference analysis algorithm according to a gene expression clustering result of each point cluster; and determining the cell type corresponding to each point cluster according to the marker gene of each point cluster and the corresponding relation between the preset marker gene and the cell type.

The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), one or more microprocessors, or one or more field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), etc. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

An embodiment of the present application further provides a computer device, and fig. 11 is a schematic structural diagram of the computer device according to an embodiment of the present application, where, as shown in fig. 11, the computer device includes: the system comprises a processor 1, a storage medium 2 and a bus 3, wherein the storage medium stores program instructions executable by the processor, and when the computer device runs, the processor communicates with the storage medium through the bus, and the processor executes the program instructions to execute the steps of the clustering method of the space transcriptome data provided by the embodiment.

An embodiment of the present application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the method for clustering spatial transcriptome data as provided in the above embodiment.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods according to the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, etc.

The foregoing is merely a specific embodiment of the present application, and the protection scope of the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of within the scope of the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A method of clustering spatial transcriptome data, comprising:

2. The method of claim 1, wherein prior to constructing the hypergraph model of the predetermined biological sample from the gene expression data of the plurality of spatial sampling sites, the method further comprises:

3. The method of claim 1, wherein constructing a hypergraph model of the predetermined biological sample from the gene expression data of the plurality of spatial sampling sites comprises:

4. The method of claim 1, wherein prior to constructing a simple undirected graph from the hyperedges in the hypergraph model, the method further comprises:

5. The method of claim 1, wherein constructing a simple undirected graph from the hyperedges in the hypergraph model comprises:

6. The method according to claim 1, wherein the method further comprises:

7. The method of claim 6, wherein determining the marker gene of each of the clusters and the cell type corresponding to each of the clusters based on the gene expression clustering result of each of the clusters comprises:

8. A spatial transcriptome data clustering apparatus, comprising:

the hypergraph model construction module is used for constructing a hypergraph model of the preset biological sample according to the gene expression data of the plurality of spatial sampling sites, wherein the hypergraph model has: at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling sites;

9. A computer device, comprising: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating via the bus when the computer device is running, the processor executing the program instructions to perform the steps of the method of clustering spatial transcriptome data according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the method of clustering spatial transcriptome data according to any one of claims 1 to 7.